
Why should anyone care about Evals? — Manu Goyal, Braintrust


Transcript

Reviewer: Peter van de Ven

Manu: All right, who is excited about evals? All right, what can I do to get those juices flowing? I'm Manu, and I work at Braintrust, where we build a platform to do evals and a whole bunch of other stuff. So I was thinking we could just start by talking a little bit about my own personal evals journey.

Now, you might see this picture and say, "Oh, what an adorable little boy absorbed in his Nintendo 64 video game." But if you look a little closer, you'll see a boy who's deeply disappointed with the state of technology in his society. Because this boy, he knows that technology is not meant to be shackled to the constraints of rule-based systems, doomed to do the same thing over and over and over.

No, technology is meant to come alive, to grow and adapt, and really be a thought partner to mankind. So I knew this in this moment, which is why I decided to devote my career to being a software engineer in the AI industry. And so I dropped the Nintendo, and I started grinding away on LeetCode.

And soon enough, I landed a job in the self-driving car industry. Now, we can all learn a lot about self-driving cars, but the thing I took away was that you can spend all day tuning the model, changing the architecture, adjusting the loss function, all good stuff, but it's never going to be enough for you to actually ship it to production.

I can't say, "Oh, my image classification rate went from 98% to 99%, put it on the road." We need some way to contextualize this model and understand if it actually works for our real-world application. You know, does it avoid pedestrians? Does it negotiate traffic scenarios appropriately? Does it obey the law?

All this stuff we actually need to understand, and the way we're going to do that is with evals. Now, the whole point here is that evals aren't just unit tests for AI. They're not just for finding regressions, right? If I didn't have evals, the only way I could get any signal on my changes would be by shipping them to prod and getting that signal in the real world.

But that's expensive, it's slow, and ultimately, it's pretty risky. So what evals do is this: if you invest in good evals, you're building a laboratory that lets you run experiments to your heart's content and do 90% of the product iteration loop before going to prod. Then you can ship much more quickly and much more confidently.

Now, furthermore, if you actually apply the same metrics from offline to your online production data, you now have a data-driven signal about which examples in prod are going to be most useful for that next iteration loop. And so, with all of this knowledge, my evals journey was complete, and I transformed from this guy to this guy.
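As an editorial illustration of the loop described above, here is a minimal sketch in plain Python. It assumes a toy, reference-free metric and hypothetical helper names (`brevity_score`, `run_offline_eval`, `lowest_scoring`); none of these come from the talk or from any Braintrust API. The same scoring function is used twice: once to compare variants offline on a fixed dataset, and once to rank logged production outputs so the weakest ones can feed the next iteration.

```python
from statistics import mean
from typing import Callable, Dict, List


def brevity_score(output: str, max_words: int = 12) -> float:
    """Toy metric (an assumption for illustration): 1.0 when the output
    has at most max_words words, shrinking toward 0 as it runs longer."""
    n = len(output.split())
    return 1.0 if n <= max_words else max_words / n


def run_offline_eval(
    task: Callable[[str], str],
    inputs: List[str],
    scorer: Callable[[str], float],
) -> float:
    """Offline 'laboratory': run a candidate task over a fixed set of
    inputs and return the average score."""
    return mean(scorer(task(inp)) for inp in inputs)


def lowest_scoring(
    logs: List[Dict[str, str]],
    scorer: Callable[[str], float],
    k: int = 5,
) -> List[Dict[str, str]]:
    """Online use: apply the same scorer to logged production outputs
    and return the k records with the lowest scores."""
    ranked = sorted(logs, key=lambda record: scorer(record["output"]))
    return ranked[:k]
```

In practice the scorer would measure something the product actually cares about; the only point of the sketch is that one metric definition can serve both the offline comparison and the selection of production examples.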

So, success. Now, if this heartfelt childhood story isn't enough to do it for you, you don't have to take my word. You can take the words of all of these tech luminaries. We have Kevin Weil, Garry Tan, Mike Krieger, Greg Brockman, all extolling the virtues and the necessity of evals.

And surely, if they're all saying it, there's got to be something to it. It can't be a total scam. So, there's got to be something worth checking out here. So, with all that buzz, I made my way to Braintrust, where our goal is to build the dev platform to, of course, let you do evals, but also do all the things that go along with it.

So, that involves tweaking prompts and experimenting in the playground. It involves logging data and getting the observability component, and connecting all of those together in this beautiful data flywheel that lets your AI dreams come true.

Because that's really what we're here for. Now, I know this was a dense and content-heavy presentation, so I'll try to distill it into one simple message, which is that the key to industry transformation, the key to success, is evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals. All right, thank you. Please join the evals track, Golden Gate Ballroom B. I'll see you there.

Thank you.