back to index

Why should anyone care about Evals? — Manu Goyal, Braintrust


Whisper Transcript | Transcript Only Page

00:00:00.000 | Manu Reviewer: Peter van de Ven
00:00:07.000 | All right, who is excited about evals?
00:00:23.320 | All right, what can I do to get those juices flowing?
00:00:28.320 | I'm Manu, and I work at Braintrust,
00:00:31.600 | where we build a platform to do evals and a whole bunch of other stuff.
00:00:36.320 | So I was thinking we could just start
00:00:38.640 | by talking a little bit about my own personal evals journey.
00:00:43.400 | Now, you might see this picture and say,
00:00:45.840 | "Oh, what an adorable little boy absorbed in his Nintendo 64 video game."
00:00:51.840 | But if you look a little closer,
00:00:53.840 | you'll see a boy who's deeply disappointed,
00:00:56.640 | with the state of technology in his society.
00:00:59.920 | Because this boy, he knows that technology is not meant to be shackled
00:01:05.040 | to the constraints of rule-based systems,
00:01:07.920 | doomed to do the same thing over and over and over.
00:01:11.360 | No, technology is meant to come alive, to grow and adapt,
00:01:15.920 | and really be a thought partner to mankind.
00:01:19.200 | So I knew this in this moment,
00:01:21.200 | which is why I decided to devote my career
00:01:24.400 | to being a software engineer in the AI industry.
00:01:28.200 | And so I dropped the Nintendo,
00:01:31.200 | and I started grinding away on Leet Code.
00:01:33.200 | And soon enough, I landed a job in the self-driving car industry.
00:01:38.880 | Now, we can all learn a lot about self-driving cars,
00:01:43.280 | but the thing I took away was that you can spend all day tuning the model,
00:01:48.960 | changing the architecture, adjusting the loss function, all good stuff,
00:01:54.160 | but it's never going to be enough for you to actually ship it to production.
00:01:59.200 | I can't say, "Oh, my image classification rate went from 98% to 99%, put it on the road."
00:02:07.920 | We need some way to contextualize this model and understand
00:02:12.880 | if it actually works for our real-world application.
00:02:16.000 | You know, does it avoid pedestrians?
00:02:18.400 | Does it negotiate traffic scenarios appropriately?
00:02:22.480 | Does it obey the law?
00:02:24.080 | All this stuff we actually need to understand.
00:02:27.040 | And how we're going to do that is with evals.
00:02:30.640 | Now, you know, the whole point here is, you know,
00:02:33.360 | evals aren't just unit tests for AI.
00:02:35.920 | They're not just for finding regressions, right?
00:02:38.640 | If I didn't have evals, the only way I can get any signal on my changes
00:02:43.520 | is by shipping it to prod and then getting signal, you know, in the real world.
00:02:48.640 | But that's expensive, it's slow, and ultimately, it's pretty risky.
00:02:53.360 | So what do evals do is it's kind of like if you invest in good evals, you're kind of building
00:02:59.680 | a laboratory that lets you run experiments to your heart's content and do 90% of the product
00:03:06.960 | iteration loop before going to prod, and then now you can ship much more quickly, much more confidently.
00:03:14.080 | Now, furthermore, if you actually apply the same metrics from offline to your online production
00:03:23.600 | data, you now have data-driven signal about which examples in prod are going to be most useful
00:03:29.680 | for that next iteration loop.
00:03:31.920 | And so, with all of this knowledge, my evals journey had completed and I transformed from
00:03:39.040 | this guy to this guy.
00:03:40.800 | So, success.
00:03:42.720 | Now, if this heartfelt childhood story isn't enough to do it for you, you don't have to take
00:03:50.000 | my word.
00:03:50.480 | You can take the words of all of these tech luminaries.
00:03:54.240 | We have Kevin Weil, Gary Tan, Mike Krieger, Greg Brockman, all extolling the virtues
00:04:02.880 | and the necessities of evals.
00:04:05.280 | And surely, if they're all saying it, there's got to be something to it.
00:04:09.280 | It can't be a total scam.
00:04:11.520 | So, there's got to be some, there's got to be something worth checking out here.
00:04:18.720 | So, with all that buzz, I made my way to Braintrust where our goal is to sort of build
00:04:24.480 | the dev platform to, of course, let you do evals but also do all the things that go along with it.
00:04:31.040 | So, that involves, you know, tweaking prompts and experimenting in the playground.
00:04:35.680 | It involves logging data and sort of getting the observability component and kind of connecting all
00:04:42.320 | those together in this beautiful data flywheel so that we can, we can let you build the data flywheel
00:04:49.360 | to let your AI dreams come true.
00:04:52.560 | Because that's really what, what we're here for.
00:04:55.680 | Now, I know this was a dense and content heavy presentation.
00:05:01.920 | So, I'll try to distill it with one simple message.
00:05:06.800 | which is that the key to industry transformation.
00:05:10.320 | The key to success.
00:05:12.240 | It's evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, all right, thank you, please join the evals track, Golden Gate Ballroom B, I'll see you there.
00:05:33.600 | Thank you.