Why should anyone care about Evals? — Manu Goyal, Braintrust

00:00:23.320 |
All right, what can I do to get those juices flowing? 00:00:31.600 |
where we build a platform to do evals and a whole bunch of other stuff. 00:00:38.640 |
by talking a little bit about my own personal evals journey. 00:00:45.840 |
"Oh, what an adorable little boy absorbed in his Nintendo 64 video game." 00:00:59.920 |
Because this boy, he knows that technology is not meant to be shackled 00:01:07.920 |
doomed to do the same thing over and over and over. 00:01:11.360 |
No, technology is meant to come alive, to grow and adapt, 00:01:24.400 |
to being a software engineer in the AI industry. 00:01:33.200 |
And soon enough, I landed a job in the self-driving car industry. 00:01:38.880 |
Now, we can all learn a lot about self-driving cars, 00:01:43.280 |
but the thing I took away was that you can spend all day tuning the model, 00:01:48.960 |
changing the architecture, adjusting the loss function, all good stuff, 00:01:54.160 |
but it's never going to be enough for you to actually ship it to production. 00:01:59.200 |
I can't say, "Oh, my image classification rate went from 98% to 99%, put it on the road." 00:02:07.920 |
We need some way to contextualize this model and understand 00:02:12.880 |
if it actually works for our real-world application. 00:02:18.400 |
Does it negotiate traffic scenarios appropriately? 00:02:24.080 |
All this stuff we actually need to understand. 00:02:27.040 |
And how we're going to do that is with evals. 00:02:30.640 |
Now, the whole point here is, 00:02:35.920 |
They're not just for finding regressions, right? 00:02:38.640 |
If I didn't have evals, the only way I can get any signal on my changes 00:02:43.520 |
is by shipping it to prod and then getting signal, you know, in the real world. 00:02:48.640 |
But that's expensive, it's slow, and ultimately, it's pretty risky. 00:02:53.360 |
So what evals do is, if you invest in good evals, you're kind of building 00:02:59.680 |
a laboratory that lets you run experiments to your heart's content and do 90% of the product 00:03:06.960 |
iteration loop before going to prod, and then now you can ship much more quickly, much more confidently. 00:03:14.080 |
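[Editor's sketch, not from the talk] The "laboratory" idea can be illustrated as a tiny offline eval loop: a fixed dataset, a scoring function, and two versions of a task compared before anything ships. Everything here (`Example`, `exact_match`, `run_eval`, the toy `baseline` and `candidate` tasks, and the data) is hypothetical and is not Braintrust's API; the stand-in tasks are lookup tables where a real system would call a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One eval case: an input and the answer we expect (hypothetical data)."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on a case-insensitive match after trimming, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: List[Example],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every example and return the mean score."""
    scores = [scorer(task(ex.input), ex.expected) for ex in dataset]
    return sum(scores) / len(scores)


DATASET = [
    Example("2 + 2", "4"),
    Example("capital of France", "Paris"),
    Example("3 * 3", "9"),
]


def baseline(prompt: str) -> str:
    """Stand-in for the current system: one wrong answer, one missing."""
    answers = {"2 + 2": "4", "capital of France": "Lyon"}
    return answers.get(prompt, "")


def candidate(prompt: str) -> str:
    """Stand-in for a proposed change to the system."""
    answers = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "9"}
    return answers.get(prompt, "")


if __name__ == "__main__":
    print("baseline :", run_eval(baseline, DATASET, exact_match))
    print("candidate:", run_eval(candidate, DATASET, exact_match))
```

Comparing the two mean scores on the same dataset gives a signal on the change without shipping it, which is the experiment loop described above.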
Now, furthermore, if you actually apply the same metrics from offline to your online production 00:03:23.600 |
data, you now have data-driven signal about which examples in prod are going to be most useful 00:03:31.920 |
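[Editor's sketch, not from the talk] One way to read this point: if a metric needs no reference answer, the same function can score logged production outputs, and the lowest-scoring records become candidates for review. Treating "lowest score" as "most useful" is an assumption of this sketch, and `LogRecord`, `conciseness`, and `lowest_scoring` are hypothetical names, not part of any real SDK.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LogRecord:
    """A logged production request and the output that was returned."""
    input: str
    output: str


def conciseness(output: str, max_words: int = 20) -> float:
    """Reference-free metric: 1.0 within the word budget, lower beyond it.

    An empty output scores 0.0.
    """
    words = len(output.split())
    if words == 0:
        return 0.0
    return min(1.0, max_words / words)


def lowest_scoring(
    logs: List[LogRecord],
    scorer: Callable[[str], float],
    k: int,
) -> List[LogRecord]:
    """Return the k records with the lowest score, worst first."""
    return sorted(logs, key=lambda record: scorer(record.output))[:k]


LOGS = [
    LogRecord("question-a", "A short and direct answer."),
    LogRecord("question-b", ""),
    LogRecord("question-c", "word " * 40),
]

if __name__ == "__main__":
    for record in lowest_scoring(LOGS, conciseness, k=2):
        print(record.input, conciseness(record.output))
```

Because the scorer is an ordinary function of the output, the same definition can be used in an offline eval and over production logs, so scores from the two settings are comparable.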
And so, with all of this knowledge, my evals journey had completed and I transformed from 00:03:42.720 |
Now, if this heartfelt childhood story isn't enough to do it for you, you don't have to take 00:03:50.480 |
You can take the words of all of these tech luminaries. 00:03:54.240 |
We have Kevin Weil, Garry Tan, Mike Krieger, Greg Brockman, all extolling the virtues 00:04:05.280 |
And surely, if they're all saying it, there's got to be something to it. 00:04:11.520 |
So, there's got to be something worth checking out here. 00:04:18.720 |
So, with all that buzz, I made my way to Braintrust where our goal is to sort of build 00:04:24.480 |
the dev platform to, of course, let you do evals but also do all the things that go along with it. 00:04:31.040 |
So, that involves, you know, tweaking prompts and experimenting in the playground. 00:04:35.680 |
It involves logging data and sort of getting the observability component and kind of connecting all 00:04:42.320 |
those together in this beautiful data flywheel so that we can let you build the data flywheel 00:04:52.560 |
Because that's really what we're here for. 00:04:55.680 |
Now, I know this was a dense and content heavy presentation. 00:05:01.920 |
So, I'll try to distill it with one simple message. 00:05:06.800 |
which is that the key to industry transformation. 00:05:12.240 |
It's evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, evals, all right, thank you, please join the evals track, Golden Gate Ballroom B, I'll see you there.