The Future of Evals - Ankur Goyal, Braintrust

Chapters
0:00 Introduction to AI Engineer World's Fair
0:15 Speaker Introduction: Ankur Goyal, CEO of Braintrust
0:22 The Future of Evals
0:30 Increasing Adoption of Eval
1:58 Introducing Loop
4:09 Call to Action: Try Loop and Join the Team
00:00:22.760 |
So today, we're gonna talk a little bit about evals to date 00:00:26.440 |
and where we think evals are gonna be going in the future. 00:00:29.840 |
Also, for those of you who saw my brother earlier, 00:00:35.000 |
I'm gonna do my best to live up to his energy and charisma. 00:00:39.300 |
But, yeah, you know, it's been an amazing almost two-year journey 00:00:47.200 |
We have had the opportunity to work with some of the most amazing companies 00:00:51.180 |
building, I think, the best AI products in the world. 00:00:55.200 |
I'm blown away by how many evals people actually run 00:01:00.180 |
The average org that signs up for Braintrust runs almost 13 evals a day. 00:01:05.340 |
Some of our customers run more than 3,000 evals a day. 00:01:10.340 |
And some of the most advanced companies that are running evals 00:01:14.840 |
are spending more than two hours in the product every day 00:01:20.340 |
And I think one of the things that stands out to me is, 00:01:24.020 |
while we have customers building some of the coolest, most automated, 00:01:29.180 |
AI-based products and agents in the world, evals are such a manual process. 00:01:35.180 |
To date, every time you run an eval, the best thing you can do is look at a dashboard, 00:01:42.340 |
and I think we have a pretty cool dashboard in Braintrust. 00:01:44.340 |
But still, it's just a dashboard that you look at, and you walk away and think, 00:01:48.340 |
"Okay, what changes can I make to my code or to my prompts so that this eval does better?" 00:01:55.500 |
And I actually think that is all going to change. 00:01:59.500 |
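The manual workflow described above (run an eval, read the scores, then decide what to change) can be sketched in a few lines. This is a minimal illustration, not Braintrust's SDK: `Case`, `exact_match`, `run_eval`, and `toy_task` are hypothetical names, and `toy_task` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Case:
    """One dataset row: an input and the output we expect for it."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 when output and expected match (ignoring case and outer whitespace), else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: Sequence[Case],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every case and return the mean score (0.0 for an empty dataset)."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores) if scores else 0.0


def toy_task(text: str) -> str:
    """Hypothetical stand-in for an LLM call; a real task would send a prompt to a model."""
    return text.upper()


DATASET = [Case("hi", "HI"), Case("eval", "EVAL"), Case("loop", "Loop!")]

if __name__ == "__main__":
    # The number a person would read off a dashboard before deciding what to change.
    print(f"mean score: {run_eval(toy_task, DATASET, exact_match):.2f}")
```

The three pieces here (the task or prompt, the dataset, and the scorer) are the same three things the talk goes on to say an agent can help improve.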
So today, I'm excited to talk about something called Loop. 00:02:03.540 |
Loop is an agent that we've been working on for some time now that's built into Braintrust. 00:02:08.660 |
And it's actually only possible because of evals. 00:02:12.500 |
Every quarter for the last two years, we've run evals on the frontier models 00:02:17.500 |
to see how good they are at actually improving prompts, improving datasets, and improving scorers. 00:02:24.000 |
And until very, very recently, they actually weren't very good. 00:02:27.240 |
In fact, we think that Claude 4, in particular, was a real breakthrough moment, 00:02:32.280 |
and it performs almost six times better than the previous leading model. 00:02:39.040 |
So Loop runs inside of Braintrust, and it can automatically optimize your prompts 00:02:48.100 |
But just as importantly, it also helps you build better datasets and better scorers 00:02:52.940 |
because it's really the combination of these three things 00:03:03.620 |
You can actually start using it today if you are an existing Braintrust user 00:03:09.000 |
There's a feature flag that you can just flip on called Loop 00:03:13.780 |
By default, it uses Claude 4, but you can actually pick any model that you have access to 00:03:18.900 |
and start using it, whether it's an OpenAI model, a Gemini model, or maybe some of you are building your own LLMs. 00:03:27.360 |
And as you can see, it runs directly inside of Braintrust. 00:03:30.740 |
One of the things that we learned from working with a lot of users is how important it is to actually look at data 00:03:38.080 |
and look at prompts while you're working with them, and we didn't want that to go away when we introduced Loop. 00:03:44.740 |
So every time it suggests an edit to your data or it suggests a new idea for scoring or it suggests an edit to one of your prompts, 00:03:52.400 |
you can actually see that side-by-side directly in the UI. 00:03:55.780 |
Of course, for the more adventurous among you, there's also a toggle that you can turn on that says, like, just go for it, and it will go and optimize away, which actually works really well. 00:04:10.440 |
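The two modes described here, reviewing each suggested edit versus letting the agent "just go for it," can be sketched as a simple accept-if-better loop. This is an assumption-laden illustration and not how Loop is implemented: `optimize_prompt`, `approve`, `toy_evaluate`, and the candidate prompts are all hypothetical, and in practice the candidates would come from a model and `evaluate` would run a real eval.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical scoring criteria for the toy evaluator below.
REQUIRED_PHRASES = ("concise", "cite sources")


def toy_evaluate(prompt: str) -> float:
    """Stand-in for running an eval: fraction of required phrases present in the prompt."""
    lowered = prompt.lower()
    return sum(phrase in lowered for phrase in REQUIRED_PHRASES) / len(REQUIRED_PHRASES)


def optimize_prompt(
    base_prompt: str,
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    approve: Optional[Callable[[str, str], bool]] = None,
) -> Tuple[str, float]:
    """Keep a candidate only if it scores strictly higher than the current best.

    approve=None models the "just go for it" toggle: improvements are applied automatically.
    Otherwise approve(current, candidate) models a human reviewing the side-by-side diff.
    """
    best_prompt, best_score = base_prompt, evaluate(base_prompt)
    for candidate in candidates:
        score = evaluate(candidate)
        if score <= best_score:
            continue  # no improvement, discard the suggestion
        if approve is not None and not approve(best_prompt, candidate):
            continue  # reviewer rejected the suggested edit
        best_prompt, best_score = candidate, score
    return best_prompt, best_score


BASE = "Answer the question."
CANDIDATES = [
    "Answer the question. Be concise.",
    "Answer the question. Be concise and cite sources.",
]

if __name__ == "__main__":
    # Automatic mode: every improving edit is accepted.
    print(optimize_prompt(BASE, CANDIDATES, toy_evaluate))
    # Review mode with a reviewer who rejects everything: the base prompt is kept.
    print(optimize_prompt(BASE, CANDIDATES, toy_evaluate, approve=lambda old, new: False))
```

Gating every change on the eval score is what makes the automatic mode reasonable: an edit is applied only when the measured result improves.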
So just to recap, to date, evals have been a critical part of building some of the best AI products in the world, but the task of actually doing evaluation has been incredibly manual. 00:04:23.100 |
And I'm excited about how, over the next year, evals themselves are going to be completely revolutionized by the latest and greatest that's coming out from the frontier models themselves. 00:04:34.760 |
And we're very excited to incorporate that into Braintrust. 00:04:37.760 |
Please, if you're not already using the product, try it out. 00:04:42.760 |
We have a lot of work to do, and we'd love to talk to you. 00:04:47.260 |
So if you're interested in working on this kind of problem, whether it's the UI part of it, the AI part of it, or the infrastructure side of it, we'd love to talk to you. 00:05:01.260 |
You can scan the QR code and get in touch with us.