The Future of Evals - Ankur Goyal, Braintrust

00:00:02.000 | - Awesome.

00:00:22.760 | So today, we're gonna talk a little bit about evals to date

00:00:26.440 | and where we think evals are gonna be going in the future.

00:00:29.840 | Also, for those of you who saw my brother earlier,

00:00:35.000 | I'm gonna do my best to live up to his energy and charisma.

00:00:39.300 | But, yeah, you know, it's been an amazing almost two-year journey

00:00:46.000 | for us at Braintrust.

00:00:47.200 | We have had the opportunity to work with some of the most amazing companies

00:00:51.180 | building, I think, the best AI products in the world.

00:00:55.200 | I'm blown away by how many evals people actually run

00:00:59.180 | on the product.

00:01:00.180 | The average org that signs up for Braintrust runs almost 13 evals a day.

00:01:05.340 | Some of our customers run more than 3,000 evals a day.

00:01:10.340 | And some of the most advanced companies that are running evals

00:01:14.840 | are spending more than two hours in the product every day

00:01:18.340 | working through their evals.

00:01:20.340 | And I think one of the things that stands out to me is,

00:01:24.020 | while we have customers building some of the coolest, most automated,

00:01:29.180 | AI-based products and agents in the world, evals are such a manual process.

00:01:35.180 | To date, every time you run an eval, the best thing you can do is look at a dashboard,

00:01:42.340 | and I think we have a pretty cool dashboard in Braintrust.

00:01:44.340 | But still, it's just a dashboard that you look at, and you walk away and think,

00:01:48.340 | "Okay, what changes can I make to my code or to my prompts so that this eval does better?"

00:01:55.500 | And I actually think that is all going to change.

00:01:59.500 | So today, I'm excited to talk about something called Loop.

00:02:03.540 | Loop is an agent that we've been working on for some time now that's built into Braintrust.

00:02:08.660 | And it's actually only possible because of evals.

00:02:12.500 | Every quarter for the last two years, we've run evals on the frontier models

00:02:17.500 | to see how good they are at actually improving prompts, improving datasets, and improving scorers.

00:02:24.000 | And until very, very recently, they actually weren't very good.

00:02:27.240 | In fact, we think that Claude 4, in particular, was a real breakthrough moment,

00:02:32.280 | and it performs almost six times better than the previous leading model.

00:02:39.040 | So Loop runs inside of Braintrust, and it can automatically optimize your prompts

00:02:44.780 | all the way to very complex agents.

00:02:48.100 | But just as importantly, it also helps you build better datasets and better scorers

00:02:52.940 | because it's really the combination of these three things

00:02:55.740 | that makes for really great evals.
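
[Editor's note] To make the "three things" concrete, here is a minimal sketch of how a task (standing in for a prompt plus a model call), a dataset, and a scorer fit together in one eval. It is not Braintrust's API: `Case`, `exact_match`, `run_eval`, and `toy_task` are hypothetical names chosen for illustration, and the toy task just uppercases its input so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Case:
    """One dataset row: an input and the output we expect for it."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 when output equals expected (ignoring case and outer whitespace), else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: Sequence[Case],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every case and return the mean score (0.0 for an empty dataset)."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores) if scores else 0.0


def toy_task(text: str) -> str:
    """Hypothetical stand-in for a prompt plus a model call."""
    return text.upper()


dataset = [
    Case("hello", "HELLO"),
    Case("eval", "EVAL"),
    Case("loop", "Loop!"),  # deliberately mismatched expectation
]

score = run_eval(toy_task, dataset, exact_match)
print(f"mean score: {score:.2f}")
```

Changing any one of the three pieces (the task, the dataset rows, or the scorer) changes the result, which is why improving only the prompt is rarely enough.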

00:03:00.780 | This is a little preview of the UI.

00:03:03.620 | You can actually start using it today if you are an existing Braintrust user

00:03:07.280 | or you sign up for the product.

00:03:09.000 | There's a feature flag that you can just flip on called Loop

00:03:11.500 | and start using it right away.

00:03:13.780 | By default, it uses Cloud 4, but you can actually pick any model that you have access to

00:03:18.900 | and start using it, whether it's an OpenAI model, a Gemini model, or maybe some of you are building your own LLMs.

00:03:24.700 | You can use those as well.

00:03:27.360 | And as you can see, it runs directly inside of Braintrust.

00:03:30.740 | One of the things that we learned from working with a lot of users is how important it is to actually look at data

00:03:38.080 | and look at prompts while you're working with them, and we didn't want that to go away when we introduced Loop.

00:03:44.740 | So every time it suggests an edit to your data or it suggests a new idea for scoring or it suggests an edit to one of your prompts,

00:03:52.400 | you can actually see that side-by-side directly in the UI.

00:03:55.780 | Of course, for the more adventurous among you, there's also a toggle that you can turn on that says, like, just go for it, and it will go and optimize away, which actually works really well.
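
[Editor's note] The talk does not describe how Loop works internally, so the following is only a sketch of the general idea under stated assumptions: candidate prompt edits are scored by an eval, and an improvement is kept either after a reviewer approves it or automatically when a "just go for it" style toggle is on. The names `optimize`, `auto_apply`, `approve`, and `keyword_score` are hypothetical, and the keyword-based scoring function is a toy substitute for running a real eval.

```python
from typing import Callable, Iterable, Tuple


def optimize(
    baseline: str,
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    auto_apply: bool = False,
    approve: Callable[[str, str], bool] = lambda old, new: False,
) -> Tuple[str, float]:
    """Keep the best-scoring prompt, accepting a change only if it improves the score.

    With auto_apply=False, each improvement must also pass the approve(old, new)
    callback, which stands in for a human reviewing the edit side by side.
    """
    best, best_score = baseline, evaluate(baseline)
    for candidate in candidates:
        candidate_score = evaluate(candidate)
        if candidate_score <= best_score:
            continue  # not an improvement, discard
        if auto_apply or approve(best, candidate):
            best, best_score = candidate, candidate_score
    return best, best_score


def keyword_score(prompt: str) -> float:
    """Toy eval: fraction of required keywords that appear in the prompt."""
    required = ("concise", "json")
    text = prompt.lower()
    return sum(word in text for word in required) / len(required)


candidates = [
    "Be concise.",
    "Be concise and reply in JSON.",
    "Reply politely.",
]

auto_best, auto_score = optimize(
    "Answer the question.", candidates, keyword_score, auto_apply=True
)
manual_best, manual_score = optimize(
    "Answer the question.", candidates, keyword_score
)
print(auto_best, auto_score)
print(manual_best, manual_score)
```

With the default `approve` callback rejecting every edit, the second call returns the baseline unchanged, which mirrors keeping a human in the loop until the toggle is switched on.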

00:04:10.440 | So just to recap, to date, evals have been a critical part of building some of the best AI products in the world, but the task of actually doing evaluation has been incredibly manual.

00:04:23.100 | And I'm excited about how, over the next year, evals themselves are going to be completely revolutionized by the latest and greatest that's coming out from the frontier models themselves.

00:04:34.760 | And we're very excited to incorporate that into Braintrust.

00:04:37.760 | Please, if you're not already using the product, try it out.

00:04:40.760 | Try out Loop.

00:04:41.760 | Give us your feedback.

00:04:42.760 | We have a lot of work to do, and we'd love to talk to you.

00:04:46.260 | We're also hiring.

00:04:47.260 | So if you're interested in working on this kind of problem, whether it's the UI part of it, the AI part of it, or the infrastructure side of it, we'd love to talk to you.

00:04:56.260 | You can scan this QR code.

00:04:58.260 | It should be over there.

00:05:00.260 | Yeah.

00:05:01.260 | You can scan the QR code and get in touch with us.

00:05:03.260 | We'd love to chat.

00:05:04.260 | Thank you.
