Five Hard-Earned Lessons About Evals — Ankur Goyal, Braintrust

Chapters
0:00 Introduction to 5 Lessons in AI Product Development
0:19 Lesson 1: Effective Evals Speak for Themselves
2:09 Lesson 2: Great Evals Need to Be Intentionally Engineered
4:03 Lesson 3: Context Engineering is the New Prompt Engineering
6:37 Lesson 4: Be Prepared for a New Model to Change Everything
9:09 Lesson 5: Optimize the Entire Evaluation System, Not Just the Prompts
12:21 Recap of the Five Lessons
00:00:00.000 |
Let's talk about some of the interesting things we've learned over time. So the first thing is, 00:00:20.680 |
I think it's super important for you to understand and define whether evals are actually providing 00:00:27.360 |
value for your organization or not. And I've tried to come up with three signs of success that you should 00:00:33.100 |
look for. So the first is: if a new model comes out, you should be prepared 00:00:42.480 |
via your evals to be able to launch an update to your product within 24 hours that incorporates 00:00:48.200 |
the new model. Sarah from Notion, who talked yesterday, spoke about this specifically: 00:00:54.680 |
for the past several model releases, every time something comes out, Notion's been able to 00:00:59.540 |
incorporate the new model within 24 hours. And I think that's a really good sign of success. 00:01:04.340 |
If you can't do that, then it means that you have some work to do on your evals. 00:01:11.140 |
Another sign of success is if a user complains about something, do you have a very clear and 00:01:16.480 |
straightforward path to take their complaint and add it into your evals? If you do, then you have 00:01:22.220 |
a shot at actually incorporating user feedback, pulling it into your evals, and ultimately doing 00:01:27.980 |
it better. If you don't, then you're going to lose a lot of valuable information into the ether. So again, 00:01:33.040 |
I think this is a really important kind of threshold or milestone to hit. And the last one, which I'm actually 00:01:40.220 |
going to talk about a little bit more throughout the presentation is you should really start using 00:01:45.080 |
evals to play offense and understand which use cases you can solve and how well you can solve them 00:01:51.260 |
before you actually ship things, not like unit tests, which allow you to just test for regressions. 00:01:56.360 |
So if you really adopt evals, then I think before you launch a new product, you have a really 00:02:02.700 |
good idea of how well the product might work given what your evals say. 00:02:10.760 |
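To make that second sign concrete, here is a rough sketch of what a "complaint into the evals" path can look like (the record shape, function name, and file path are hypothetical, not a prescribed Braintrust workflow):

```typescript
import { appendFileSync } from "node:fs";

// Hypothetical shape of a user complaint captured in production
// (these field names are illustrative, not a Braintrust schema).
interface Complaint {
  input: string;  // what the user asked
  output: string; // what the product answered
  note: string;   // why the user was unhappy
}

// Append the complaint to a JSONL file that the eval harness reads as a
// dataset, so the failing case becomes a permanent test, not a lost ticket.
function addComplaintToEvalDataset(c: Complaint, path = "eval-dataset.jsonl") {
  const record = {
    input: c.input,
    // The bad production output is kept as metadata; a human still decides
    // what the *expected* answer should be before the case is trusted.
    expected: null,
    metadata: { badOutput: c.output, userNote: c.note, source: "user-feedback" },
  };
  appendFileSync(path, JSON.stringify(record) + "\n");
}
```

The key property is simply that the path exists and is cheap to use, so feedback lands in a dataset instead of the ether.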
The second lesson is that great evals, they have to be engineered. They don't just come for free 00:02:17.360 |
with synthetic data sets and random LLM-as-a-judge scores that you read about online. And I think 00:02:25.280 |
there's maybe two ways of thinking about this. There's no data set that is perfectly aligned with 00:02:30.740 |
reality. In the cases where there is one, there's basically nothing to do and the use 00:02:35.920 |
case already works; there are a few use cases that are kind of like that, like solving competition math 00:02:41.360 |
problems, for example. But for most real-world use cases, any data set that you can come up with ahead 00:02:46.760 |
of time is not going to represent what users are actually experiencing. And I think the best data sets 00:02:52.640 |
are those that you can continuously reconcile as you actually experience what happens in reality. 00:02:58.100 |
And doing that well requires quite a bit of engineering. Of course, brain trust can help you 00:03:03.200 |
with that. But I think the point is, you have to think about a data set as an engineering problem, 00:03:08.240 |
not just something that's given to you. And the same is true with scorers. I think a lot of people we talk 00:03:14.960 |
to ask, "Hey, what scorers does brain trust come with? And how can we use those so that we don't 00:03:20.420 |
need to think about scoring?" And we actually have a really powerful open source library called auto 00:03:25.880 |
evals. But it's very open source and flexible for a reason, which is that every company that we work 00:03:32.480 |
with that's sufficiently advanced is writing their own scoring functions and modifying them constantly. And 00:03:38.720 |
I think one way to think about scorers is they're like a spec or like a PRD for your AI application. 00:03:45.180 |
And if you think about them that way, one, it actually justifies making an investment in scoring 00:03:50.820 |
beyond just using something off the shelf. And two, hopefully it's fairly obvious that if you just use, 00:03:55.660 |
you know, an open source or generic scorer, that's a spec for someone else's project, not yours. 00:04:00.720 |
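To make the "scorer as spec" idea concrete, here is a minimal hypothetical scorer; the requirement it encodes and the argument shape are illustrative, not part of autoevals:

```typescript
// Hypothetical custom scorer: encodes a product requirement
// ("answers must cite at least one source and stay under 200 words")
// rather than a generic off-the-shelf metric.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
}

interface Score {
  name: string;
  score: number; // 0..1
  metadata?: Record<string, unknown>;
}

export function citedAndConcise({ output }: ScorerArgs): Score {
  const hasCitation = /\[\d+\]|https?:\/\//.test(output); // crude citation check
  const wordCount = output.trim().split(/\s+/).length;
  const concise = wordCount <= 200;

  // Each clause of the "spec" contributes to the score, so a failing
  // requirement is visible in the eval results, not hidden in an average.
  const score = (Number(hasCitation) + Number(concise)) / 2;
  return {
    name: "cited-and-concise",
    score,
    metadata: { hasCitation, wordCount },
  };
}
```

A generic scorer answers "is this output good in general"; a scorer like this answers "does this output meet the requirements of my product", which is the point being made here.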
There's been a real shift towards context in prompts, and that's not just the system prompt that you write. 00:04:11.820 |
People say this in different ways, but I 00:04:16.980 |
think traditional prompt engineering is evolving quite a bit. And it's very important to think about context, 00:04:21.960 |
not just a prompt. So this is an example of what kind of a modern prompt looks like for an agent. Usually you 00:04:29.760 |
have a system prompt and then a for loop, which, you know, runs LLM calls, issues tool calls, incorporates 00:04:37.260 |
the tool calls into the prompt, and then iterates and iterates. And I actually took a few trajectories 00:04:45.360 |
from agents that we see in the wild and summarized these numbers. And as you can see, a vast majority 00:04:51.660 |
of the tokens in the average prompt are not from the system prompt. And so, yes, it's very important to write a good 00:04:58.800 |
system prompt and continue to improve it. But if you're not very precise about how you define tools and how 00:05:05.460 |
you define their outputs, then you're leaving a lot on the table. And I think one of the most important 00:05:10.200 |
things we've learned together with some customers is that you can't just take tools as a reflection of 00:05:18.300 |
your APIs or your product as it exists today. You have to think about tools in terms of what the 00:05:24.960 |
LLM wants to see and how you can use, you know, exactly what you present to the LLM to make it work 00:05:31.620 |
really well. And I think that in most projects, it's actually very disruptive when you write good tools. It's not 00:05:39.900 |
something that's just like an API layer on top of the stuff that you already have. And the same is true with 00:05:44.460 |
their outputs. There's one example that we worked on recently for an internal project where 00:05:51.120 |
shifting the output of a tool from JSON to YAML actually made a significant difference. And I know 00:05:57.460 |
that's a little bit of a meme in the AI universe, but it's just so much more token efficient and easy for an LLM to look at 00:06:05.160 |
YAML-shaped data while doing analysis than extremely verbose JSON. Now, if you're writing code and you're 00:06:13.660 |
plugging something into, you know, a charting library, it makes no difference because to JavaScript, YAML and JSON are both 00:06:20.620 |
structured data. But to an LLM, they're very different. And so I think you have to be very, very thoughtful about, you know, how you 00:06:28.000 |
actually construct the definition of a tool and how you construct its output for the LLM to maximally benefit from it. 00:06:34.000 |
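As a small sketch of that kind of reshaping (the API response fields are hypothetical, and the example assumes the `yaml` npm package), a tool handler might trim and re-render its output before handing it to the model, rather than returning the raw API payload:

```typescript
import { stringify } from "yaml";

// Hypothetical verbose API response that a naive tool would return verbatim.
interface OrderApiResponse {
  id: string;
  status: string;
  total_cents: number;
  created_at: string;
  _links: Record<string, string>; // hypermedia noise the model never needs
  audit_trail: unknown[];         // large and irrelevant for this task
}

// Shape the tool output for the LLM: keep only what the model needs to
// reason about, and render it as compact YAML instead of verbose JSON.
function formatOrderForLLM(order: OrderApiResponse): string {
  const view = {
    id: order.id,
    status: order.status,
    total: `$${(order.total_cents / 100).toFixed(2)}`,
    created: order.created_at,
  };
  return stringify(view); // YAML: fewer tokens, easier for the model to scan
}
```

To a downstream program the two renderings are equivalent structured data; to the model, the trimmed YAML view is a very different, much cheaper input.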
So, I think one of the most important things we've learned, and actually, I would credit some of the folks at Repl.it for really pioneering this pattern, 00:06:47.340 |
but, you know, every time a new model comes out, everything might change. And I think you need to engineer your product, engineer your 00:06:56.200 |
team, engineer your, you know, mindset, so that when a new model comes out, if it changes everything for you, you can jump on that 00:07:04.000 |
opportunity, and ship something that maybe wasn't possible before. And I'm going to show you some numbers for a product feature that we're actually launching, and I'm going to show you a little bit of it today. 00:07:16.000 |
But we've had an eval for a while that tells us how well this feature might work. And we run it every few months. And you can see, you know, it wasn't that long ago that GPT-4o was the best model out there. But things have changed. And, you know, progressively, GPT-4.1 did a little bit better. 00:07:37.000 |
o3 is much better. And Claude 4 Sonnet is much better. And Claude 4 Opus is actually even more remarkably better. And what that's meant for us is that this feature that, you know, at 10% would really not be 00:07:51.000 |
viable for our users to use, suddenly becomes viable. And so, you know, Claude 4 actually came out two weeks ago. And we're shipping the first version of this feature today, which is just two weeks later. 00:08:03.000 |
But we were able to jump on that opportunity, because we ran this eval, we were ready to do it. And we saw that, okay, great, we've actually finally crossed this threshold. 00:08:13.000 |
So everyone that I personally work with or talk to, I encourage to create evals that are very, very ambitious, and likely not viable with today's models, and construct them in a way that when a new model comes out, you can just plug the new model in and try it. 00:08:31.000 |
In Braintrust, we have this tool called the Braintrust Proxy. There's a lot of similar tools, you could use ours, or you could use something else. But really, the point is that you don't need to change any code to work across model providers. And so, you know, Google just launched the newest version of Gemini. 00:08:49.000 |
Actually, Gemini 2.5 Pro 05-20 scores 1% on this benchmark. So we didn't even put it on here. But maybe the thing they launched today actually does a lot better. We can find out, you know, with just a few keystrokes, maybe right after this talk. 00:09:08.000 |
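A minimal sketch of that "swap models without code changes" setup, assuming an OpenAI-compatible proxy endpoint (the base URL and model names below are illustrative; check your proxy's docs):

```typescript
import OpenAI from "openai";

// Point an OpenAI-compatible client at a proxy and treat the model name as
// configuration rather than code.
const client = new OpenAI({
  baseURL: process.env.PROXY_URL ?? "https://api.braintrust.dev/v1/proxy", // illustrative
  apiKey: process.env.PROXY_API_KEY,
});

// Swapping "gpt-4o" for "claude-sonnet-4" or "gemini-2.5-pro" is then a
// config change, so a new model can be benchmarked the day it ships.
export async function runEvalCase(model: string, input: string) {
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: input }],
  });
  return res.choices[0].message.content;
}
```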
And the last thing is, it's super important if you think about optimizing your prompts to optimize the entire system. So that means thinking holistically about your AI system as the data that you use for your evals, 00:09:27.000 |
the task, which is, you know, the prompt, the agentic system tools, etc., and the scoring functions. And every time you think about making, you know, your app better, you need to think about improving this overall system. 00:09:42.000 |
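To picture that system as one unit, here is a minimal sketch assuming the shape of the Braintrust `Eval()` API; the project name, data, and task are placeholders:

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

// Placeholder task: in a real system this is your prompt / agent code.
async function answerQuestion(question: string): Promise<string> {
  return "Use the reset link on the login page.";
}

// Data, task, and scorers are defined together, so "optimizing the system"
// can mean editing any of the three, and the change shows up in one place.
Eval("support-bot", {
  data: () => [
    {
      input: "How do I reset my password?",
      expected: "Use the reset link on the login page.",
    },
  ],
  task: async (input: string) => answerQuestion(input),
  scores: [Factuality], // plus your own spec-like scorers
});
```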
We actually ran a benchmark, which is the same benchmark that I showed previously. It auto-optimizes prompts using an LLM. And we ran it once by just giving it the prompt and saying, like, hey, please optimize the prompt. 00:10:00.000 |
And a second time giving it the prompt, the dataset, and the scores and said, please optimize this whole system. And you can see there's a very dramatic difference. So, again, something goes from unviable to viable. 00:10:12.000 |
But it's just super important to optimize the entire system, not just the prompt. 00:10:19.000 |
And actually, this is a new product feature that we are starting to launch today. If you're a Braintrust user, you can go to the feature flag section of Braintrust and turn on a new feature flag called Loop. 00:10:34.000 |
And Loop is this amazing, cool new feature that actually auto-optimizes your evals directly within Braintrust. 00:10:44.000 |
So, you can work in our playground and give it, you know, a prompt, a dataset, and some scores. And it can actually create prompts, datasets, and scores, too, and just, you know, work with it. 00:10:58.000 |
The kinds of things that we've seen work really well are optimize this prompt, or what am I missing from this dataset that would be really good to test for this use case? 00:11:09.000 |
Why is my score so low? Or why is my score so high? Can you please help me write a score that is, you know, harsher than the one that I have right now? 00:11:19.000 |
You can also try it out with different models. So, as you can see from this, we've definitely seen the best performance with Claude 4 Sonnet. 00:11:28.000 |
And Claude 4 Opus performs a couple of percentage points better. But we encourage you to try it out with different models. 00:11:35.000 |
You can use o3. You can use o4-mini. You can use Gemini. Maybe you're building your own LLM or fine-tuned model. 00:11:42.000 |
You can try that as well. And yeah, we're very excited for this. I think I'm going to talk about this a little bit later, and I'm happy to do it with some Q&A as well. 00:11:52.000 |
But I actually, I really think that the workflow around evals is going to dramatically change. Now that LLMs are capable of looking at prompts and looking at data and actually making, you know, constructive improvements automatically, a lot of the manual labor that went into iterating with evals, 00:12:11.000 |
doesn't need to be there anymore. So, it's really exciting. We're excited to ship this and start to get some feedback. 00:12:21.000 |
So, just to recap, five lessons that I think are really important. Effective evals speak for themselves. It's important to understand whether you've kind of reached a point of eval competence in your organization or not. 00:12:33.000 |
It's okay if you haven't. It's not easy. But it's important to be honest about that and work towards it. When you're working on evals, it's very important to engineer the entire system. 00:12:45.000 |
So, don't just think about the prompt. Don't just think about improving the prompt. Please don't just use synthetic data or hugging face datasets. 00:12:53.000 |
I know they're awesome, but please use more than just that. Please don't use off-the-shelf scores only. Write your own. Think very deliberately about how you can craft the spec of what you're working on into your scoring functions. 00:13:07.000 |
Think very carefully about context. And I think, in particular, what helps me personally is to think about writing tools like I would think about writing a prompt. 00:13:17.000 |
It's my opportunity to communicate with an LLM and set it up for success. And how I define the API interface of the tool and how I define its output has a very dramatic impact on that. 00:13:30.000 |
Make sure that you're ready for new models to come out and to just change everything. So, if a new model comes out, you want to be prepared to know that immediately, ideally the day that it comes out, 00:13:43.000 |
and also be prepared to rip out everything and replace it with a fundamentally new architecture that takes advantage of that new model. 00:13:51.000 |
And I think part of that is obviously having the right evals. Part of it is engineering your product in a way that actually allows you to do that. 00:13:59.000 |
And then finally, when you think about optimizing or improving your eval performance, you have to think about optimizing the whole system: 00:14:08.000 |
the data and how you get that data, the task itself, which is, you know, the prompt, tools, et cetera, and the scoring functions. 00:14:22.000 |
Yeah, there's two microphones up here, one on the left side, one on the right side. Feel free to stand up and ask your questions. 00:14:31.000 |
Hi, this is Jyoti. One of your slides said take feedback and turn it into an eval. Are you concerned about overfitting evals at that point where every feedback then turns into an eval? 00:14:49.000 |
Oh, that's a great question. Also, nice to see you. So the question was, one of the slides was about taking feedback from, you know, real data and adding it to a dataset and incorporating it in an eval. Are you worried about overfitting? 00:15:05.000 |
And I think the answer is, I'm actually way more worried about overfitting to the dataset without the user's feedback than I am to adjusting the fit to incorporate the user's feedback. 00:15:17.000 |
Like, the most important thing about a dataset is not the state of the dataset at any point in time. It is how well you are equipped to reconcile the dataset with the reality that you want. 00:15:29.000 |
And I actually think one of the things that we discourage in the product, and some people complain to us about this. I get it if you're one of those people. 00:15:37.000 |
But we don't automatically take user feedback and add it to datasets right now. We actually want a human who has some taste and maybe can build some intuition about the problem to find the data points from users that are interesting and add them to the dataset. 00:15:54.000 |
And I think that is your opportunity as a user to apply some judgment about, like, oh, okay, this user is trying to do something that should obviously work. 00:16:02.000 |
It's really sad that it doesn't work in my product. Let me add it to the dataset so I can make sure it does. Excuse me. 00:16:09.000 |
You had a slide, I think, in the tool descriptions about, like, with some percentages on it. Yeah, this one. What is that? 00:16:17.000 |
Yeah, so we took a few agents. Like, we, you know, have a lot of traces. And we analyzed the relative number of tokens for different message types. 00:16:30.000 |
So the system prompt is one message type. Tool definitions are, you know, the spec of what tools the model can call. User and assistant are tokens from user and assistant, just text interactions. And then tool responses are tokens that, you know, the tool itself generates. 00:16:54.000 |
Correct. And this is the relative percentage of those tokens. Yeah. 00:16:59.000 |
Yeah, so the point that we're trying to make here is that I think in modern agentic systems, tools actually, like, very, very significantly dominate the token budget of the LLM. 00:17:13.000 |
And I think that it's very important to think about how you write the definition of tools and how you define their outputs, so that you, you know, set the LLM up for success, not just sort of take, you know, your GraphQL API and expose it as a bunch of, you know, tool calls to the LLM. 00:17:32.000 |
First off, that point about the thumbs down is such a good point. I'm working with the 00:17:39.720 |
government and people don't like the answer they got, for example, about taxes and they 00:17:45.160 |
give it a thumbs down. Yeah. Right. So like adding that human aspect is a really good 00:17:50.180 |
idea. We actually even added a little thing that said, "The answer is right, but I just 00:17:54.680 |
don't like it." That's awesome. But my question is about your point that the new model changes 00:18:01.520 |
everything. We've updated our models several times and use Claude and OpenAI, and we haven't 00:18:09.320 |
found huge differences other than recently someone really cheap wanted to use 4.1 mini, 00:18:16.540 |
and like it seemed to ignore every... I swear it ignored the system prompt completely. 00:18:23.000 |
Yeah. But what kind of things, when you say it changes everything, can you tell me a little 00:18:27.320 |
more about what kind of changes you're seeing? For sure. I think 00:18:31.040 |
the use case that we just shipped with Loop is a really good example of that. So this 00:18:35.040 |
is a very ambitious agent. It's looking at prompts and data sets and scores and automatically optimizing 00:18:43.640 |
the prompts based on the data sets and scores. And this is something that, you know, we wrote 00:18:49.260 |
a benchmark for a while ago, and we ran with every consecutive model launch, and the numbers 00:18:55.560 |
looked more like what you see for GPT-4o for a very long time. This isn't true for every 00:19:00.560 |
benchmark. So as part of this exercise, we actually have a bunch of evals that Loop optimizes. That's 00:19:08.040 |
our eval set. And there's some evals, like taking movie quotes and figuring out what movie 00:19:14.840 |
they come from, that have worked really well since GPT-3.5. And so there's certain use cases 00:19:20.840 |
where it just doesn't matter. There are other use cases where they're so ambitious that they 00:19:25.960 |
just don't work today. And I think you want to create evals so that if there's something ambitious 00:19:31.320 |
that you want to do in the future, you are very well prepared when a new model comes out to just