Evals 101 — Doug Guthrie, Braintrust

Hey everybody, my name is Doug Guthrie. I'm a solutions engineer at Braintrust. As you can see here, we're an end-to-end developer platform for building AI products. We do evals. If you watched the keynote this morning, you saw our founding engineer jumping up and down on stage yelling "evals." I'm not going to do that, and I'm not as funny or cool as he is, but we should all be very excited about evals.

Here's a brief agenda for this intro. A very quick company overview. An introduction to evals: why would you even start thinking about using them, what are they, and what components do you need to create one. Then some more Braintrust-specific things, like running evals via our SDK; you'll see in the examples that you can run evals both in the platform itself and through the SDK, and I think it's really cool that we can connect the local development we're doing with what's happening in the platform. Then how we move to production: human review, online scoring, understanding how well our application is performing in production. And lastly, a bit of human in the loop: getting feedback from your users, and how we take some of those production logs and feed them into the datasets we use in our evals, creating a really great flywheel effect.
Cool. A quick company overview. You can see up there in the top right some of our investors. A quick call-out on the leadership side: Ankur Goyal is our CEO. The reason to mention that is that at his last two stops, Ankur essentially built Braintrust from scratch, and that's where he found the idea that this might actually be something people need; that's really the origin story of Braintrust. The other thing to call out is that a lot of companies already use Braintrust in production today. These are just a few of the companies using us for evals and for observability of their gen AI applications.

I won't bore you too much with that, so let's jump in. If you were at the keynote, you probably saw a similar slide. I didn't take it out; you can see the tech luminaries, as I think Manu called them, talking about evals and their importance. This is obviously an intro-to-evals track, and what better way to start than with some really influential people in the space talking about evals and why they are so important.
So why would you think about evals? They help you answer questions. Here are a few: when I change the underlying model in my application, is it getting better or worse? When I change my prompt, or change other parts of the application, is it getting better or worse? We want to get away from not having a rigorous process around building with large language models, which, as you all know, is challenging because of non-deterministic outputs. Without evals in place, it becomes really hard to build a good application that we can put into production.

There are other reasons too, like being able to detect regressions in the code. Another thing Ankur mentioned when I first started (I didn't mention it, but this is my third week at Braintrust as a solutions engineer) really resonated with me. People tend to think of evals as something like unit tests for our applications, but he described them another way: evals are a great way for us to play offense, as opposed to just playing defense, which is what unit tests are mostly used for. They're a tool that can create a lot of rigor around how we build and develop these applications, and ensure we actually build things we can put into production.
From a business perspective, why would you think about running evals? Here are a few reasons. If you have evals running both offline and online, you create the feedback loop, or flywheel effect, that I think Manu mentioned in the keynote. That flywheel lets us cut dev time and enhance the quality of the application we're putting out in production. If you can connect what's happening in real life with your users in production, filter those logs down, add those spans to datasets, and use them to inform what you're doing offline, you create a flywheel that becomes really powerful from a development perspective. Again, a little more on the customer side: here are a few of our customers and some of the outcomes they've seen using Braintrust, whether that's moving faster, pushing more AI features into production, or simply increasing the quality of the applications they have.
So let's start talking about the core concepts of Braintrust. Obviously we're here to talk about evals. The arrows going one way and the other are, again, that flywheel effect I described earlier. There's the prompt engineering aspect: in Braintrust, think of the playground we have as an IDE for LLM outputs. Playgrounds allow for rapid prototyping; as we make changes, as we change the underlying model, what is the impact on that particular task or application? Evals then let us understand the improvement or regression from those changes. And then there's the observability aspect: the logs we generate in production, the ability to have a human review those logs in a really easy, intuitive interface, and user feedback from actual users logged into the application as well.

You've probably heard this several times this week if you stopped by the booth, but our definition is a structured test that checks how well your AI system performs. It helps you measure the things that are important: quality, reliability, correctness.
So what are the ingredients of an eval? I've been talking a little bit about the task: the code or the prompt we want to evaluate. The really cool thing about Braintrust is that this can be as simple as a single prompt, or it can be a full agentic workflow where we're calling out to tools. There's no limit on the complexity we put into the task; the only thing it requires is an input and an output.

Next is the dataset: essentially what we run the task against to understand how well our application is performing.

Then the score is really the logic behind your evals, and there are a couple of different ways to think about it. With an LLM-based score, you give it the output and some criteria and it assesses, say, "based on this output, is this excellent?", and those judgments correspond to values like 0.5 or 1. Other scores are more heuristic or binary and written in code. We can use both to aid in developing the eval and to ensure we're building a really good application.
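To make those three ingredients concrete, here's a minimal sketch using Braintrust's Python SDK; the project name, data rows, and scorer are made up for illustration.

```python
from braintrust import Eval

# Heuristic, code-based scorer: returns 1 or 0.
def exact_match(input, output, expected):
    return 1.0 if output == expected else 0.0

Eval(
    "intro-to-evals",                                 # project name (illustrative)
    data=lambda: [                                    # dataset: inputs plus expected outputs
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hi Bar"},
    ],
    task=lambda input: "Hi " + input,                 # task: the thing being evaluated
    scores=[exact_match],                             # scores: the logic behind the eval
)
```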
I just touched on this, but there are two mental models for evals: offline and online. Offline evals are where we actually do that iteration; it's identifying and resolving issues before deployment, and it's where we define those tasks and those scores. Online evals are the real-time tracing of the application in production: logging the model inputs and outputs, the intermediate steps, the tool calls, everything that's happening. That lets us diagnose performance and reliability issues and latency. Based on how you instrument your application with Braintrust, we can pull back lots of different metrics related to cost, tokens, and duration, and all of these help inform how we build the application.

I'll get into how we can instrument our app a bit later; first, let's talk more about the offline side and maybe level-set on how to improve. One of the things I've heard in conversations over the last few days is "how do I get started?" or "what do I do if X?" The other thing I heard from Ankur very early on is: just get started, create a baseline that you can then iterate and build from. A lot of people get caught up in creating a golden dataset of test cases before they iterate. You don't necessarily have to do that; start, build that baseline, establish the foundation you can then improve upon. This matrix is a really good high-level guide for where to target your efforts when you're building these apps and creating evals: if I have good output but a low score, improve your evals; if I have bad output and a high score, improve your evals or your scoring; and so on.
So let's jump into the actual components I just talked about. Within the Braintrust platform, we have a task. Again, this could be a single prompt or a full agentic workflow. In the very basic case you see in that GIF, you specify the underlying model you want to use. It has access to mustache templating, so you can pass in variables like the user's question or input, chat history, or metadata. When we later go to parse through the logs, that metadata becomes really beneficial, enabling us to do that in an easy way.

Going further, maybe we have a multi-turn chat scenario where we want to add additional messages for the system, the assistant, and the user, as well as our tool calls. The platform allows for that via the messages button, and then you can add those different messages to the prompt.

Oftentimes the prompt will need access to something, so we can use tools as part of the prompt: you encode in the prompt something like "make sure you use X tool," and the prompt has access to it.

The last one is a feature that's in beta right now: creating more of an agentic workflow within Braintrust. Right now it's a way to chain prompts together, where the output of one feeds into the next. If you think back to the slide where a prompt has access to tools, you create a pretty powerful system: the first step has access to a certain tool and produces some output, then the next step has access to maybe some other tools. That's the sort of multi-agent workflow we can create with the underlying tools.
The second thing I talked about is datasets. These are the test cases we want to give to the task to run; we get the output and then, going a little further down, actually score it. This is obviously really important when we're running our evals, and also when we're pulling from production: we can take the actual logs that are happening and add those spans, those traces, back to a dataset.

If you look at the bottom, the only thing that's required is the input. You also have the ability to add an expected value, the expected output for that input, and then you can create a score that compares the output with the expected value. There's a score called Levenshtein, for example, that measures the difference between the two. So you can do different things based on what you provide in that dataset. You can also attach metadata, filter on different things, pull the dataset into your own code base, and filter by X, Y, or Z via that metadata.
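Here's a rough sketch of creating a dataset record from code with the Python SDK; the project and dataset names, the record contents, and the metadata keys are all illustrative.

```python
from braintrust import init_dataset

# Project and dataset names are made up for this example.
dataset = init_dataset(project="changelog-generator", name="eval-dataset")

dataset.insert(
    input={"commits": ["fix: handle empty repo", "feat: add dark mode"]},
    expected="## v1.2.0\n- Added dark mode\n- Fixed a crash on empty repos",  # optional
    metadata={"source": "hand-written", "team": "platform"},                  # optional, filterable
)
dataset.flush()  # push any buffered records before the script exits
```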
I mentioned this a little bit ago, but just start small and iterate. You don't have to create a golden dataset to get started; just start, then continue to iterate and build from that baseline.

The human review portion also becomes really powerful here. When we have things being logged in production, humans can actually go through those logs (there are lots of different ways to filter them down to the things they care about), and then we can decide to add those records to certain datasets, which then inform our offline evals.
The last ingredient we need for our evals is our scores. Again, some of these are more binary, code-based conditions: you can write them in TypeScript or Python, either within the UI, as you see on the bottom left, or within your own code base, and then push them into Braintrust so that other users who aren't in the code base can use that score as well. The other kind of score we have access to is LLM-as-a-judge. This lets us use an LLM to judge the output: we give it a set of criteria that indicates what a good, fair, or bad output looks like (you get to decide what that is), and it maps each judgment to a value, say a 1 if it's good. This starts to create the scores we can use in both the offline and the online sense.

The other thing to call out is that we've internally built a package called autoevals. It's something you can pull into your project: out-of-the-box scores, both LLM-as-a-judge and code-based, that let you get started very quickly. Another thing I heard Ankur mention: starting with Levenshtein may not be the best score in a lot of cases, but again, it establishes a baseline with very little development work, it creates something you can build from, and now you have a direction to go in to build a more custom score.
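As a sense of how little work that baseline takes, here's what calling the Levenshtein scorer from the autoevals package looks like on a single example (the strings are made up):

```python
from autoevals import Levenshtein

# Compares an output string to the expected string; .score is between 0 and 1.
result = Levenshtein()(output="Adds dark mode", expected="Added dark mode")
print(result.score)
```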
Here are some tips we've heard from our customers that are important to think about. A lot of our customers use higher-quality models for scoring, even if the prompt itself uses a cheaper model; it makes a lot of sense to run the application on the cheaper model but use the more expensive one to actually go score it. Also, break your scoring into very focused areas. The example I'll show is an application that generates a changelog from a series of commits. I could create one score that assesses accuracy, formatting, and correctness all at once, or I could create three different scores that assess accuracy, formatting, and correctness separately. Have your scores be very targeted to the thing they're supposed to be doing. Test your scoring prompts in the playground before use. And avoid overloading the score prompt with context; focus it on the relevant input and output.
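Here's a hedged sketch of what that might look like for the changelog example: one code-based scorer that only checks formatting, and one LLM-as-a-judge scorer that only checks accuracy and uses a stronger judge model. The criteria, prompt text, and choice of judge model are assumptions, not the exact scorers from the demo.

```python
from autoevals import LLMClassifier

# Code-based scorer focused purely on formatting: is the changelog a markdown list?
def formatting(input, output, expected):
    lines = [line for line in output.splitlines() if line.strip()]
    ok = bool(lines) and all(line.lstrip().startswith(("-", "*", "#")) for line in lines)
    return 1.0 if ok else 0.0

# LLM-as-a-judge scorer focused purely on accuracy.
accuracy = LLMClassifier(
    name="Accuracy",
    prompt_template=(
        "Commits:\n{{input}}\n\nChangelog:\n{{output}}\n\n"
        "Does the changelog faithfully reflect the commits? Answer Yes or No."
    ),
    choice_scores={"Yes": 1.0, "No": 0.0},
    model="gpt-4o",  # assumption: a higher-quality judge than the model generating the changelog
)
```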
A couple of things here. Over on the left we have our playgrounds. This is where we do that rapid iteration: we can pull in prompts, pull in agents, add our datasets, and add our scores. We click run, and it churns through the dataset we've defined and gives you a sense of how well your task is performing against that dataset with the scores we defined. This is the place where developers and PMs work; we even have a healthcare company with doctors coming into the platform, interacting with the playground, and doing human review as well. It depends a little bit on the organization.

Experiments are our snapshot in time of those evals. Imagine that as we do this development we want to understand, over the last month or so, are we getting better? With the changes we're making, the model changes, whatever it is, are we improving the application? Experiments are a really great way to track that. It's also really important to call out that evals can run from the application, the Braintrust platform, as well as via the SDK.
Maybe just really quickly, because nobody likes looking at slides all the time: if you haven't seen Braintrust yet, here's a quick demo. The idea is that I have this application: it grabs the most recent commits and creates a changelog from them, and once it completes you can even provide some user feedback. This is the thing we want to evaluate.

So I'll go into my playground. This is where I can start to run those experiments, or start to iterate on the prompt that I have. From my project I've loaded in two different prompts. Before going into the playground, I actually created these two prompts in my code base and pushed them into the Braintrust platform. I also created a dataset in that code base, and I created some scores. These are the ingredients we need to run our evals. Now that we have those different components, I'll create a net-new playground and load in one of these prompts.
My first prompt has a model associated with it. What becomes really cool here is the ability to iterate on the underlying model. A lot of us have access to a lot of providers, and we want to understand: if I change this, or a provider adds a new model, what's the impact? So I can duplicate this prompt and maybe change the model to GPT-4.1. Before I run it, I have to add all of my components: I'll add my dataset and then the different scores I've created. Then I click run, and now we'll understand the effect of changing the model for this particular task, with the scores I've configured, against this dataset. It churns through all of these in parallel and we start to get some results back.

There are lots of different ways to look at this data. I always like coming over to the summary layout. You can see over here my base task and my comparison task. It looks like, on average, the base is performing a little better than my comparison task on my completeness score; it's faring a little worse on my accuracy score; and both of them are at zero percent on my formatting score, so I probably have some work to do there. But you can start to see how you can use this type of interface to compare changes.
Now, the other thing, which maybe I shouldn't do, but I can't resist because we just released it: imagine you're a user within Braintrust, and before this you would manually iterate here, creating net-new prompts and making modifications by hand. What if you could use AI to go do that for you? In a cursor-like interface, we can ask it to optimize a prompt. The really unique thing is that it has access to those evaluation results, so when it goes to change the prompt, it understands whether the change got better relative to the scores we defined. You can see it going through here, fetching some eval results; you'll probably see a diff here very soon. If we don't, I won't hang out here too long, but I do want to highlight this as one of the things we're releasing that really enables our users to iterate fast. We can click accept and it'll go run that eval again, or it would; I think I have an issue with my Anthropic API keys. But the idea is that we can create a very rapid, iterative feedback loop here within the playground.

The other thing is that we can run these as experiments. This is where we start to create those snapshots in time of the eval and see, as I make changes to the application, how it's performing. I don't want to decrease the performance of my scores relative to the last time it ran, and I also want to look out over the last month, six months, whatever it is we're tracking.
So that was a very brief intro to evals via the UI. Just to summarize: we need a task, we need a dataset, and we need at least one score. We can pull those into the playground and start to iterate, and we can save runs as experiments. Now we have a way to understand how well this application is performing. It's no longer qualitative; it isn't "hey, I think this got better, that output looks better." There's actual rigor behind it now.

Customers often ask, though: "I don't really want to use the platform as much; I'd rather do this from my code base. Is that possible?" And it is. We have a Python SDK and a TypeScript SDK, and there are some others as well: Go, Java, Kotlin. For the most part, our users are using Python or TypeScript. Here are a couple of examples of what this might look like from an SDK perspective.
Actually, if you all aren't opposed to looking at some code, here's a really basic example of defining a prompt within my code base and then pushing it into Braintrust, just leveraging that Python SDK. Another example is creating a score: you give it the things it's looking for, but now I've defined this score within my code base and it's version controlled. (The prompts you create within the UI are version controlled as well.) This is just another way to start interacting with Braintrust. So scores, datasets, and I believe prompts as well: here's my eval dataset, here's my changelog prompt. Being able to start from the code base and push them into the platform is possible; it just depends on where the organization wants to start. This covers the components-of-the-eval side: define those assets in code, run braintrust push, and now you have access to them in your Braintrust library.
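Here's a rough sketch of what that push workflow can look like with the Python SDK, assuming the braintrust.projects interface; the project name, slug, model, and message text are made up, and the mustache variable shows how inputs get templated in.

```python
# push_assets.py -- pushed with something like: braintrust push push_assets.py
import braintrust

project = braintrust.projects.create(name="changelog-generator")  # illustrative name

project.prompts.create(
    name="Changelog prompt",
    slug="changelog-prompt",
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You write concise, well-formatted changelogs."},
        # Mustache variable filled from the dataset row or the caller at runtime.
        {"role": "user", "content": "Write a changelog for these commits:\n{{commits}}"},
    ],
)
```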
The other workflow is actually defining the evals themselves in code, which looks slightly different. We have our Eval, which is just a class coming from the Braintrust SDK, and it's looking for the exact same things I just described: a dataset (which we can pull from Braintrust itself), the task we want to invoke, and the scores. Defining this within your code base is certainly possible, and then I can run a command that actually runs that eval in Braintrust. From here I can go into Braintrust, see the eval running, and it becomes an experiment I can view over time. So you saw two different types of workflows, catering to maybe two different personas, or to the way an organization wants to work; it's up to them. Braintrust is very flexible in how our users can work.
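A minimal sketch of that second workflow, again with made-up names and assuming a generate_changelog task function exists in your code base; the file is typically executed with the braintrust eval command.

```python
# eval_changelog.py -- run with: braintrust eval eval_changelog.py
from braintrust import Eval, init_dataset
from autoevals import Levenshtein

from my_app import generate_changelog  # assumption: your own task function

Eval(
    "changelog-generator",                                        # project (illustrative)
    data=init_dataset(project="changelog-generator", name="eval-dataset"),
    task=generate_changelog,                                      # called with each row's input
    scores=[Levenshtein],                                         # swap in your custom scorers here
)
```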
I probably jumped ahead a little bit, but this is a recap of what I just showed you. From your code, you create your prompts, your scores, your datasets, and you can push them in. It's important to highlight why you would do this: you want to source-control your prompts, and the big one to call out is online scoring. I have a section diving deeper into that in a bit, but if you want to use the scores and datasets you define in code against live traffic, you should push them into Braintrust so we can create online scores and understand how the application is performing in production relative to the same scores we're using in our offline evals.

And then the other variation I just showed: defining the eval in code. Defining that dataset, that task, and those scores makes it very easy to connect these two things; it's up to you to decide where you want to do it. The other thing to call out is that this can be run via CI/CD. We have customers who run their evals as part of the CI process, so in an automated way they understand: for the scores they've configured, has it gotten better or worse? This becomes a check as part of CI. If you look in our documentation, there's a GitHub Action example that shows how you could set this up.
Moving to production entails setting up logging: instrumenting our application with Braintrust code, being able to say "I want to wrap this LLM client" or "I want to wrap this particular function when it goes to call that tool." That's very easy to do. But why should you do it? I've probably said it numerous times here, but we want to measure quality on live traffic. We want to understand how well our application is performing, with those same scores that are so useful during offline evals for ensuring we build good applications and don't create regressions; it's also really important to monitor that live traffic. The other important thing is the flywheel effect this creates. We have datasets that inform our offline evals, and it's very easy to take the logs generated in production and add them back to those datasets. This also speaks to the human review component, where we bring humans in: they can review the logs that are relevant (maybe user feedback equals zero, maybe there's a comment, whatever it is, they can filter down to those particular things), and as they find interesting test cases, it's very easy to add them back to the dataset we use in our offline evals. I think the feedback loop, the flywheel effect this creates, is one of the really fundamental value props of the platform.

So how do we do this?
There are a couple of different pieces. First we initialize a logger; this just authenticates us into Braintrust and points us to a project. You may have noticed when I opened the platform that I had numerous projects in there. You can think of a project as a container for a feature: you probably have multiple AI features you're building, and I want a container for feature A that holds its prompts, scores, and datasets. You can certainly use those assets across projects, but it's a good way to contain the things that matter for that feature.

Then you can start really basic and wrap an LLM client. When you saw those metrics like tokens, LLM duration, and cost: in the code here, I just wrapped the OpenAI client, and now I'm ingesting all of those metrics into my logs. That's the easiest way to get started. You'll probably want to do a bit more. Maybe you want to understand when the LLM invokes a tool, so I can add a trace decorator on top of a function. I can even use Braintrust's low-level span primitives to create custom logs, customizing the input, the output, and the metadata we log to that span. So you can start very basic with wrapping a client, and then go all the way down to the individual span, specifying the input and output yourself.
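Here's a condensed sketch of those three levels of instrumentation in Python: wrapping the client, tracing a function, and logging a span's input, output, and metadata explicitly. The project name, functions, and metadata are illustrative stand-ins for whatever your application actually does.

```python
from braintrust import current_span, init_logger, traced, wrap_openai
from openai import OpenAI

logger = init_logger(project="changelog-generator")  # authenticates and picks the project
client = wrap_openai(OpenAI())                       # tokens, cost, and latency get logged automatically

@traced  # each call shows up as its own span in the trace
def fetch_commits(repo: str) -> list[str]:
    return ["fix: handle empty repo", "feat: add dark mode"]  # stand-in for a real GitHub call

@traced
def generate_changelog(repo: str) -> str:
    commits = fetch_commits(repo)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Write a changelog for:\n" + "\n".join(commits)}],
    )
    output = resp.choices[0].message.content
    # Shape this span's input/output/metadata so production logs line up with
    # the dataset structure used in offline evals.
    current_span().log(input={"repo": repo}, output=output, metadata={"env": "prod"})
    return output
```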
This leads us to online scoring, which I touched on earlier. As our logs come in, we can configure, within the platform, the scores we want to run, and we can specify a sampling rate so we don't necessarily run a score across every single log: maybe it's 10%, 20%, and so on. It creates that really tight feedback loop I've been talking about. It's also worth mentioning early regression alerts: we can create automations within the Braintrust platform, so if my score drops below a certain threshold, we create an alert with our automation feature.

Then there are custom views, and I can maybe walk through what this looks like instead of just showing it on a slide. There's a lot of really rich information in these logs, and it becomes really important, again for the human review component, to filter them down to the things that someone cares about. So we can create custom views within Braintrust with the appropriate filters, and then it's very easy for a human to go into what we call human review mode and parse through the logs that are most meaningful to them.
Let me connect some of those dots. Again, showing you code, maybe good, maybe bad, but I'm guessing there are some technical people in the room who don't mind. You may have seen in one of those slides the Vercel AI SDK: I want to wrap that AI SDK model. Again, this lets us capture all of those metrics in Braintrust with essentially zero lift from us as developers. You can also see where I've specified the span itself: I actually define the inputs and outputs of that span. The reason you'd do that is that you have a specific dataset with a structure you want the logs to map to, so that when you're parsing through the logs it's really easy to add those spans back to the dataset. Keeping that data structure consistent across offline and online becomes really important, again, to create that feedback loop. So this is a very high-level view of how we can start to create those spans.

Now that we do, we can go into the platform and configure our online scoring. Here, within this configuration pane, I can click online scoring and create a new rule, and this is where we add different scores. I have a few I've been using for the offline evals; I don't need to select all of them, but I certainly can. Then I apply a sampling rate; since I want to show you an example of what this looks like, I'm going to do a hundred percent. The other thing to call out is that you can apply these to individual spans themselves, not just the entire root span. Where this becomes beneficial is when you're invoking tool calls or a RAG workflow and you want to score whether the thing it gave back is actually relevant to the user's query: we can create a score specifically for that and point it at that span. So it's very flexible in how you apply these scores to the things that are happening online.
Now, when I come back to the application and run it again to create that changelog, you'll start to see it show up in the logs, and then you'll start to see these scores be generated. This is where you can start to understand, over time in production, how are these things doing? How are they faring? Where can we get better? Again, now we can connect the things happening in our offline evals with the things happening online.

The other thing to call out is the feedback mechanism. We certainly have the ability to do human review, but oftentimes you want your users to provide feedback as well. This is just a basic example of a thumbs up, thumbs down, and you can even provide a comment. This can now be logged to Braintrust, so over here I should see my user feedback: here's my comment, and here's my user feedback score.
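Capturing that thumbs up/down from your own UI is typically a small handler that logs feedback against the span that served the request. A rough sketch, assuming you kept the span's id around when the original request was logged (the project name and score name are made up):

```python
from braintrust import init_logger

logger = init_logger(project="changelog-generator")  # illustrative project name

def handle_feedback(request_span_id: str, thumbs_up: bool, comment: str | None = None):
    # Attaches a user_feedback score (and an optional comment) to the original log.
    logger.log_feedback(
        id=request_span_id,
        scores={"user_feedback": 1 if thumbs_up else 0},
        comment=comment,
    )
```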
Now I can also do something like this: maybe I want to filter my logs down to where user feedback is zero. I click that button and change the filter to zero. I don't have any rows like that yet, but I can save this as a view, and people doing human review can filter down to where user feedback equals zero and figure out what's going on: what fell down in this application that we need to go fix?

The other thing I'll highlight is the human review component. You click that button, or just hit R, and it opens a different pane of your log. It's a pared-down version of what you just saw, a little easier for a human to go through and actually look at the input and the output. And as a Braintrust user, you can configure the human review scores you'd like to use. I have one here that's more free text, and I have another score; these are the things you can add to your project that map to the scores you want your human reviewers to add to those logs.
Really quickly, I'll highlight what it starts to look like when you instrument your application with those different wrappers or trace functions. I'm able to understand, at a very granular level, the things it's doing: I essentially have these tool calls where it goes out and grabs the commits from GitHub, understands what the latest release is, fetches the commits from that release, and then generates the changelog. Again, the really unique thing, and maybe a different example of this, is that I can start to score those individual things that are happening.

So here's a different application: a conversational analytics application, where a user can ask a question and get some data back. This application goes through various steps. The first step is to rephrase the question the user asked: imagine there's chat history that we load in as input, and the LLM needs to rephrase that user question. If the LLM does a bad job of rephrasing the question, everything downstream will fall down; we're probably not going to get a right answer. So I can create a score specifically for that span, to understand how well the LLM did at rephrasing the question. You can also check the intent the LLM derived from the question: is that right? As you start to build these more complex applications, you need to be able to understand the individual steps that are happening, and Braintrust allows for that very easily via these scores, applying them not only while you're in offline eval mode but also online: we want to understand these logs and apply scores at the individual span level. That becomes pretty powerful as well.
I think I actually stole from my next section, human in the loop; I've kind of walked through this a little bit already. Maybe just another call-out: if you were at one of our workshops on Tuesday, Sarah from Notion, who's a Braintrust customer, talked about how they think about human in the loop. It's important to consider the size of her organization and what they're doing. She mentioned that they have a special type of role for human-in-the-loop interaction, almost like a product manager mixed with an LLM specialist; those are the people going through and doing the human reviews. For smaller organizations, she commented that it actually makes a lot of sense for some of the engineers to go through and do this as well. It becomes really powerful to pair the automation with a human component; this is not going to go away, and I think it adds value to the whole process.
Again, I think I just stole from myself on why this matters: it's really critical for the quality and reliability of your application, and it provides ground truth for what you're doing. There are two types of human-in-the-loop interactions here. I walked you through human review: being able to create an interface within Braintrust that lets a reviewer parse through the logs in a really easy manner, as well as configuring scores so they can add the relevant scores to a particular log. And then user feedback: this comes from the actual users of the application, and you can create views on top of that feedback that then power the human review and create the flywheel effect we want.

That's all I have today. I appreciate you all coming out here and listening to me. But yeah, if you have a question. Thanks.
Yeah, the question is about how we're using human review and some of the logs to inform the offline eval portion of this. Largely, that's the goal.
One thing I maybe didn't highlight: back within Braintrust, I'll go to my initial project. Imagine we have all of these logs and we've filtered them down to a particular... oh. Awesome, thank you. So imagine we have this process now, where we're doing that human review and we've filtered down to the records that are meaningful, for whatever reason. It becomes really easy again to connect what's happening within production: I can select all of these rows, or individual rows, and add them back to the dataset we're using in those offline evals. I think I've said this a hundred times over this conference: this flywheel effect is what's often missing when we're building these AI applications, and it's what Braintrust allows for really seamlessly.
I have two questions. The first one is about production: is it possible to have multiple models in production and compare how they behave?

Yeah, I don't see why not. My guess is that in the underlying application you're swapping the models in and out.

Like an A/B test, you know, I can have two or three or four and compare them.

Absolutely. Let's see if I have an example here. You're able to group some of these scores; maybe this is an example of what you're talking about. Say that within production we have different models running: this view here lets us understand the models we're using under the hood. You could do this in production as well and do that sort of A/B testing.
Cool. My second question is about humans in the loop. Suppose I have multiple humans and they behave slightly differently as scorers. Do you have anything for that, or what's the vision? Is there a way I can actually compare how they're scoring, or not really?

So, different users can have different criteria for scoring. The first thing I'd say is that there should probably be a rubric for the users who are interacting with human review, so you're not creating that situation in the first place. You certainly have the ability to see who scored what within the platform. I'm not sure if you're able to pull that out as a dataset to assess the differences, but before it gets to that point, have a rubric, have a guideline of what scoring looks like for your humans.
Hi. The scores I'm used to working with for LLM-as-a-judge are relativistic: they can't tell you whether answer relevancy is good or bad for a single run, but they can tell you how it compares to a previous iteration on the same test set, for example. Do you use LLM-as-a-judge scores online? Are they relativistic like that, or do you have some way to say "this is a good answer in and of itself for this sample," given that you have new data coming in?

Yeah, I think a lot of our customers thinking about this are almost doing evals on their evals: trying to understand whether the LLM-as-a-judge actually did a good job. When it runs, there's a rationale behind it, so you can run an eval of those LLM-as-a-judge scores themselves. I think Sarah from Notion described a process like that in our workshop, and that's where I would aim you.
Are any of your customers doing evals before they launch? I'm working with a government; they don't want to launch until we show certain accuracy levels. So we're getting our subject matter experts to enter all the questions they have, and they have huge datasets of thousands of questions, believe me, it's the government. And then we're using measures like you're talking about. Do you have a way to do that? I guess it's the same, is it?

So what you're describing is what we call offline evals. This is development: we can do this testing before we get into production. This is what I was talking about: establish that baseline, using those scores, using that dataset you've already created. It all happens before we get into production. One of the things I heard from somebody earlier is that one of the challenging parts of building an AI application is establishing trust in the thing. That's part of what this is: showing those people the scores of the application as you iterate on it. Maybe it starts at 20%, then goes to 30, then 40, and so on. That, to me, is what you use to create that trust and create the groundswell to push it into production.

Okay, yeah, that is what we're trying to do; I just wondered if the tool does that. Thank you.
Yeah, of course. Time for one more.

Thanks. Quick question. I love the CI/CD components. We're trying to build ML as a platform for our team, so we get into evals and things like that. How much of the monitoring dashboard you have in Braintrust can actually have its data pulled out and posted to a unified dashboard somewhere else?

Yeah, all of this is available via the SDK. You can pull down experiments, you can pull down datasets. We have a customer that is actually building their own UI on top of the SDK itself: they built their own components using the SDK, pulling the things we've logged and the experiments we have in the application into their own UI. So it's certainly possible.
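As a flavor of that, here's a hedged sketch of pulling a dataset's records down with the Python SDK so they could feed another dashboard; the project and dataset names are made up, and experiments can similarly be fetched through the SDK or the REST API.

```python
import braintrust

# Iterate a dataset's records (names are illustrative).
dataset = braintrust.init_dataset(project="changelog-generator", name="eval-dataset")
rows = [{"input": r["input"], "expected": r.get("expected")} for r in dataset]
print(f"pulled {len(rows)} records")
```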