Building Reliable Agents: Agent Evaluations

Hopefully you enjoyed the last hour or so and had a good chance to talk with other attendees, sponsors, or speakers as they were leaving. The way LangChain got started was by going to events like these, so I'd highly encourage you to take advantage of that.

Before this, we heard a bunch of talks about agents and how people are building them, and one theme that was mentioned a few times was evals. For the next series of talks, we're going to deep dive into that. We've got a few speakers who are going to shed a lot of light on how they are using evals. But before that, I wanted to take a little bit of time to set the context for why we think evals are important and some of the things we are doing in the space.
We ran a survey of agent builders about six months ago where we asked them what the biggest blocker was for getting more agents into production. The number one thing they cited, by far, was quality. We talked a little bit with Michele about the trade-offs between quality and latency, and quality is still the top thing blocking people from getting to production.

In order to bridge that gap from prototype to production and increase that quality, one of the techniques we've seen people adopt is eval-driven development: using evals throughout a bunch of different stages of development to measure your app's performance and make sure you're constantly climbing that ladder of performance.
One of the things I want to emphasize is that evals are really a continuous journey. There are three different types of evals that we see people running. Most people are thinking about evals in maybe one, maybe two of these, but we think it's a continuous journey throughout the whole life cycle.
The first type is offline evals. This is probably what most people think of when they hear the word evals. Before you go into production, you get some data set, you run your app against that data set, and then you measure performance using some evaluators to score it. You can track this performance over time, compare different models, different prompts, things like that, and get a sense for whether the changes you're making are actually increasing or decreasing your app's performance on the data set you've constructed.
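To make that concrete, here's a minimal sketch of an offline eval loop in plain Python. The dataset, the run_app function, and the exact-match evaluator are all placeholders for whatever your application and scoring logic look like; tools like LangSmith wrap this kind of loop for you, but the shape is the same.

```python
# A minimal offline eval loop (illustrative only; names are placeholders).

# A small "golden" dataset of inputs and expected outputs.
dataset = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

def run_app(user_input: str) -> str:
    """Stand-in for your agent or LLM application."""
    return "4" if "2 + 2" in user_input else "Paris"

def exact_match(output: str, expected: str) -> float:
    """Simple reference-based evaluator: 1.0 if the output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

scores = []
for example in dataset:
    output = run_app(example["input"])
    scores.append(exact_match(output, example["expected"]))

# Track this number over time as you change prompts or models.
print(f"exact_match: {sum(scores) / len(scores):.2f}")
```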
Of course, that data set isn't perfect, and so there's another type of eval called online evals. This is when you take your app that's running in production, take some subset of the data that's coming into it, and score that. You can then start to track the performance of your app in real time as it's running on real traffic. These are real queries that users are sending in, so it's not a static golden data set. These are the two types of evals that people are most familiar with.
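For the online case, the same pieces show up, but over a sample of live traffic and without a reference answer. Here's a rough sketch; the sampling rate, the stub app, and the logging function are all made-up placeholders, not any particular product's API.

```python
import random

SAMPLE_RATE = 0.1  # score roughly 10% of production traffic

def my_app(user_input: str) -> str:
    """Stand-in for your production agent."""
    return f"Answering: {user_input}"

def reference_free_eval(user_input: str, output: str) -> float:
    """Stand-in for a reference-free evaluator (often an LLM judge in practice)."""
    return 1.0 if output.strip() else 0.0

def log_feedback(user_input: str, output: str, score: float) -> None:
    """Stand-in for sending the score to your observability / eval tool."""
    print(f"score={score} input={user_input!r}")

def handle_request(user_input: str) -> str:
    output = my_app(user_input)
    # Only a sampled subset of live traffic gets scored, to keep cost down.
    if random.random() < SAMPLE_RATE:
        log_feedback(user_input, output, reference_free_eval(user_input, output))
    return output

handle_request("How do I reset my password?")
```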
But we also think there's a third type of eval, which we call in-the-loop evals. These are evals that occur while the agent is running. Michele talked a little bit about some of what they were doing at Replit with this, where they were adding evals based on trying the output out, testing it with browser use, or running code against it. Those are examples of running evals in the loop to correct the agent as it's running. If it messes up, like in any of these scenarios, you can feed that back into the agent and have it self-correct. You can add this in a bunch of different domains and use it to check the agent: it improves the response before the agent sends it, and it can block bad responses. The big downside is that this takes more time and costs more money, so we see it commonly being used when the tolerance for mistakes is really low or when latency is not an issue. And as we see more and more long-running agents, I actually think that's a perfect time to start thinking about putting these in-the-loop evals into your agent.
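Here's a rough illustration of that in-the-loop pattern (not Replit's actual setup): a cheap check runs on each draft response, and failures are fed back to the agent for a bounded number of retries. The agent stub and the JSON check are placeholders.

```python
import json

def agent(task: str, feedback: str | None = None) -> str:
    """Stand-in for your agent; `feedback` carries the evaluator's error message."""
    if feedback is None:
        return "not json"  # first attempt is deliberately bad for the demo
    return json.dumps({"task": task, "status": "done"})

def check_valid_json(output: str) -> str | None:
    """In-the-loop evaluator: return an error message, or None if the output passes."""
    try:
        json.loads(output)
        return None
    except json.JSONDecodeError as e:
        return f"Response must be valid JSON: {e}"

def run_with_in_the_loop_eval(task: str, max_retries: int = 2) -> str:
    feedback = None
    for _ in range(max_retries + 1):
        output = agent(task, feedback)
        feedback = check_valid_json(output)
        if feedback is None:
            return output  # passed the check, safe to respond
    raise RuntimeError("Agent could not produce a valid response")

print(run_with_in_the_loop_eval("summarize the ticket"))
```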
When we think about evals, there are generally two parts: the data that you run over and the evaluators that you use. Across those three types of evals, the data and the evaluators look a bit different. In the offline case, you've got your data set. In online evals, the data is the production data. And for in-the-loop evals, it's also production data, but it's being evaluated in the loop while the agent runs. The evaluators can be different as well. When you have a golden truth data set you can compare against, those are what we call ground truth or reference evals. Reference-free evals are when you don't have this ground truth, and that's what you do online or in the loop, because you don't know what the ground truth is. But no matter what type of eval you're doing, data and evaluators are the two parts.
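In code, the distinction mostly shows up in the evaluator's signature: a reference evaluator receives the expected output, while a reference-free one has to judge the output on its own. A tiny illustrative example (the names and scoring rules are arbitrary):

```python
def correctness(inputs: str, outputs: str, reference_outputs: str) -> float:
    """Reference (ground truth) evaluator: compares against a known answer."""
    return 1.0 if outputs.strip() == reference_outputs.strip() else 0.0

def conciseness(inputs: str, outputs: str) -> float:
    """Reference-free evaluator: judges the output on its own, no ground truth needed."""
    return 1.0 if len(outputs.split()) <= 50 else 0.0
```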
We want to make it easy for people to build data sets and run things over their data, as well as build their own evaluators. One thing we've noticed is that the academic data sets you might see or get started with aren't representative of how users are actually using your application; they're oftentimes not even in the same domain. So when we talk about data and evals, it's really about making it easy for any application developer to build a data set or build evaluators that are specific to their use case.
Traces are what you run these online evaluators over. We track not only the inputs and outputs but all the intermediate steps as well, and you can then run evaluators over these traces for the online evals part. We've also made it really easy to take these runs and start building up a ground truth data set for offline evals. There's a nice little button in LangSmith that you can click, add to data set. It takes the input/output pair, lets you modify the output if needed, and then adds it to a data set. So we've tried to make it really easy for people to build up these data sets in LangSmith, powered by the observability. One of the things we like to say is that great evals start with great observability, and that's how we think about it in LangSmith.
Now let's talk about evaluators for a little bit. There are a few different types of evaluators that we see people using. The first is just using code to evaluate things: exact match, regex, checking whether the output is valid JSON, things like that.
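Checks like these are easy to write with nothing but the standard library; for example (the regex pattern and order-ID format are made up):

```python
import json
import re

def is_exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def contains_order_id(output: str) -> bool:
    # Regex check: does the response mention an order ID like "ORD-12345"?
    return re.search(r"ORD-\d{5}", output) is not None

def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"status": "ok"}'))  # True
```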
But deterministic checks like these oftentimes aren't representative of all the things you want to catch.
So one of the things we see popping up here is LLM-as-a-judge: using an LLM to score the outputs of your agent or LLM application. This is useful because an LLM judge can handle more complex, open-ended things. It has its own challenges, which I'll come back to, but generally the idea of using an LLM to judge outputs is pretty promising.
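A bare-bones LLM-as-a-judge evaluator can look something like the sketch below. The prompt, the model name, and the use of the OpenAI client are illustrative choices on my part, not a prescribed setup; in practice you'd want a more careful rubric and structured output.

```python
from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY set

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Reply with a single number: 1 if the answer is helpful and correct, 0 otherwise."""

def llm_judge(question: str, answer: str) -> float:
    """Reference-free LLM-as-a-judge evaluator (illustrative prompt and model)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    # Naive parsing for the sketch; a real setup would enforce structured output.
    return float(response.choices[0].message.content.strip())

score = llm_judge("What is the capital of France?", "Paris is the capital of France.")
```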
And the third type that we see is just good old human annotation. This can happen live from the user as they're using the app: you can collect thumbs up, thumbs down, send those to LangSmith, and track them there. Or you can have a human go in the background and use something like our annotation queues to score these runs.
One of the things we've been building over the past month or so is a set of open source evaluators, to make it easy to get started. There are a few common use cases you can cover off the shelf, including things like code, RAG, extraction, and tool calling. For code, for example, we have some off-the-shelf utils that will lint Python code or TypeScript code, and then you can take those errors and feed them back into the LLM.
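I won't reproduce the openevals utilities here, but the underlying idea is roughly this: try to parse or lint the generated code, and turn any errors into feedback for the model. A simplified stand-in using only the standard library:

```python
import ast

def python_syntax_eval(generated_code: str) -> dict:
    """Check that generated Python at least parses; return a score plus feedback."""
    try:
        ast.parse(generated_code)
        return {"score": 1.0, "feedback": None}
    except SyntaxError as e:
        return {"score": 0.0, "feedback": f"SyntaxError on line {e.lineno}: {e.msg}"}

result = python_syntax_eval("def add(a, b)\n    return a + b")
print(result["feedback"])  # the missing colon shows up here; feed this back to the LLM
```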
But of course, for a lot of use cases you are going to want to configure evaluators to your specific use case and your specific domain. So we also have a few customizable evaluators included in openevals. One of these is an LLM-as-a-judge evaluator, making it really easy to get started with that. A little more interesting are the things around agent trajectories: taking in a trajectory of messages, passing it into one of these evaluators, and customizing what it looks for.
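As a sketch of what a trajectory evaluator might check, here's a toy example that verifies the agent called the expected tools in the expected order; the message format and tool names are made up for illustration, not openevals' actual schema.

```python
# A trajectory is just the ordered list of messages/steps the agent produced.
trajectory = [
    {"role": "user", "content": "Book me a flight to Berlin"},
    {"role": "assistant", "tool_call": "search_flights"},
    {"role": "assistant", "tool_call": "book_flight"},
    {"role": "assistant", "content": "Done! Your flight is booked."},
]

def called_tools_in_order(trajectory: list[dict], expected: list[str]) -> float:
    """Score 1.0 if the agent called the expected tools in the expected order."""
    actual = [m["tool_call"] for m in trajectory if "tool_call" in m]
    return 1.0 if actual == expected else 0.0

print(called_tools_in_order(trajectory, ["search_flights", "book_flight"]))  # 1.0
```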
One of the things that we're launching today is chat simulations. A lot of applications are chatbots, or they have some back-and-forth component. Sure, you can test a single interaction, but you often want to see how the app performs in a conversational setting. So we're launching some utils to both run those simulated conversations and then score them. These are all open source in the openevals package.
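Conceptually, a chat simulation is a loop where one stub plays the user, your app responds, and the full transcript gets scored at the end. The sketch below is just that concept with placeholder functions, not the openevals utilities themselves.

```python
def simulated_user(turn: int) -> str:
    """Stand-in for an LLM playing the user with a goal in mind."""
    return ["I want to return my order.", "It's order ORD-12345.", "Thanks!"][turn]

def my_chat_app(history: list[dict]) -> str:
    """Stand-in for your chat application."""
    return "Got it, I've started the return process."

def judge_conversation(history: list[dict]) -> float:
    """Stand-in for an evaluator (often LLM-as-a-judge) scoring the whole transcript."""
    helped = any("return" in m["content"] for m in history if m["role"] == "assistant")
    return 1.0 if helped else 0.0

history: list[dict] = []
for turn in range(3):  # run a short multi-turn conversation
    history.append({"role": "user", "content": simulated_user(turn)})
    history.append({"role": "assistant", "content": my_chat_app(history)})

print(judge_conversation(history))  # score the conversation, not a single turn
```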
One of the most common techniques we see being used is LLM-as-a-judge evaluators. These can be really powerful, but they're also tricky to get working properly. You now have a separate prompt that you have to worry about: you have to prompt engineer the prompt that is grading your other prompt. So there's extra work that goes into it, and while it's powerful, we oftentimes see people struggling to set it up or to trust the process.
So we're launching, in private preview, some features specifically designed to help with this. First, we're launching some work that's based off of AlignEval, work by Eugene Yan, which was a big influence here, along with a great paper called Who Validates the Validators. A lot of that work we're incorporating into LangSmith to make it really easy to get started with LLM-as-a-judge techniques.
And then, of course, after you get started, how do you know that it's working? So we're also launching some eval calibration features in LangSmith, where you can blindly score how the evals are doing, compare those scores against the judge's, and track that over time. If they start to drift out of whack, you go back into that align-eval step.
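The calibration idea itself is easy to illustrate: collect human labels for a sample of runs, compare them to what the LLM judge said, and watch that agreement rate over time. The numbers and threshold below are made up.

```python
# Pairs of (llm_judge_score, human_label) for the same runs.
pairs = [(1, 1), (1, 1), (0, 1), (1, 1), (0, 0), (1, 0), (1, 1), (0, 0)]

agreement = sum(1 for judge, human in pairs if judge == human) / len(pairs)
print(f"judge/human agreement: {agreement:.0%}")

# If agreement drifts below a bar you trust, go back and re-align the judge prompt.
if agreement < 0.8:
    print("Judge is drifting out of whack; revisit the align step.")
```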
We're really excited to launch this and to keep working on figuring out what the right UX for building these LLM-as-a-judge evaluators is. The thing I want to emphasize is that evals are a continuous journey. You're not done once you build a data set and run it once. You're not done just because you've covered the offline part. And eventually, I think more and more people are going to start building evals into the agents themselves. That's the key takeaway I'd leave you with.