Building Reliable Agents: Agent Evaluations

Hopefully you enjoyed the last hour or so and had a good chance to talk with other attendees, sponsors, or speakers as they were leaving. The way LangChain got started was by going to events like these, so I'd highly encourage you to take advantage of that.

Before this, we heard a bunch of talks about agents and how people are building them, and one theme that was mentioned a few times was evals. For the next series of talks, we're going to deep dive into that. We've got a few speakers who are going to shed a lot of light on how they are using evals. But before that, I wanted to take a little bit of time to set the context for why we think evals are important and some of the things we are doing in the space.
We ran a survey of agent builders about six months ago where we asked them what the biggest blocker was for getting more agents into production. The number one thing they cited, by far, was quality. We talked a little bit with Michele about the trade-offs between quality and latency, and quality is still the top thing blocking people from getting to production.

In order to bridge that gap from prototype to production and increase that quality, one of the techniques we've seen people adopt is eval-driven development: using evals throughout a bunch of different stages of development to measure your app's performance and make sure you're constantly climbing that ladder of performance.
One of the things I want to emphasize is that evals are really a continuous journey. There are three different types of evals that we see people running. Most people are thinking about evals in maybe one, maybe two of these, but we think it's a continuous journey throughout the whole life cycle.
The first type is offline evals. This is probably what most people think of when they hear the word evals. Before you go into production, you get some data set, you run your app against that data set, and then you measure performance using some evaluators to score it. You can track this performance over time, compare different models, different prompts, things like that, and get a sense for whether the changes you're making are actually increasing or decreasing your app's performance on the data set you've constructed.
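To make that concrete, here's a minimal sketch of an offline eval loop in plain Python. The dataset, the run_app function, and the exact-match evaluator are all placeholders for whatever your application and scoring logic look like; tools like LangSmith wrap this kind of loop for you, but the shape is the same.

```python
# A minimal offline eval loop (illustrative only; names are placeholders).

# A small "golden" dataset of inputs and expected outputs.
dataset = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

def run_app(user_input: str) -> str:
    """Stand-in for your agent or LLM application."""
    return "4" if "2 + 2" in user_input else "Paris"

def exact_match(output: str, expected: str) -> float:
    """Simple reference-based evaluator: 1.0 if the output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

scores = []
for example in dataset:
    output = run_app(example["input"])
    scores.append(exact_match(output, example["expected"]))

# Track this number over time as you change prompts or models.
print(f"exact_match: {sum(scores) / len(scores):.2f}")
```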
Of course, that data set isn't perfect, and so there's another type of eval called online evals. This is when you take your app that's running in production, take some subset of the data that's coming into it, and score that. You can then start to track the performance of your app in real time as it's running on real traffic. These are real queries that users are sending in, so it's not a static golden data set. These are the two types of evals that people are most familiar with.
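For the online case, the same pieces show up, but over a sample of live traffic and without a reference answer. Here's a rough sketch; the sampling rate, the stub app, and the logging function are all made-up placeholders, not any particular product's API.

```python
import random

SAMPLE_RATE = 0.1  # score roughly 10% of production traffic

def my_app(user_input: str) -> str:
    """Stand-in for your production agent."""
    return f"Answering: {user_input}"

def reference_free_eval(user_input: str, output: str) -> float:
    """Stand-in for a reference-free evaluator (often an LLM judge in practice)."""
    return 1.0 if output.strip() else 0.0

def log_feedback(user_input: str, output: str, score: float) -> None:
    """Stand-in for sending the score to your observability / eval tool."""
    print(f"score={score} input={user_input!r}")

def handle_request(user_input: str) -> str:
    output = my_app(user_input)
    # Only a sampled subset of live traffic gets scored, to keep cost down.
    if random.random() < SAMPLE_RATE:
        log_feedback(user_input, output, reference_free_eval(user_input, output))
    return output

handle_request("How do I reset my password?")
```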
But we also think there's a third type of eval, which we call in-the-loop evals. These are evals that occur while the agent is running. Michele talked a little bit about some of what they were doing at Replit with this, where they were adding evals based on trying the output out, testing it with browser use, or running code against it. Those are examples of running evals in the loop to correct the agent as it's running. If it messes up, like in any of these scenarios, you can feed that back into the agent and have it self-correct. You can add this in a bunch of different domains and use it to check the agent: it improves the response before the agent sends it, and it can block bad responses. The big downside is that this takes more time and costs more money, so we see it commonly being used when the tolerance for mistakes is really low or when latency is not an issue. And as we see more and more long-running agents, I actually think that's a perfect time to start thinking about putting these in-the-loop evals into your agent.
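Here's a rough illustration of that in-the-loop pattern (not Replit's actual setup): a cheap check runs on each draft response, and failures are fed back to the agent for a bounded number of retries. The agent stub and the JSON check are placeholders.

```python
import json

def agent(task: str, feedback: str | None = None) -> str:
    """Stand-in for your agent; `feedback` carries the evaluator's error message."""
    if feedback is None:
        return "not json"  # first attempt is deliberately bad for the demo
    return json.dumps({"task": task, "status": "done"})

def check_valid_json(output: str) -> str | None:
    """In-the-loop evaluator: return an error message, or None if the output passes."""
    try:
        json.loads(output)
        return None
    except json.JSONDecodeError as e:
        return f"Response must be valid JSON: {e}"

def run_with_in_the_loop_eval(task: str, max_retries: int = 2) -> str:
    feedback = None
    for _ in range(max_retries + 1):
        output = agent(task, feedback)
        feedback = check_valid_json(output)
        if feedback is None:
            return output  # passed the check, safe to respond
    raise RuntimeError("Agent could not produce a valid response")

print(run_with_in_the_loop_eval("summarize the ticket"))
```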
When we think about evals, there are generally two parts: the data that you run over and the evaluators that you use. Across those three types of evals, the data and the evaluators look a bit different. In the offline case, you've got your data set. In online evals, the data is the production data. And for in-the-loop evals, it's also production data, but it's being evaluated in the loop while the agent runs. The evaluators can be different as well. When you have a golden truth data set you can compare against, those are what we call ground truth or reference evals. Reference-free evals are when you don't have this ground truth, and that's what you do online or in the loop, because you don't know what the ground truth is. But no matter what type of eval you're doing, data and evaluators are the two parts.
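In code, the distinction mostly shows up in the evaluator's signature: a reference evaluator receives the expected output, while a reference-free one has to judge the output on its own. A tiny illustrative example (the names and scoring rules are arbitrary):

```python
def correctness(inputs: str, outputs: str, reference_outputs: str) -> float:
    """Reference (ground truth) evaluator: compares against a known answer."""
    return 1.0 if outputs.strip() == reference_outputs.strip() else 0.0

def conciseness(inputs: str, outputs: str) -> float:
    """Reference-free evaluator: judges the output on its own, no ground truth needed."""
    return 1.0 if len(outputs.split()) <= 50 else 0.0
```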
We want to make it easy for people to build data sets and run things over their data, as well as build their own evaluators. One thing we've noticed is that the academic data sets you might see or get started with aren't representative of how users are actually using your application; they're oftentimes not even in the same domain. So when we talk about data and evals, it's really about making it easy for any application developer to build a data set or build evaluators that are specific to their use case.
Traces are what you run these online evaluators over. We track not only the inputs and outputs but all the intermediate steps as well, and you can then run evaluators over these traces for the online evals part. We've also made it really easy to take these runs and start building up a ground truth data set for offline evals. There's a nice little button in LangSmith that you can click, add to data set. It takes the input/output pair, lets you modify the output if needed, and then adds it to a data set. So we've tried to make it really easy for people to build up these data sets in LangSmith, powered by the observability. One of the things we like to say is that great evals start with great observability, and that's how we think about it in LangSmith.
Now let's talk about evaluators for a little bit. There are a few different types of evaluators that we see people using. The first is just using code to evaluate things: exact match, regex, checking whether the output is valid JSON, things like that.
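Checks like these are easy to write with nothing but the standard library; for example (the regex pattern and order-ID format are made up):

```python
import json
import re

def is_exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def contains_order_id(output: str) -> bool:
    # Regex check: does the response mention an order ID like "ORD-12345"?
    return re.search(r"ORD-\d{5}", output) is not None

def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"status": "ok"}'))  # True
```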
But deterministic checks like these oftentimes aren't representative of all the things you want to catch.
So one of the things we see popping up here is LLM-as-a-judge: using an LLM to score the outputs of your agent or LLM application. This is useful because an LLM judge can handle more complex, open-ended things. It has its own challenges, which I'll come back to, but generally the idea of using an LLM to judge outputs is pretty promising.
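A bare-bones LLM-as-a-judge evaluator can look something like the sketch below. The prompt, the model name, and the use of the OpenAI client are illustrative choices on my part, not a prescribed setup; in practice you'd want a more careful rubric and structured output.

```python
from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY set

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Reply with a single number: 1 if the answer is helpful and correct, 0 otherwise."""

def llm_judge(question: str, answer: str) -> float:
    """Reference-free LLM-as-a-judge evaluator (illustrative prompt and model)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    # Naive parsing for the sketch; a real setup would enforce structured output.
    return float(response.choices[0].message.content.strip())

score = llm_judge("What is the capital of France?", "Paris is the capital of France.")
```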
And the third type that we see is just good old human annotation. This can happen live from the user as they're using the app: you can collect thumbs up, thumbs down, send those to LangSmith, and track them there. Or you can have a human go in the background and use something like our annotation queues to score these runs.
One of the things we've been building over the past month or so is a set of open source evaluators, to make it easy to get started. There are a few common use cases you can cover off the shelf, including things like code, RAG, extraction, and tool calling. For code, for example, we have some off-the-shelf utils that will lint Python code or TypeScript code, and then you can take those errors and feed them back into the LLM.
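I won't reproduce the openevals utilities here, but the underlying idea is roughly this: try to parse or lint the generated code, and turn any errors into feedback for the model. A simplified stand-in using only the standard library:

```python
import ast

def python_syntax_eval(generated_code: str) -> dict:
    """Check that generated Python at least parses; return a score plus feedback."""
    try:
        ast.parse(generated_code)
        return {"score": 1.0, "feedback": None}
    except SyntaxError as e:
        return {"score": 0.0, "feedback": f"SyntaxError on line {e.lineno}: {e.msg}"}

result = python_syntax_eval("def add(a, b)\n    return a + b")
print(result["feedback"])  # the missing colon shows up here; feed this back to the LLM
```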
But of course, for a lot of use cases you are going to want to configure evaluators to your specific use case and your specific domain. So we also have a few customizable evaluators included in openevals. One of these is an LLM-as-a-judge evaluator, making it really easy to get started with that. A little more interesting are the things around agent trajectories: taking in a trajectory of messages, passing it into one of these evaluators, and customizing what it looks for.
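As a sketch of what a trajectory evaluator might check, here's a toy example that verifies the agent called the expected tools in the expected order; the message format and tool names are made up for illustration, not openevals' actual schema.

```python
# A trajectory is just the ordered list of messages/steps the agent produced.
trajectory = [
    {"role": "user", "content": "Book me a flight to Berlin"},
    {"role": "assistant", "tool_call": "search_flights"},
    {"role": "assistant", "tool_call": "book_flight"},
    {"role": "assistant", "content": "Done! Your flight is booked."},
]

def called_tools_in_order(trajectory: list[dict], expected: list[str]) -> float:
    """Score 1.0 if the agent called the expected tools in the expected order."""
    actual = [m["tool_call"] for m in trajectory if "tool_call" in m]
    return 1.0 if actual == expected else 0.0

print(called_tools_in_order(trajectory, ["search_flights", "book_flight"]))  # 1.0
```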
One of the things that we're launching today is chat simulations. A lot of applications are chatbots, or they have some back-and-forth component. Sure, you can test a single interaction, but you often want to see how the app performs in a conversational setting. So we're launching some utils to both run those simulated conversations and then score them. These are all open source in the openevals package.
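Conceptually, a chat simulation is a loop where one stub plays the user, your app responds, and the full transcript gets scored at the end. The sketch below is just that concept with placeholder functions, not the openevals utilities themselves.

```python
def simulated_user(turn: int) -> str:
    """Stand-in for an LLM playing the user with a goal in mind."""
    return ["I want to return my order.", "It's order ORD-12345.", "Thanks!"][turn]

def my_chat_app(history: list[dict]) -> str:
    """Stand-in for your chat application."""
    return "Got it, I've started the return process."

def judge_conversation(history: list[dict]) -> float:
    """Stand-in for an evaluator (often LLM-as-a-judge) scoring the whole transcript."""
    helped = any("return" in m["content"] for m in history if m["role"] == "assistant")
    return 1.0 if helped else 0.0

history: list[dict] = []
for turn in range(3):  # run a short multi-turn conversation
    history.append({"role": "user", "content": simulated_user(turn)})
    history.append({"role": "assistant", "content": my_chat_app(history)})

print(judge_conversation(history))  # score the conversation, not a single turn
```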
One of the most common techniques we see being used is LLM-as-a-judge evaluators. These can be really powerful, but they're also tricky to get working properly. You now have a separate prompt that you have to worry about: you have to prompt engineer the prompt that is grading your other prompt. So there's extra work that goes into it, and while it's powerful, we oftentimes see people struggling to set it up or to trust the process.
So we're launching, in private preview, some features specifically designed to help with this. First, we're launching some work that's based off of AlignEval, work by Eugene Yan, which was a big influence here, along with a great paper called Who Validates the Validators. A lot of that work we're incorporating into LangSmith to make it really easy to get started with LLM-as-a-judge techniques.
And then, of course, after you get started, how do you know that it's working? So we're also launching some eval calibration features in LangSmith, where you can blindly score how the evals are doing, compare those scores against the judge's, and track that over time. If they start to drift out of whack, you go back into that align-eval step.
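The calibration idea itself is easy to illustrate: collect human labels for a sample of runs, compare them to what the LLM judge said, and watch that agreement rate over time. The numbers and threshold below are made up.

```python
# Pairs of (llm_judge_score, human_label) for the same runs.
pairs = [(1, 1), (1, 1), (0, 1), (1, 1), (0, 0), (1, 0), (1, 1), (0, 0)]

agreement = sum(1 for judge, human in pairs if judge == human) / len(pairs)
print(f"judge/human agreement: {agreement:.0%}")

# If agreement drifts below a bar you trust, go back and re-align the judge prompt.
if agreement < 0.8:
    print("Judge is drifting out of whack; revisit the align step.")
```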
We're really excited to launch this and to keep working on figuring out what the right UX for building these LLM-as-a-judge evaluators is. The thing I want to emphasize is that evals are a continuous journey. You're not done once you build a data set and run it once. You're not done just because you've covered the offline part. And eventually, I think more and more people are going to start building evals into the agents themselves. That's the key takeaway I'd leave you with.