LangChain Interrupt 2025 Agent Evaluations – Harrison Chase



00:00:00.000 | All right, we're going to go all the way back. This is time travel. It's a feature in LangGraph.
00:00:13.000 | All right. So, there we go. So, we ran a survey of agent builders about six months ago where we asked them what was the biggest blocker for getting more agents into production.
00:00:24.000 | The number one thing that they cited by far was quality. So, we talked a little bit about the trade-offs between quality and latency, and quality is still the top thing in blocking people getting to production.
00:00:37.000 | And in order to kind of like bridge that gap from prototype to production and increase that quality, one of the techniques that we've seen people adopt is eval-driven development.
00:00:48.000 | So, using evals throughout a bunch of different stages of development to measure your app's performance and then make sure that you're constantly climbing that ladder of performance.
00:00:59.000 | And one of the things that I want to emphasize is that evals is really a continuous journey.
00:01:04.000 | So, there's three different types of evals that we see people running. Most people are thinking about evals maybe in one, maybe two of these, but we think it's a continuous journey throughout the whole lifecycle.
00:01:15.000 | So, what do I mean by that? First, let's talk about offline evals. This is probably what most people think of when they say or hear the word evals.
00:01:24.000 | So, this is before you go into production, you get some data set, you run your app against that data set, and then you measure performance using some evaluators to score it, and you can track this performance over time.
00:01:34.000 | You can compare different models, different prompts, things like that, and get a sense for whether the changes you're making are actually increasing or decreasing your app's performance on this data set that you've constructed.
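As a rough illustration of the offline loop described above, here is a minimal sketch using the LangSmith Python SDK; the dataset name, target function, and evaluator are hypothetical, and the exact SDK surface may differ slightly from this.

```python
from langsmith import Client

client = Client()

# Hypothetical app under test: takes dataset inputs, returns outputs.
def my_app(inputs: dict) -> dict:
    answer = "..."  # call your model / agent here
    return {"answer": answer}

# Simple reference-based evaluator scored against the dataset's gold outputs.
def correctness(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"].strip() == reference_outputs["answer"].strip()

# Run the app over a LangSmith dataset and log a scored experiment,
# so different prompts/models can be compared over time.
results = client.evaluate(
    my_app,
    data="my-golden-dataset",      # assumed to already exist in LangSmith
    evaluators=[correctness],
    experiment_prefix="prompt-v2",
)
```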
00:01:46.000 | Of course, that data set isn't perfect, and so there's another type of eval called online eval that you commonly see people doing. This is when you take your app that's running in production, you take some subset of the data that's coming into the app, and you score that.
00:02:00.000 | And so, then you can start to track the performance of your app in real time as it's running on real production data, and so this is real queries that users are sending in, so it's not a static kind of like golden set.
00:02:12.000 | And so, these are the two types of evals that people are most familiar with, but we also think there's a third type of eval, what we call in-the-loop evals.
00:02:19.000 | And so, these are evals that occur during the agent, while it's running.
00:02:23.000 | So, Michele talked a little bit about some of what they were doing at Replit with this, where they were adding some evals based on trying it out and testing with browser use or running code against it.
00:02:34.000 | So, these are examples of running evals in the loop to correct the agent as it's running, and then if it messes up, like in any of these scenarios here, you can feed that back into the agent and have it kind of like self-correct.
00:02:46.000 | And so, you can add this in a bunch of different domains and use it to basically check the agent.
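A minimal, framework-agnostic sketch of that in-the-loop pattern, assuming a hypothetical call_agent callable and a JSON-validity check as the eval:

```python
import json
from typing import Callable

def run_with_inline_check(
    call_agent: Callable[[list[dict]], str],  # hypothetical: takes messages, returns a reply
    user_input: str,
    max_retries: int = 2,
) -> dict:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_retries + 1):
        raw = call_agent(messages)
        try:
            return json.loads(raw)  # the in-the-loop eval: is the reply valid JSON?
        except json.JSONDecodeError as err:
            # Feed the failure back so the agent can self-correct before anything is returned.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"That was not valid JSON ({err}). Please try again."})
    raise ValueError("Agent failed the in-loop check after retries")
```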
00:02:51.000 | And so, this has some obvious benefits.
00:02:53.000 | It improves response quality.
00:02:55.000 | You're not monitoring it after the fact.
00:02:57.000 | It actually improves it before it responds and can block bad responses.
00:03:01.000 | The big downside is that this takes more time and it costs more money.
00:03:05.000 | And so, we see this commonly being used when the tolerance for mistakes is really low or when latency is not an issue.
00:03:13.000 | And as we see more and more long-running agents, I actually think that's a perfect time to start thinking about putting these in-the-loop evals into your agent.
00:03:22.000 | When we think about evals, there's generally kind of like two parts to evals that we see.
00:03:29.000 | the data that you run over, and then the evaluators that you use.
00:03:34.000 | And so, across all three of those types of evals, the data and the evaluators look a little different.
00:03:41.000 | So, in the offline sense, you know you've got your data set.
00:03:44.000 | In the online evals, the data is the production data, and it's happening after the fact.
00:03:49.000 | In the loop, it's the production data, but it's happening in the loop.
00:03:51.000 | And then the evaluators can be different as well.
00:03:53.000 | So, when you have your golden truth data set, you can compare against it.
00:03:58.000 | And so, that's useful.
00:03:59.000 | That's what we call ground truth or reference evals.
00:04:02.000 | And then reference-free evals are when you don't have this ground truth.
00:04:06.000 | And this is what you do online or in the loop because you don't know what the ground truth is.
00:04:10.000 | So, basically, data and evaluators are two parts of evals, no matter what type of eval you're doing.
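To make that distinction concrete, here is a sketch of one evaluator of each kind; the field names are assumptions about the app's output shape:

```python
# Reference (ground-truth) evaluator: needs a gold answer, so it fits offline evals.
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"].strip().lower() == reference_outputs["answer"].strip().lower()

# Reference-free evaluator: only inspects the output itself, so it also works
# online or in the loop, where no ground truth exists.
def within_length_budget(outputs: dict) -> bool:
    return len(outputs["answer"]) <= 500
```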
00:04:17.000 | And so, we want to make it easy for people to build data sets and run things over their data, as well as build their own evaluators.
00:04:27.000 | Because one thing that we've noticed is that all the academic data sets that you might see or get started with, those aren't representative of how users are using your application.
00:04:36.000 | They're often times not even in the same domain.
00:04:39.000 | And so, when we talk about data and evals, it's really about making it easy for any application developer to build a data set or build evaluators that are specific for their use case.
00:04:51.000 | So, how do we help do that on the data side?
00:04:54.000 | One, we've talked about tracing.
00:04:56.000 | Traces are what you run these online evaluators over.
00:04:59.000 | So, we have really good tracing in LangSmith.
00:05:01.000 | You can send everything there.
00:05:02.000 | We track not only the inputs and outputs, but all the steps as well.
00:05:05.000 | And so, you can then run evaluators over these traces for the online evals part.
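For context, instrumenting an app for LangSmith tracing can be as small as the sketch below (assuming the usual LangSmith tracing environment variables and API key are configured); the function bodies are placeholders.

```python
from langsmith import traceable

@traceable  # each call becomes a child step in the trace
def retrieve(question: str) -> list[str]:
    return ["some retrieved context"]  # placeholder retrieval step

@traceable(name="my_agent")  # inputs, outputs, and nested steps are all captured
def my_agent(question: str) -> str:
    docs = retrieve(question)
    return f"Answer based on {len(docs)} documents"  # placeholder generation step
```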
00:05:10.000 | We've also made it really, really easy to add these to a data set and start building up this ground truth data set for offline evals.
00:05:18.000 | So, there's a nice little button in LangSmith that you can click to add to a data set, and it will take this kind of like input-output pair.
00:05:24.000 | You can actually modify the output pair as well and then add it to a data set.
00:05:27.000 | So, we've tried to make it really easy for people to build up these data sets in LangSmith, powered by the observability.
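Alongside that button in the UI, the same thing can be done programmatically; a sketch with the LangSmith SDK, where the dataset name and example are made up:

```python
from langsmith import Client

client = Client()

dataset = client.create_dataset(dataset_name="support-bot-golden-set")
client.create_examples(
    inputs=[{"question": "How do I reset my password?"}],
    outputs=[{"answer": "Go to Settings > Security and choose 'Reset password'."}],
    dataset_id=dataset.id,
)
```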
00:05:35.000 | And so, one of the things that we like to say is that great evals start with great observability.
00:05:41.000 | And that's how we think about it in LangSmith.
00:05:43.000 | They're tied.
00:05:44.000 | They're not separate things.
00:05:46.000 | Now, let's talk about evaluators for a little bit.
00:05:51.000 | So, there are a few different types of evaluators that we see people using.
00:05:55.000 | First is maybe just using code to evaluate things.
00:05:58.000 | So, this would be like exact match or regex or checking if it's valid JSON or things like that.
00:06:03.000 | These are great.
00:06:04.000 | These are deterministic.
00:06:05.000 | They're cheap.
00:06:06.000 | They're fast to run.
00:06:07.000 | But they're oftentimes not as representative of all the things you want to catch, especially if you have natural language responses.
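A couple of examples of what those deterministic, code-only checks tend to look like (the output field name is an assumption):

```python
import json
import re

def is_valid_json(outputs: dict) -> bool:
    try:
        json.loads(outputs["answer"])
        return True
    except json.JSONDecodeError:
        return False

def mentions_ticket_id(outputs: dict) -> bool:
    # Regex check: did the response reference a ticket ID like "TICKET-1234"?
    return re.search(r"TICKET-\d+", outputs["answer"]) is not None
```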
00:06:14.000 | So, one of the things that we see popping up here is using LLM as a judge techniques to use an LLM to score the outputs of your agent or LLM application.
00:06:23.000 | And so, this is useful because they can handle more complex things.
00:06:26.000 | There's some downsides to this.
00:06:27.000 | They're tricky to get working.
00:06:29.000 | We'll talk about this later.
00:06:30.000 | But generally, the idea of using an LLM to judge outputs is pretty promising.
00:06:34.000 | And the third type that we see is just good old human annotation.
00:06:37.000 | This can happen kind of like live from the user as they're using the app.
00:06:41.000 | You can collect thumbs up, thumbs down, send those to LangSmith, and track them there.
00:06:44.000 | Or, you can have a human go in the background and use something like our annotation queues to score these runs.
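Logging that kind of end-user feedback against a trace is roughly a one-liner with the LangSmith SDK; the run ID and feedback key below are placeholders.

```python
from langsmith import Client

client = Client()

# Attach a thumbs-up / thumbs-down signal to the traced run it belongs to.
client.create_feedback(
    run_id="<the traced run's ID>",  # placeholder
    key="user_score",
    score=1,       # e.g. 1 for thumbs up, 0 for thumbs down
    comment="User clicked thumbs up",
)
```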
00:06:51.000 | So, one of the things that we've been building over the past month or so is a set of open source evaluators to try to make it easy to get started with these evaluators.
00:07:03.000 | And so, there are a few use cases that we think are common enough that you can use these off the shelf.
00:07:08.000 | These include things like code, RAG, extraction, and tool calling.
00:07:12.000 | So, for code, for example, we have some off-the-shelf utils that will lint Python code or lint TypeScript code.
00:07:18.000 | And then you can take those errors and feed them back into the LLM.
00:07:21.000 | This is great.
00:07:22.000 | You can use these off-the-shelf.
00:07:23.000 | There's little configuration needed.
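The snippet below is not the openevals utility itself, just a rough stand-in for the idea described above: check that generated Python at least parses, and return the error text so it can be fed back to the model.

```python
import ast

def lint_python(outputs: dict) -> dict:
    code = outputs["code"]  # assumed output field holding generated code
    try:
        ast.parse(code)
        return {"score": True, "comment": "code parses"}
    except SyntaxError as err:
        # The comment can be appended to the conversation so the agent can retry.
        return {"score": False, "comment": f"SyntaxError on line {err.lineno}: {err.msg}"}
```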
00:07:25.000 | But, of course, for a lot of use cases, you are going to want to configure evaluators to your specific use case in your specific domain.
00:07:32.000 | And so, we have a few customizable evals also included in openevals.
00:07:35.000 | One of these is LLM as a judge, making it really easy to get started with that.
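Getting started with one of those LLM-as-a-judge evaluators looks roughly like this; the prompt constant, model string, and example values are recalled from the openevals docs and should be treated as assumptions.

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

correctness_judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,   # prebuilt prompt; can be swapped for a custom one
    model="openai:o3-mini",
    feedback_key="correctness",
)

result = correctness_judge(
    inputs="How do I reset my password?",
    outputs="Go to Settings > Security and choose 'Reset password'.",
    reference_outputs="Reset it from Settings > Security.",
)
print(result)  # e.g. {"key": "correctness", "score": True, "comment": "..."}
```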
00:07:40.000 | A little bit more interesting is things around agent trajectories.
00:07:43.000 | So, taking in a trajectory of messages, passing it into one of these evaluators, and customizing it so you can choose what to look for.
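As a schematic (not the openevals implementation), a trajectory evaluator can be as simple as checking that the agent called the expected tools in order:

```python
def called_tools_in_order(trajectory: list[dict], expected: list[str]) -> bool:
    called = [
        call["name"]
        for msg in trajectory
        for call in msg.get("tool_calls", [])
    ]
    return called == expected

messages = [
    {"role": "user", "content": "What's the weather in SF?"},
    {"role": "assistant", "tool_calls": [{"name": "get_weather", "args": {"city": "SF"}}]},
    {"role": "tool", "content": "68F and sunny"},
    {"role": "assistant", "content": "It's 68F and sunny in SF."},
]
print(called_tools_in_order(messages, ["get_weather"]))  # True
```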
00:07:50.000 | And then one of the things that we're launching today is chat simulations.
00:07:53.000 | So, a lot of applications are chat bots, or they have some back and forth component.
00:07:58.000 | And so, sure, you can test a single kind of like interaction, but you often want to see how it performs in a conversational setting.
00:08:04.000 | And so, we're launching some utils to both run and then score those simulations.
00:08:09.000 | These are all open source in the openevals package.
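The real simulation helpers live in the openevals package; purely to illustrate the shape of the idea, a multi-turn simulation loop might look like this, where app and simulated_user are hypothetical callables:

```python
from typing import Callable

Messages = list[dict]

def simulate_chat(
    app: Callable[[Messages], str],             # the chat bot under test
    simulated_user: Callable[[Messages], str],  # e.g. an LLM playing a user persona
    opening: str,
    max_turns: int = 5,
) -> Messages:
    history: Messages = [{"role": "user", "content": opening}]
    for _ in range(max_turns):
        history.append({"role": "assistant", "content": app(history)})
        history.append({"role": "user", "content": simulated_user(history)})
    return history  # score the whole transcript with an evaluator afterwards
```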
00:08:12.000 | One of the most common techniques we see being used is LLM as a judge evaluators.
00:08:18.000 | These can be really powerful, but they're also tricky to get working properly.
00:08:22.000 | You now have a separate prompt that you have to worry about.
00:08:24.000 | You have to prompt engineer this prompt, which is grading your other prompts.
00:08:28.000 | And so, there's this extra work that goes into it.
00:08:30.000 | And so, while this is powerful, we oftentimes see people struggling to set it up or to trust the process.
00:08:37.000 | And so, we're launching in private preview some features specifically designed to help with this.
00:08:42.000 | So, first, we're launching some work that's based off of AlignEval, which is some research by Eugene Yan.
00:08:48.000 | You'll hear from Shreya later on. She was actually a lot of the influence for some of this work.
00:08:53.000 | She wrote a great paper called "Who Validates the Validators?"
00:08:56.000 | And so, a lot of this work we're incorporating into LangSmith to make it really easy to get started with LLM as a judge techniques.
00:09:04.000 | And then, of course, after you get started, how do you know that it's working?
00:09:07.000 | So, we're launching some eval calibration techniques in LangSmith where you can blindly score runs yourself
00:09:13.000 | and then compare those against the evaluator's scores and see that over time. And if they start to drift out of whack, then you go back into this AlignEval kind of like step.
00:09:20.000 | So, this is in private preview. We're really excited to launch it and to work on figuring out what the right UX for building these LLM as a judge evaluators is.
00:09:30.000 | The thing that I want to emphasize is that evals is a continuous journey. You're not done with it once you build a dataset and run it once.
00:09:37.000 | You're going to want to keep on running it. You're not done with it just because you did it on the offline part. You're going to want to do it online.
00:09:43.000 | And eventually, I think more and more, we're going to start building it into the agents themselves.
00:09:47.000 | And so, this is one of the key takeaways that I kind of leave you with.
00:09:51.000 | With that, I want to hand it over to some of the amazing speakers that we have.
00:09:55.000 | Up next will be Ben Leval, VP of Engineering at Harvey.
00:10:00.000 | Ben will discuss how Harvey builds agents and then uses LangSmith to evaluate them.