
Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize



00:00:00.000 | All right. Well, let's get started. We don't have much time, but I hope your conference is going
00:00:18.780 | well. Welcome to the AI Engineer World's Fair. My name is Dat Ngo. Today's talk is about LLM eval
00:00:25.960 | pipelines. So I never know what I want to talk about until I get into the room, so I
00:00:30.580 | don't prep too hard, but by show of hands, who here has built an agent? Just raise your
00:00:36.260 | hand. Okay. Who here has run an eval? Right? Who here has productionized an AI product?
00:00:45.160 | Nice. Okay. Some technical builders. Let's get technical then. So my name is Dat. I'm an AI
00:00:54.280 | architect at Arize AI. This is Mochi and Latte. They're dogs of my friends. I figured, let's keep
00:01:02.520 | it spicy and interesting. But I've been building observability and evals since day zero. So since
00:01:08.620 | the first, I don't know if you guys know what Arize AI is, but we are the largest AI evals player in the
00:01:15.460 | space. So observability, evals, and beyond. We work really heavily with real use cases. So folks like
00:01:22.160 | Reddit, folks like Duolingo. So we work across the best AI teams and we have a really unique business.
00:01:28.400 | Being on the observability side, we get to see what everyone is building, how they're tackling those
00:01:35.300 | problems, what are their biggest pains, and what are kind of the tips and tricks that they use to really
00:01:40.820 | productionize these things. And just to give you a hint, Duolingo has massive eval scale. They tend to run
00:01:47.160 | about 20 evals per trace. So they end up spending quite a fair amount doing evals, understanding their
00:01:55.400 | evals, optimizing them. And the last thing about me, I have a huge passion for the AI community. When I was in SF
00:02:01.000 | the last five years, I really loved to go to pretty much every single event that I could. I'm not a developer
00:02:06.520 | advocate. I'm an engineer by trade, but I just love the community. So yeah, this is a little bit about
00:02:13.240 | Arize, but I think I don't want to keep it too salesy. I just want to keep it pretty technical.
00:02:16.440 | So really three concepts that I think everybody should be familiar with and where evals really sit
00:02:21.800 | in the space is really as simple as this. This is what I teach all my customers. Really, the first thing
00:02:27.320 | is observability. I think you guys have kind of seen this before. Observability really just answers the
00:02:32.120 | question of what is the thing that I built actually doing, right? To some people, it may be traces,
00:02:37.240 | traces and spans. I'll show you a little bit, you know, keeping it platform agnostic. Just think about the concepts.
00:02:45.160 | But traces might be one area for people. So traces represent, hey, what's happening? Can I look at
00:02:51.080 | things? To an AI engineer, that makes a lot of sense. Let's say you're an AI PM, maybe not super technical,
00:02:57.000 | or maybe you want to think about things differently. Maybe you want to look at,
00:03:01.080 | hey, what are the conversations that are happening? Turns out you can run evals at these levels.
00:03:05.000 | We'll get into that in depth kind of later. You know, signal and observability come in
00:03:11.240 | different kinds of flavors and forms. Maybe it's analytics. What we're starting to really realize is
00:03:16.200 | that LLM teams are getting split into two special niches. There's platform teams, right? And they own
00:03:23.320 | things like the infrastructure. So who here has heard of a model gateway router? It's like an interface
00:03:29.240 | pattern with all the models behind it, right? Well, it turns out the central LLM platform team tends to
00:03:33.960 | own that. They care about costs, latency, things like that. And then you have the other LLM teams.
00:03:39.240 | These LLM teams sit on the outer side of the business, so like a hub and spoke.
00:03:45.240 | They work for the business side. So these are like the people building the applications to help the
00:03:49.880 | business. So if anyone here comes from like the ML or data science space, it's actually not far from
00:03:55.320 | that. And so different teams care about different metrics. So maybe if you're an AI PM sitting on the
00:04:01.480 | business side, you care about evals. If you care about, you know, the platform, maybe you care about
00:04:06.040 | costs, latency, things like that. But TLDR, observability represents what's happening.
00:04:15.400 | And now evals are really important in this space because the reality of the fact is if you've ever
00:04:19.560 | seen a trace or something like that, you're not going to inspect every single trace manually, right?
00:04:25.880 | It is not scalable for you, the AI engineer, or you, the AI PM, to look through these things. So what are evals used
00:04:33.080 | for? It's actually just a really clever word for signal. You're just trying to understand what's going well
00:04:37.880 | and what's not going well. So I'm not here to sell you on like evals. I think everybody knows how important they are.
00:04:43.000 | But if you think evals are LLM as a judge only, there's actually a lot of other tools that
00:04:49.400 | you're missing. So LLM as a judge, raise your hand if you used LLM as a judge. Okay, about half the room.
00:04:55.000 | It's super great. You use an LLM to give you feedback on any process, including an LLM process.
00:05:02.520 | So if you're doing RAG, this is a really good way to think about RAG in terms of evals. RAG would be like,
00:05:08.440 | hey, user has a question. We retrieve some context to be able to possibly answer that question.
00:05:14.840 | And then we generate an answer. It turns out every arrow on this is actually an eval you can run. So,
00:05:21.320 | hey, I retrieve some context and I want to compare that to the query being asked. Well,
00:05:25.720 | that's RAG relevance. It's like, is the thing that I returned even helpful in answering the question?
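As a rough illustration, a minimal LLM-as-a-judge sketch for that RAG relevance check (this is not the Arize API; the OpenAI client, model name, and prompt wording are illustrative assumptions):

```python
# Minimal LLM-as-a-judge sketch for RAG relevance (illustrative, not the Arize API).
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are grading the retrieval step of a RAG pipeline.
Question: {question}
Retrieved context: {context}
Is the retrieved context relevant to answering the question?
Answer with exactly one word: relevant or irrelevant."""

def rag_relevance_eval(question: str, context: str) -> str:
    """Returns 'relevant' or 'irrelevant' as judged by the LLM."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # any judge model; an assumption for the sketch
        temperature=0,         # keep grading deterministic
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, context=context),
        }],
    )
    return response.choices[0].message.content.strip().lower()
```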
00:05:30.840 | And so LLM as a judge is great. It's super helpful. You know, I think most people understand why it
00:05:36.760 | works, but there's a whole research area on why they're really good indicators. The original task
00:05:43.000 | is not the eval task, right? So if I asked you, a human, hey, generate me a summary of something long
00:05:49.080 | and complex, like the book War and Peace, that's a very different task than say, hey, I wrote this summary
00:05:55.000 | for you. Is it a good one or is it a bad one? But LLM as a judge is a small part. It doesn't always
00:06:02.360 | have to be a large language model or autoregressive model. Things like encoder-only BERT-type
00:06:08.120 | architectures are super helpful. They're about 10 times cheaper, about one or two orders of magnitude
00:06:14.520 | faster to run that eval. But, you know, you don't just have LLMs at your disposal. You, a human,
00:06:20.760 | also are a really good way to discern signal. So it turns out evals can also come in the form of,
00:06:25.320 | is your user having a good or bad experience? So for those people who have productionized some sort
00:06:29.800 | of LLM application, do you guys have user feedback? Raise your hand if you've implemented user feedback.
00:06:34.360 | And, okay, about 30%. It's actually incredible signal. So that comes from a human. Obviously,
00:06:41.640 | you yourself can also generate labels on stuff. So has anyone here heard of a golden data set?
00:06:46.760 | Raise your hand. Okay. Most of the room. And the way I encourage folks to think about this,
00:06:52.040 | this is a pro tip, actually. The first column represents scale. So LLM as a judge is valuable
00:06:57.240 | because I don't have to grade it myself, right? But let's say I don't necessarily trust it off the bat.
00:07:02.120 | Use the third column to help you out. So a golden data set represents quality. So you yourself graded it.
00:07:09.000 | You know that it's what you expected, you know. Well, it turns out you can run your LLM as a judge
00:07:14.600 | on a golden data set. What you're trying to do is say, hey, can the LLM approximate the thing that I
00:07:20.920 | trust? Right? And what that allows you to do is to actually quantify and tune your LLM as a judge.
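A minimal sketch of that quantification step, assuming a hand-labeled golden dataset and a judge function like the one sketched above (names are illustrative):

```python
# Sketch: measure how well the LLM judge approximates human ("golden") labels.
# `rag_relevance_eval` is the judge function from the sketch above; the golden
# rows are hand-graded examples you trust.

golden = [
    {"question": "...", "context": "...", "label": "relevant"},
    {"question": "...", "context": "...", "label": "irrelevant"},
    # ... more hand-labeled rows
]

def judge_agreement(golden_rows, judge_fn) -> float:
    """Fraction of golden rows where the judge matches the human label."""
    hits = sum(
        judge_fn(row["question"], row["context"]) == row["label"]
        for row in golden_rows
    )
    return hits / len(golden_rows)

# agreement = judge_agreement(golden, rag_relevance_eval)
# If agreement is low, tune the judge's prompt template and re-run this check.
```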
00:07:26.360 | So we'll go over that in a second. But strong pro tip. Most really strong LLM teams in the world
00:07:31.560 | kind of do this today. And it turns out we don't always have to use an LLM or a human. You can use
00:07:37.800 | what are called like heuristics or code-based logic. So I'm going to take you into the platform a little
00:07:42.040 | bit to talk through it. But in our platform, you have a way to run evals. Great. What code evals
00:07:47.720 | actually are is just much cheaper checks. So I'll just run a little test here. But, you know, let's say
00:07:55.320 | you want to say, hey, does this output contain any keywords? I don't need to use an LLM or a human for that.
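For instance, a minimal sketch of these kinds of code-based checks in plain Python (keyword, regex, and JSON-parsability checks; function names are illustrative, not a platform API):

```python
# Sketch of code-based evals: cheap deterministic checks, no LLM or human needed.
import json
import re

def contains_keywords(output: str, keywords: list[str]) -> bool:
    """Does the output mention every required keyword?"""
    return all(kw.lower() in output.lower() for kw in keywords)

def matches_pattern(output: str, pattern: str) -> bool:
    """Does the output match a regex pattern (e.g. an expected ID format)?"""
    return re.search(pattern, output) is not None

def is_parsable_json(output: str) -> bool:
    """Is the output valid, parsable JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```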
00:08:00.280 | I can just use code. It's infinitely cheaper, faster to run. Does this match this regex pattern,
00:08:06.520 | XYZ? Is this a parsable JSON? So the reality of it is you have this kind of large toolbox
00:08:12.360 | in your kind of eval set. So when you say evals, don't just think of LLM as a judge. There's a whole
00:08:18.120 | other set of smarter things that you can use. They're actually more cost effective. And so really, you know,
00:08:24.280 | this is a really good way to emphasize the value, to the AI engineer, of evals and
00:08:29.720 | observability. Most people understand this left-hand circle, this purple one. It actually
00:08:35.480 | represents what we all want to do. And it's like, hey, build a better AI system, right? So what you do
00:08:39.560 | is you collect data, observability, traces, things like that. Then you run some evals to say, hey,
00:08:45.560 | did this process go well or did this process not go well? So you're discerning signal from that,
00:08:50.360 | you know, mass of data. You'll actually collect the areas where things went right or wrong,
00:08:55.480 | right? And you'll say, hey, turns out we hallucinated on this. It's because our RAG strategy
00:09:00.200 | is off or the agent is off, for example. You'll also annotate data sets as well, just to double check
00:09:05.960 | that those evals are correct. And then, of course, you always come back into your platform and you,
00:09:11.240 | you know, you update the prompt template, right? You change the model because it wasn't good enough,
00:09:16.280 | or you update the agent orchestration. So everybody understands that left-hand circle.
00:09:20.520 | Now, a lot of people actually forget about the right-hand circle. And so it turns out,
00:09:25.480 | the first time you run evals, what you'll quickly realize is that they're not perfect, right? You
00:09:33.160 | actually have to tune those evals over time. So the way you collect signal actually adjusts as your
00:09:38.120 | application also, you know, gets better. And so what I mean by that is that process of running evals,
00:09:43.960 | what you might notice if you annotate some of them is that the eval said something hallucinated or
00:09:48.760 | wasn't correct, and it actually was, or vice versa. So what you actually need to do is collect a set of
00:09:54.840 | those failures, right? Say, hey, this is where the eval was wrong. And you'll know that by annotating some
00:10:00.440 | data every now and again. And then you'll want to improve the eval prompt template, right? Because the
00:10:06.120 | way you collect signal at first, you'll quickly realize it's either too obscure, too vague, not specific
00:10:11.720 | enough. So these are the kind of two virtuous cycles that you really want to get through very quickly.
00:10:17.640 | And the way I describe it to AI engineering teams is, if you want to build like a quality AI product,
00:10:23.080 | think about velocity. So the faster you iterate through stuff, if I can get through four iterations
00:10:29.320 | in a month rather than two, you're going to exponentially have a better AI product as you
00:10:33.240 | build. And so when we talk about architectures and things like that, when the industry first started,
00:10:39.400 | this was state-of-the-art routers, right? Am I right? So routers are made up of like components.
00:10:46.120 | This is a really dumb example of booking.com's trip planner. So booking, you know, they're one of our
00:10:51.640 | largest customers. Trip planner is basically a travel agent in LLM form. It drives revenue for that
00:10:57.480 | company. It helps you book, you know, it'll book your flights, your hotels. It'll give you an itinerary.
00:11:02.920 | And so, you know, when we think about evals, evals can be as complex as the application itself.
00:11:10.280 | So in kind of older architectures, where there's things like routing, you can eval individual
00:11:17.160 | components. I think most people get this when you're looking inside of a trace, for example.
00:11:21.160 | Maybe I want to eval a specific component or trace. So this one LLM call, right?
00:11:29.400 | But remember that you can zoom out too. So it doesn't
00:11:34.840 | have to just be this one specific component. Let's say this one component is part of an agent or a
00:11:39.400 | workflow. Maybe I just want to evaluate the input output of that larger workflow. So that larger workflow
00:11:45.560 | is made up of LLM calls, API calls, right? I have to find actual flights, actual hotels that have vacancy,
00:11:51.640 | maybe some heuristics. And then you can zoom out a little bit more. Maybe you want to eval things like
00:11:58.440 | the way control flow happens. It's a really important component. If you have components
00:12:02.920 | in your AI agents that have control flow in them, it actually makes way more sense to eval your control
00:12:08.440 | flow first. And you have conditional evals, meaning if you didn't get the control flow right, why eval
00:12:14.840 | anything down the line? Because it's probably wrong, right? So save yourself some money and some costs.
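A minimal sketch of that conditional pattern, with the routing eval and downstream evals passed in as hypothetical functions:

```python
# Sketch of a conditional eval: grade the control-flow (routing) decision first,
# and only spend money on downstream evals when that decision was correct.
# `route_eval` and `downstream_evals` are hypothetical callables you supply.

def evaluate_trace(trace, route_eval, downstream_evals: dict):
    """route_eval: fn(trace) -> bool; downstream_evals: {name: fn(trace) -> score}."""
    results = {"routing_correct": route_eval(trace)}
    if results["routing_correct"]:
        # Downstream evals only run if the control flow was right;
        # otherwise everything after the routing step is presumed wrong anyway.
        for name, eval_fn in downstream_evals.items():
            results[name] = eval_fn(trace)
    return results
```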
00:12:21.000 | So you can think about conditional evals as well. And then of course, we have things like people want to run
00:12:26.680 | evals at the highest level. So imagine you have a back and forth. So we call this a session in our
00:12:32.520 | platform. But the whole idea is, you know, a session is made up of a series of traces. So you can imagine
00:12:38.440 | there's a back and forth between you and your agent. I just want to understand, hey, at any point was the
00:12:43.560 | customer frustrated? Was the customer XYZ? So when you start to think about evals, there's no one-stop shop.
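A minimal sketch of a session-level eval like that frustration check, assuming the OpenAI client and an illustrative prompt:

```python
# Sketch of a session-level eval: stitch the turns of one session together and
# ask a judge whether the user ever got frustrated. Prompt/model are illustrative.
from openai import OpenAI

client = OpenAI()

def session_frustration_eval(turns: list[dict]) -> str:
    """turns: [{"role": "user" | "assistant", "content": "..."}] for one session."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    prompt = (
        "Here is a conversation between a user and an AI assistant:\n\n"
        f"{transcript}\n\n"
        "At any point, did the user appear frustrated? "
        "Answer with exactly one word: frustrated or not_frustrated."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()
```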
00:12:50.520 | If anybody says this is how you should do evals, and they never asked you about how your application
00:12:55.560 | works, you probably shouldn't trust them. Also, I have a hot take. And my hot take is that don't use
00:13:00.760 | out-of-the-box evals. If you use out-of-the-box evals, you'll get out-of-the-box
00:13:04.760 | results. So really customize them very heavily. It's something that we've learned really from some of the best
00:13:10.360 | teams in the world. Okay, let me come here. Then you have complexity. This is our own architecture
00:13:17.240 | for our AI co-pilot. We built an AI whose one purpose is to troubleshoot, observe, build evals
00:13:24.280 | for your AI system. It obviously takes advantage of our platform. But, you know, the reason why we go
00:13:28.760 | this route is take us forward five, ten years from now. Do you guys really think that you, a human,
00:13:34.760 | are going to be the ones who are evaluating all these AI systems, like, mainly? What do you think would
00:13:39.960 | actually take your place? It's probably going to be an AI that evaluates future AI. So this is our
00:13:45.320 | first iteration on this stuff. We're super excited about it. We, you know, it's been out for a year.
00:13:50.520 | It's getting better and better. But maybe I'll show you a little bit of the workflows that we have
00:13:54.280 | in our platform really quickly. Who here is working with agents? Okay, who here is interested in, like,
00:14:01.800 | agent evaluation? Okay, let's cover that then. Let's see. I'll show you. We'll show you how the industry is
00:14:06.680 | doing agent evals. So with agent evals, things get, like, way more complex, right? The calls are longer.
00:14:13.160 | When you look at your traces, they're much longer. I'll actually show you our agent traces. So this is
00:14:20.680 | one that kind of failed. But our agent trace kind of works like this. So Copilot works like this. It
00:14:25.720 | basically, based off what you say and where you're at in the platform, there's agents that kind of can do
00:14:31.560 | things. And it has tools. Each agent has access to a set of tools that it's particularly good at.
00:14:37.240 | So the whole idea is that, yes, we can see what each individual trace is doing, right? We can say,
00:14:45.000 | hey, what's happening in this particular area? We can look at the traces. But the reality of what people
00:14:50.920 | are actually asking in the space is not, is my AI agent good or bad? What they're actually asking is,
00:14:57.160 | what are the failure modes in which my agent fails, right? And so what I mean by that is,
00:15:02.600 | you can look at one individual trace in the graph view of it. But the reality is you want to understand
00:15:09.560 | and discern the signal with your entirety of your AI agent. So what is the, like, what does the pathing
00:15:15.960 | look like across all of that particular AI agent's calls? So for instance, if it had access to 10 tools,
00:15:24.040 | maybe you want to answer questions like, how often did it call a specific tool, right? What were the evals
00:15:29.800 | in a specific path? So in our agent graph, for example, it's framework agnostic. So whether you use
00:15:36.440 | LangGraph, whether you use CrewAI, whether you use your hand-rolled code, this is an agnostic way
00:15:43.640 | to look at how an agent's pathing performs across the aggregate traces. And so this helps you understand,
00:15:51.560 | okay, if my agent hits component one, then two, then three, my evals look great. But for some reason,
00:15:58.360 | when we hit component four, then two, then three, our evals are dropping. And the reason is why? Well,
00:16:04.680 | oh, it turns out component four had a dependency, right, on component three, and it needs that dependency in order
00:16:10.520 | to perform. And so when you think about the complexity of agent evals, you need kind of
00:16:16.360 | the ability to see across not one instantiation, but all of them. You need to understand the distribution
00:16:21.640 | of what's happening. And so when we think about evals across agents, that's one way you can think about it.
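A minimal sketch of that aggregate, path-level view, assuming traces have been exported with a component path and an eval score per trace (column names are illustrative):

```python
# Sketch: analyze eval results by agent path across many traces, not one at a time.
# Assumes traces were exported with a component path and an eval score per trace.
import pandas as pd

traces = pd.DataFrame([
    {"trace_id": "t1", "path": ("start", "component_1", "component_2", "component_3"), "eval_score": 1.0},
    {"trace_id": "t2", "path": ("start", "component_4", "component_2", "component_3"), "eval_score": 0.0},
    # ... thousands more rows exported from your observability platform
])

# Which paths through the agent drag eval scores down, and how often do they occur?
by_path = (
    traces.groupby("path")["eval_score"]
          .agg(["mean", "count"])
          .sort_values("mean")
)
print(by_path)
```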
00:16:27.880 | And then maybe an easier way to kind of think about it too is trajectory. We're thinking about trajectory
00:16:34.680 | evals. So imagine for a second, you have this specific input, right? And the input is like, you know,
00:16:41.320 | hey, find me these hotels at TripPlanner. And you know you should hit this component, then that component,
00:16:47.240 | this other component. So in this case, it's like start agent, tool agent. You might have a golden data
00:16:52.360 | set, like very similar to how we have golden data sets for LLM as a judge, but this is for trajectories.
00:16:57.000 | So I expect us to be able to hit at least three or four of these components, for example. So the reference
00:17:01.960 | trajectory is kind of mentioned, like I need to hit these components. Then you get to do
00:17:07.240 | one of two things. Either one, you can pass in like, here's what we did, here's what we expected,
00:17:12.440 | into an LLM. And then an LLM can actually grade the trajectory.
00:17:17.080 | You can also just say, hey, did we explicitly hit these exact like trajectory strings? Great.
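A minimal sketch of those trajectory checks: a strict exact match and a looser check that the required components were hit in order (the LLM-graded variant would simply pass both lists into a judge prompt):

```python
# Sketch of trajectory evals against a reference ("golden") trajectory.
# Two code-based checks; the LLM-graded variant would pass both lists to a judge.

def exact_trajectory_match(actual: list[str], reference: list[str]) -> bool:
    """Did the agent hit exactly the expected components, in order?"""
    return actual == reference

def hits_required_steps(actual: list[str], required: list[str]) -> bool:
    """Looser check: were all required components visited, in relative order?"""
    remaining = iter(actual)
    return all(step in remaining for step in required)

# hits_required_steps(["start_agent", "router", "tool_agent", "respond"],
#                     ["router", "respond"])  # -> True
```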
00:17:21.160 | But you don't always need a ground truth for that. You can start to get creative here. You can say, hey,
00:17:25.560 | you know, here's this process that I expected to hit. Do these nodes and the descriptions of these nodes
00:17:32.120 | match the correct trajectory, for instance? And then maybe you could do things like, we're kind of
00:17:38.200 | playing around with this, but maybe here's the trajectory that I hit. Here's the paths that
00:17:43.480 | are even possible. Did I do well in these specific areas, right? And you can pass in the pathing as a
00:17:49.240 | series of like, nested key value pairs, for example. LLMs are pretty good at that. But we start to think
00:17:53.960 | about, you know, agent evals. You know, the eval space is already complex enough. And what we're seeing
00:18:00.200 | is even more complexity. But hopefully that makes sense. I'll pause here. Hopefully that makes sense,
00:18:06.920 | but usually I like to make time for questions at the end to keep this pretty interactive. Hope that's okay,
00:18:11.800 | team. But any questions? Does this make sense?
00:18:14.520 | Cool. No questions? Oh, yeah. Go ahead.
00:18:19.560 | Most of the evals you're talking about are kind of like after the fact, right? Is there a way that you
00:18:23.960 | can use evals sort of like to flow as a pattern? Like that's clearly a hallucination.
00:18:29.080 | Yeah. Incredible question. So a lot of people, so there's evals that can be,
00:18:36.600 | you know, some people call them offline or online. I like to say is like, is it in the path? Is it in
00:18:41.560 | orchestration or out of orchestration? So for some people, there's a cost to in orchestration evals,
00:18:48.120 | and the cost is things like latency, right? Some people might call those a guardrail too. Like,
00:18:52.040 | hey, can I continue or not continue? And so there's pros and cons to everything. I think
00:18:57.080 | when it comes to guardrails, this is something I kind of coach my customers. The way to think about
00:19:02.440 | guardrails in general is you have system one. System one is your orchestration system. It's what you built.
00:19:08.760 | It's your prompts. It's everything else. System two is your guardrail system, right? Guardrails are
00:19:14.200 | really nice because they mitigate risk, right? But there is a cost and the cost is maybe it's latency
00:19:20.440 | in your user's experience. You can get around that by doing smart things like maybe embeddings guardrails.
00:19:25.640 | They're, you know, two orders of magnitude shorter. But a lot of people don't think about the other two
00:19:31.160 | cons here. The other con is complexity. Two systems is complex, especially when one system checks in with
00:19:36.760 | the first. The third thing is that a lot of people mistake guardrails as, like, the thing that needs
00:19:43.640 | to be adjusted. A lot of people will go to their guardrails first, like, oh, I need to adjust my
00:19:47.320 | guardrails. The reality is you need to adjust system one. That's the root cause, right? Your guardrails
00:19:52.760 | are really there to protect you. And then maybe the last thing I'll say too is guardrails are not infallible.
00:19:58.600 | They kind of act like unit tests. They're for known knowns, right? Whereas observability plus evals cover the unknowns,
00:20:05.640 | because the reality is you don't know the distribution of what you're going to see
00:20:09.000 | until you get there, right? Ask anybody who's built in the LLM space. Their users are just crazy.
00:20:14.520 | And so that's the difference. And I really caution people because people are like, oh, I need to fix my
00:20:20.520 | guardrail. No, no, go fix the prompt first and then worry about your guardrails. But yeah, inline,
00:20:26.840 | we call those inline evals. Some people call them guardrails. But really, do you do the evals in the
00:20:31.320 | orchestration or outside of it? And so you can, there's pros and cons. So there's no right or wrong
00:20:36.760 | answer there. But good question. Yeah, so when we have a complex system that is typically taking a long
00:20:44.760 | time to run and we have timeouts in this. And I know you'll have something called a span that limits
00:20:51.000 | what kind of view we're taking to a complex agent. So if we have like a complex system that's really
00:20:59.400 | going to take time and then there's an asynchronous way, is there support to manage something like that
00:21:04.920 | and evals across the whole system that we have? Oh, yeah. Amazing question. So who here has ever
00:21:11.720 | heard of OTEL? Open telemetry. Okay, even less than, okay. One of the most important things to our
00:21:17.880 | enterprise customers is being on open telemetry. So you know how I said LLM teams are being split into
00:21:23.640 | two? Well, it turns out LLM applications are also being split across services. And so the idea is like
00:21:29.240 | people want to understand. So maybe an asynchronous process in one service or one Docker container,
00:21:34.760 | or, you know, one Kubernetes pod. OTEL propagation is a great way to get around that,
00:21:41.480 | meaning you can have a process like application A sends data to my model router, right? And then that
00:21:48.440 | comes back to application A. Then application A hits application B for some reason. And then it comes back
00:21:53.560 | to A. When you're actually creating those traces, you want to be able to see all that work, right? You don't
00:21:58.120 | want to just instrument one particular thing. You want to see the work across services. And so OTEL is
00:22:03.960 | an incredible pattern for that. It's a solved problem. So that's why we at Arise two and a half years ago,
00:22:09.240 | when this crazy time started for all of us, we made a bet to be OTEL first. And it's really paid off.
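A minimal sketch of that OTel context propagation pattern between two services, using the standard opentelemetry-api inject and extract calls (exporter setup omitted; the service URL is illustrative):

```python
# Sketch of OTel context propagation so one trace spans two services.
# inject/extract are standard opentelemetry-api calls; exporter setup is omitted
# and the service URL is illustrative.
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("demo")

# Service A: start a span and inject the trace context into outgoing HTTP headers.
def call_service_b(payload: dict):
    with tracer.start_as_current_span("service_a.handle_request"):
        headers = {}
        inject(headers)  # adds the W3C traceparent header
        requests.post("http://service-b/do-work", json=payload, headers=headers)

# Service B: extract the incoming context so its spans attach to the same trace.
def handle_request(incoming_headers: dict):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("service_b.do_work", context=ctx):
        pass  # the actual work happens here
```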
00:22:14.760 | Yeah. Yeah.
00:22:17.080 | Okay, so confidence scores on evals, right? Yeah, I think it depends where you're getting your
00:22:36.920 | eval. If it's from an auto-regressive model, companies like OpenAI have actually exposed the log
00:22:41.800 | prob. So the log probability is like a pseudo-confidence, and since you're returning
00:22:47.560 | only one token, and that token is the eval label, log prob is a really good way for those
00:22:52.920 | auto-regressive models. If you're using things like small language models and encoder-only models,
00:22:58.440 | they come with a probability of the classification. But really, yeah, it's tough. But you have a bunch
00:23:05.480 | of tools in your toolbox, and you generally use them together to discern where things go well or
00:23:09.400 | not well. But log prob, if you're using a model provider that exposes the log prob, is a really
00:23:14.920 | good way to start for auto-regressive models. Okay, last question, and then we're out of time. Yeah?
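As an aside, a rough sketch of the log-prob approach described above, assuming the OpenAI chat completions API and a single-token eval label:

```python
# Rough sketch: read the log probability of a single-token eval label as a
# pseudo-confidence, assuming the OpenAI chat completions API.
import math
from openai import OpenAI

client = OpenAI()

def eval_with_confidence(judge_prompt: str) -> tuple[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=1,        # the label should be a single token, e.g. "good" / "bad"
        logprobs=True,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    choice = response.choices[0]
    label = choice.message.content
    token_logprob = choice.logprobs.content[0].logprob
    return label, math.exp(token_logprob)  # probability of the emitted label token
```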
00:23:20.520 | Hey, do you have anything in your plans, like going forward, like how to shorten the loop between
00:23:26.360 | customer feedback and automatically improving the prompts, and also having, like, you know, the development team
00:23:31.560 | effort work at it? Oh, good question. Yeah, we want to automate in that area, definitely. So,
00:23:36.200 | who here has heard of DSPy? All right, okay. If you didn't raise your hand on any of this,
00:23:41.480 | I hope you learned a bunch. DSPy obviously has something like MIPRO. MIPRO is an optimizer:
00:23:46.440 | you give it like 30 inputs, 30 outputs, and then it basically creates less fragile prompts that span
00:23:51.880 | across different models. Like, it works for OpenAI, and then it works for Gemini, et cetera.
00:23:56.920 | In terms of like auto-optimization, yeah, I think we have the ability to, or we're releasing the ability
00:24:03.000 | to run prompt optimization. So some people call it, we call it meta-prompting. But basically,
00:24:08.920 | we feed it a dataset, we say, here's the input-output pairs, here's the evals on those things,
00:24:13.640 | and then where things failed and didn't fail. Look at the original prompts, look at this dataset.
00:24:19.880 | Can you give me a new prompt that fixes this dataset? Yeah, so we call that meta-prompting,
00:24:25.000 | but it's basically using an LLM to automate that, so you don't have to. Yeah, but good question.
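A minimal sketch of that meta-prompting loop, assuming the OpenAI client; the prompt wording, model, and failure-record fields are illustrative:

```python
# Sketch of a meta-prompting loop: show an LLM the original prompt plus eval
# failures and ask for an improved prompt. Wording, model, and record fields
# are illustrative.
from openai import OpenAI

client = OpenAI()

def improve_prompt(original_prompt: str, failures: list[dict]) -> str:
    """failures: [{"input": ..., "output": ..., "eval": "hallucinated"}, ...]"""
    failure_report = "\n\n".join(
        f"Input: {f['input']}\nOutput: {f['output']}\nEval result: {f['eval']}"
        for f in failures
    )
    meta_prompt = (
        "Here is a prompt template used in production:\n\n"
        f"{original_prompt}\n\n"
        "Here are examples where it produced outputs that failed our evals:\n\n"
        f"{failure_report}\n\n"
        "Rewrite the prompt template to address these failure modes. "
        "Return only the new prompt template."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
    )
    return response.choices[0].message.content
```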
00:24:31.000 | But really appreciate the time. We're over at the booth. Feel free to come grab me if you want to talk
00:24:35.160 | architecture or anything, but really nice to see you all.