Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work

All right. Well, let's get started. We don't have much time, but I hope your conference is going well. Welcome to AI Engineering World Fair. My name is Dat No. Today's talk is about LME Val Pipelines. So I never know what I want to talk about until I get into the room, so I don't prep too hard, but by show of hands, who here has built an agent?

Just raise your hand. Okay. Who here has run an eval? Right? Who here has productionized an AI product? Nice. Okay. Some technical builders. Let's get technical then. So my name is Dat. I'm an AI architect at Arise AI. This is Mochi and Latte. They're dogs of my friends. I figured, let's keep it spicy and interesting.

But I've been building observability and evals since day zero. So since the first, I don't know if you guys know what Arise AI is, but we are the largest AI evals player in the space. So observability evals kind of beyond. We work really heavily with real use cases. So folks like Reddit, folks like Duolingo.

So we work across the best AI teams and we have a really unique business. Being on the observability side, we get to see what everyone is building, how they're tackling those problems, what are their biggest pains, and what are kind of the tips and tricks that they use to really productionize these things.

And just to give you a hint, Duolingo has massive eval scale. They tend to run about 20 evals per trace. So they end up spending quite a fair amount doing evals, understanding their evals, optimizing them. And the last thing about me, I have a huge passion for the AI community.

When I was in SF the last five years, I really loved to go to pretty much every single event that I could. I'm not a developer advocate. I'm an engineer by trade, but I just love the community. So yeah, this is a little bit about Rise, but I think I don't want to keep it too salesy.

I just want to keep it pretty technical. So really three concepts that I think everybody should be familiar with and where evals really sit in the space is really as simple as this. This is what I teach all my customers. Really, the first thing is observability. I think you guys have kind of seen this before.

Observability really just answers the question of what is the thing that I built actually doing, right? To some people, it may be traces, traces in spans. I'll show you a little bit, you know, platform agnostic. Just think about the concepts. But traces might be one area for people. So traces represent, hey, what's happening?

Can I look at things? To an AI engineer, makes a lot of sense. Let's say you're an AI PM, maybe not super technical, or maybe you want to think about things differently. Maybe you want to look at, hey, what are the conversations that are happening? Turns out you can run evals at these levels.

We'll get into that in depth kind of later. You know, signal comes and observability comes in different kind of flavors and forms. Maybe it's analytics. What we're starting to really realize is that LLM teams are getting split into two special niches. There's platform teams, right? And they own things like the infrastructure.

So who here has heard of a model gateway router? It's like an interface pattern behind it or all the models, right? Well, it turns out the central LLM platform team tends to own that. They care about costs, latency, things like that. And then you have the other LLM teams.

These LLM teams sit on the like the outer side of the business, so like a hub and spoke. They work for the business side. So these are like the people building the applications to help the business. So if anyone here comes from like the ML or data science space, it's actually not far from that.

And so different teams care about different metrics. So maybe if you're an AI PM sitting on the business side, you care about evals. If you care about, you know, the platform, maybe you care about costs, latency, things like that. But TLDR, observability, observability represents what's happening. And now evals are really important in this space because the reality of the fact is if you've ever seen a trace or something like that, you're not going to inspect every single trace manually, right?

It is not scalable for you an AI engineer or you the AI PM to look through these things. So what is evals used for? It's actually just a really clever word for signal. You're just trying to understand what's going well and what's not going well. So I'm not here to sell you on like evals.

I think everybody knows how important they are. But if you think evals, evals are LLM as a judge only, there's actually a lot of other tools that you're missing. So LLM as a judge, raise your hand if you used LLM as a judge. Okay, about half the room. It's super great.

You use an LLM to give you feedback or process on any process, including an LLM process. So if you're doing RAG, this is a really good way to think about RAG in terms of evals. RAG would be like, hey, user has a question. We retrieve some context to be able to possibly answer that question.

And then we generate an answer. It turns out every arrow on this is actually an eval you can run. So, hey, I retrieve some context and I want to compare that to the query being asked. Well, that's RAG relevance. It's like, is the thing that I returned even helpful in answering the question?

And so LLM as a judge is great. It's super helpful. You know, I think most people understand why it works, but there's a whole research area on why they're really good indicators. The original task is not the eval task, right? So if I asked you a human, hey, generate me a summary on something long and complex, like the book War and Peace, that's a very different task than say, hey, I wrote this summary for you.

Is it a good one or is it a bad one? But LLM as a judge is a small part. It doesn't always have to be a large language model or autoregressive model. Things like encoder-only BERT-type architectures are super helpful. They're about 10 times cheaper, about one or two orders of magnitudes faster to run that eval.

But, you know, you don't just have LLMs at your disposal. You, a human, also are a really good way to discern signal. So it turns out evals can also come in the form of, is your user having a good or bad experience? So for those people who have productionized some sort of LLM application, do you guys have user feedback?

Raise your hand if you've implemented user feedback. And, okay, about 30%. It's actually incredible signal. So that comes from a human. Obviously, you yourself can also generate labels on stuff. So has anyone here heard of a golden data set? Raise your hand. Okay. Most of the room. And the way I encourage folks to think about this, this is a pro tip, actually.

If the first column represents scale. So LLM as a judge is valuable because I don't have to grade it myself, right? But let's say I don't necessarily trust it off the bat. Use the third column to help you out. So a golden data set represents quality. So you yourself graded it.

You know that it's what you expected, you know. Well, it turns out you can run your LLM as a judge on a golden data set. What you're trying to do is say, hey, can the LLM approximate the thing that I trust? Right? And what that allows you to do is to actually quantify and tune your LLM as a judge.

So we'll go over that in a second. But strong pro tip. Most really strong LLM teams in the world kind of do this today. And it turns out we don't always have to use an LLM or a human. You can use what are called like heuristics or code-based logic.

So I'm going to take you into the platform a little bit to talk through it. But in our platform, you have a way to run evals. Great. What code evals actually are, are just, it's much cheaper. So I'll just run a little test here. But, you know, let's say you want to say, hey, does this output contain any keywords?

I don't need to use an LLM or a human for that. I can just use code. It's infinitely cheaper, faster to run. Does this match this regex pattern, XYZ? Is this a parsable JSON? So the reality of it is you have this kind of large toolbox in your kind of eval set.

So when you say evals, don't just think of LLM as a judge. There's a whole other set of smarter things that you can use. They're actually more cost effective. And so really, you know, this is a really good way to emphasize really the value of like the AI engineer evals and observability.

Most people understand this left-hand circle, this purple one. It actually represents what we all want to do. And it's like, hey, build a better AI system, right? So what you do is you collect data, observability, traces, things like that. Then you run some evals to say, hey, did this process go well or did this process not go well?

So you're discerning signal from that, you know, mass of data. You'll actually collect where areas of things went right or wrong, right? And you'll say, hey, turns out we hallucinated on this. It's because our rag strategy is off or the agent is off, for example. You'll also annotate data sets as well, just to double check that those evals are correct.

And then, of course, you always come back into your platform and you, you know, you update the prompt template, right? You change the model because it wasn't good enough, or you update the agent orchestration. So everybody understands that left-hand circle. Now, a lot of people actually forget about the right-hand circle.

And so it turns out, the first time you run evals, what you'll quickly realize is that they're not perfect, right? You actually have to tune those evals over time. So the way you collect signal actually adjusts as your application also, you know, gets better. And so what I mean by that is that process of running evals, what you might notice if you annotate some of them is that the eval said something hallucinated or wasn't correct, and it actually was, or vice versa.

So what you actually need to do is collect a set of those failures, right? Say, hey, this is where the eval was wrong. And you'll know that by annotating some data every now and again. And then you'll want to improve the eval prompt template, right? Because the way you collect signal at first, you'll quickly realize it's either too obscure, too vague, not specific enough.

So these are the kind of two virtuous cycles that you really want to get through very quickly. And the way I describe it to AI engineering teams is, if you want to build like a quality AI product, think about velocity. So the faster you iterate through stuff, if I can get through four iterations in a month rather than two, you're going to exponentially have a better AI product as you build.

And so when we talk about architectures and things like that, when the industry first started, this was state-of-the-art routers, right? Am I right? So routers are made up of like components. This is a really dumb example of booking.com's trip planner. So booking, you know, they're one of our largest customers.

Trip planner is basically a travel agent in LLM form. It drives revenue for that company. It helps you book, you know, it'll book your flights, your hotels. It'll give you an itinerary. And so, you know, when we think about evals, evals can be as complex as the application itself.

So in kind of older architectures, where there's things like routing, you can eval individual components. I think most people get this when you're looking inside of a trace, for example. Maybe I want to eval a specific component or trace. So this one LLM call, right? But remember... Oops, I'll come up here.

But remember that you can, okay, remember that you can zoom out too. So it doesn't have to just be this one specific component. Let's say this one component is part of an agent or a workflow. Maybe I just want to evaluate the input output of that larger workflow. So that larger workflow is made up of LLM calls, API calls, right?

I have to find actual flights, actual hotels that have vacancy, maybe some heuristics. And then you can zoom out a little bit more. Maybe you want to eval things like the way control flow happens. It's a really important component. If you have components in your AI agents that have control flow in them, it actually makes way more sense to eval your control flow first.

And you have conditional evals, meaning if you didn't get the control flow right, why eval anything down the line? Because it's probably wrong, right? So save yourself some money and some costs. So you can think about conditional evals as well. And then of course, we have things like people want to run evals at the highest level.

So imagine you have a back and forth. So we call this a session in our platform. But the whole idea is, you know, a session is made up of a series of traces. So you can imagine there's a back and forth between you and your agent. I just want to understand, hey, at any point was the customer frustrated?

Was the customer XYZ? So when you start to think about evals, there's no one-stop shop. If anybody says this is how you should do evals, and they never asked you about how your application works, you probably shouldn't trust them. Also, I have a hot take. And my hot take is that don't use out-of-the-box evals.

If you get out-of-the-box, if you use out-of-the-box evals, you'll get out-of-the-box results. So really customize them very heavily. It's something that we've learned really from some of the best teams in the world. Okay, let me come here. Then you have complexity. This is our own architecture for our AI co-pilot.

We built an AI whose one purpose is to troubleshoot, observe, build evals for your AI system. It obviously takes advantage of our platform. But, you know, the reason why we go this route is take us forward five, ten years from now. Do you guys really think that you, a human, are going to be the ones who are evaluating all these AI systems, like, mainly?

What do you think would actually take your place? It's probably going to be an AI that evaluates future AI. So this is our first iteration on this stuff. We're super excited about it. We, you know, it's been out for a year. It's getting better and better. But maybe I'll show you a little bit of the workflows that we have in our platform really quickly.

Who here is working with agents? Okay, who here is interested in, like, agent evaluation? Okay, let's cover that then. Let's see. I'll show you. We'll show you how the industry is doing agent evals. So the agent evals, things get, like, way more complex, right? The calls are longer. When you look at your traces, they're much longer.

I'll actually show you our agent traces. So this is one that kind of failed. But our agent trace kind of works like this. So Copilot works like this. It basically, based off what you say and where you're at in the platform, there's agents that kind of can do things.

And it has tools. Each agent has access to a set of tools that it's particularly good at. So the whole idea is that, yes, we can see what each individual trace is doing, right? We can say, hey, what's happening in this particular area? We can look at the traces.

But the reality of what people are actually asking in the space is not, is my AI agent good or bad? What they're actually asking is, what are the failure modes in which my agent fails, right? And so what I mean by that is, you can look at one individual trace in the graph view of it.

But the reality is you want to understand and discern the signal with your entirety of your AI agent. So what is the, like, what does the pathing look like across all of that particular AI agent's calls? So for instance, if it had access to 10 tools, maybe you want to answer questions like, how often did it call a specific tool, right?

What were the evals in a specific path? So in our agent graph, for example, it's framework agnostic. So whether you use land graph, whether you use crew AI, whether you use your hand rolled code, this is an agnostic way to look at how an agent's pathing performs across the aggregate traces.

And so this helps you understand, okay, if my agent that hits component one, then two, then three, my evals look great. But for some reason, when we hit component four, then two, then three, our evals are dropping. And the reason is why? Well, oh, it turns out component four had a dependency, right, on component three, and it needs that dependency in order to perform.

And so when you think about the complexity of agent evals, you need kind of the ability to see across not one instantiation, but all of them. You need to understand the distribution of what's happening. And so when we think about evals across agents, that's one way you can think about it.

And then maybe an easier way to kind of think about it too is trajectory. We're thinking about trajectory evals. So imagine for a second, you have this specific input, right? And the input is like, you know, hey, find me these hotels at TripPlanner. And you know you should hit this component, then that component, this other component.

So in this case, it's like start agent, tool agent. You might have a golden data set, like very similar to how we have golden data sets for LLM as a judge, but this is for trajectories. So I expect us to be able to hit at least three or four of these components, for example.

So the reference trajectory is kind of mentioned, like I need to hit these components. Then you get to do two things, one of two things. Either one, you can pass in like, here's what we did, here's what we expected, into an LLM. And then an LLM can actually grade the trajectory.

You can also just say, hey, did we explicitly hit these exact like trajectory strings? Great. But you don't always need a ground truth for that. You can start to get creative here. You can say, hey, you know, here's this process that I expected to hit. Does these nodes and the description of their nodes match the correct trajectory, for instance?

And then maybe you could do things like, we're kind of playing around with this, but maybe here's the trajectory that I hit. Here's the possible paths that are just possible. Did I do well in these specific areas, right? And you can pass in the pathing as a series of like, nested key value pairs, for example.

LLMs are pretty good at that. But we start to think about, you know, agent evals. You know, the eval space is already complex enough. And what we're seeing is even more complexity. But hopefully that makes sense. I'll pause here. Hopefully that makes sense, but usually I like to make time for questions at the end to keep this pretty interactive.

Hope that's okay, team. But any questions? Does this make sense? Cool. No questions? Oh, yeah. Go ahead. Most of the evals you're talking about are kind of like after the fact, right? Is there a way that you can use evals sort of like to flow as a pattern? Like that's clearly a hallucination.

Yeah. Incredible question. So a lot of people, so there's evals that can be, you know, some people call them offline or online. I like to say is like, is it in the path? Is it in orchestration or out of orchestration? So for some people, there's a cost to in orchestration evals, and the cost is things like latency, right?

Some people might call those a guardrail too. Like, hey, can I continue or not continue? And so there's pros and cons to everything. I think when it comes to guardrails, this is something I kind of coach my customers. The way to think about guardrails in general is you have system one.

System one is your orchestration system. It's what you built. It's your prompts. It's everything else. System two is your guardrail system, right? Guardrails are really nice because they mitigate risk, right? But there is a cost and the cost is maybe it's latency in your user's experience. You can get around that by doing smart things like maybe embeddings guardrails.

They're, you know, two orders of magnitude shorter. But a lot of people don't think about the other two cons here. The other con is complexity. Two systems is complex, especially when one system checks in with the first. The third thing is that a lot of people mistake guardrails as, like, the thing that needs to be adjusted.

A lot of people will go to their guardrails first, like, oh, I need to adjust my guardrails. The reality is you need to adjust system one. That's the root cause, right? Your guardrails are really there to protect you. And then maybe the last thing I'll say too is guardrails are not infallible.

They kind of act like unit tests. They're for known knowns, right? Whereas observability plus evals, because the reality is you don't know the distribution of what you're going to see until you get there, right? Ask anybody who's built in the LLM space. Their users are just crazy. And so that's the difference.

And I really caution people because people are like, oh, I need to fix my guardrail. No, no, go fix the prompt first and then worry about your guardrails. But yeah, inline, we call those inline evals. Some people call them guardrails. But really, do you do the evals in the orchestration or outside of it?

And so you can, there's pros and cons. So there's no right or wrong answer there. But good question. Yeah, so when we have a complex system that is typically taking a long time to run and we have timeouts in this. And I know you'll have something called a span that limits what kind of view we're taking to a complex agent.

So if we have like a complex system that's really going to take time and then there's an asynchronous way, is there support to manage something like that and evolve across the whole system that we have? Oh, yeah. Amazing question. So who here has ever heard of OTEL? Open telemetry.

Okay, even less than, okay. One of the most important things to our enterprise customers is being on open telemetry. So you know how I said LLM teams are being split into two? Well, it turns out LLM services are also being split across services. And so the idea is like people want to understand.

So maybe an asynchronous process in one service or one Docker container, or you know, one Docker or one Kubernetes pod. OTEL propagation is a great way to get around that, meaning you can have a process like application A sends data to my model router, right? And then that comes back to application A.

Then application A hits application B for some reason. And then it comes back to A. When you're actually creating those traces, you want to be able to see all that work, right? You don't want to just instrument one particular thing. You want to see it across work across. And so OTEL is an incredible pattern for that.

It's a solved problem. So that's why we at Arise two and a half years ago, when this crazy time started for all of us, we made a bet to be OTEL first. And it's it's really paid off. Yeah. Yeah. Okay, so confidence scores on evals, right? Yeah, I think it depends where you're getting your eval.

If eval, if it's from an auto-regressive model, companies like OpenAI have actually exposed the log prob. So the log probability is pseudo like confidence of like, and since you're returning only one code token, and that token is like the eval label, log prob is a really good way for those auto-regressive models.

If you're using things like small language models and coder-only models, they come with a probability of the classification. But really, yeah, it's tough. But you have a bunch of tools in your toolbox, and you generally use them together to discern where things go well or not well. But log prob, if you're using a model provider that exposes the log prob, is a really good way to start for auto-regressive models.

Okay, last question, and then we're time up. Yeah? Hey, do you have anything in your plans, like going forward, like how to shorten the loop between customer feedback and automatically improving the props, and also having like, you know, the development team effort work at it? Oh, good question. Yeah, we want to automate in that area, definitely.

So, who here has heard of DSPY? All right, okay. If you didn't raise your hand on any of this, I hope you learned a bunch. DSPY obviously has something like MiPro. MiPro is an optimizer, you get like 30 inputs, 30 outputs, and then it basically creates less fragile prompts that span across different models.

Like, so it does, it works for OpenAI, and then it works for Gemini, et cetera. In terms of like auto-optimization, yeah, I think we have the ability to, or we're releasing the ability to run prompt optimization. So some people call it, we call it meta-prompting. But basically, we feed it a dataset, we said, here's the input-output pairs, here's the evals on those things, and then where things failed and didn't fail.

Look at the original prompts, look at this dataset. Can you give me a new prompt that fixes this dataset? Yeah, so we call that meta-prompting, but it's basically use an LLM to just automate, so you don't have to. Yeah, but good question. But really appreciate the time. We're over at the booth.

Feel free to come grab me if you want to talk architecture or anything, but really nice to see you all.

Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize

Transcript