
Lessons from the Trenches: Building LLM Evals That Work IRL: Aparna Dhinakaran


Transcript

All right. Hey, everyone. My name is Aparna, one of the founders of Arize. We do LLM evals and observability. I wanted to do a session going really deep on this stuff because you guys all hear LLM as a judge, and you're probably like, yeah, yeah, yeah, but how does it actually work in the real world?

Well, we work with some of the top companies in the space who are all deploying LLM applications, and we've seen a lot go well and not go well in the real world. And so even though you've probably seen this tweet from Greg a bunch, evals are all you need, you're probably like, what does that actually mean when you're putting it in the real world?

And I'm going to demystify a little bit of that and talk about some real examples today. So first off, there's a distinction between types of evals that we should just clarify. First, there's model evals. If you're on Hugging Face, you're looking at the OpenLLM leaderboard, and you're like, okay, this 3B or whatever 7B model is better than that one because of some MMLU metric.

Well, they're actually stacking and ranking different models against each other. And these are really helpful when you see things like the needle in the haystack test to understand which model to actually use. But for most of you in the room who are probably building the applications, you probably care more about task evals.

And what I mean by that is, is the LLM application actually working? And how do you define evals that actually help you figure that out? So let's talk about how task evals work in the real world. This is probably a review to most of you. Most of the industry is converging around a couple different options, LLM as a judge, user feedback, heuristic-based approaches.

Just as an overview for folks who don't know what LLM as a judge is, it's basically when you're using AI to evaluate AI. I take the input, I take the output of my application's response, I might take the context that it was given, pass it all into an eval prompt template, and then I can actually have LLM as a judge come back with an evaluation of how it did.
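To make that concrete, here's a minimal sketch of an LLM-as-a-judge call in Python, using the OpenAI chat completions API as the judge. The prompt template, label set, and model choice are all illustrative assumptions, not the exact templates from the talk.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical eval prompt template: the judge sees the user input, the
# context the app retrieved, and the app's response, and returns a label.
EVAL_TEMPLATE = """You are evaluating a response from an AI assistant.

[User question]: {question}
[Retrieved context]: {context}
[Assistant response]: {response}

Is the response relevant to the question and supported by the context?
Answer with exactly one word: "correct" or "incorrect".
"""

def llm_as_a_judge(question: str, context: str, response: str) -> str:
    """Fill the eval template and ask a judge model for a label."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any judge model works; this choice is illustrative
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(
            question=question, context=context, response=response)}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().lower()
```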

Let's talk about how this works in a simple application. So this is a really common one we're seeing in the ecosystem. It's a chat to purchase type of application. E-commerce applications use this a lot. So the way it begins is the customer asks some kind of question, hey, blah, blah, blah, I'm looking for a new Kindle.

And first there's this component, call it a router, that's actually deciding what the customer intent is. A lot of folks use function calling for this. So there's a function call that happens, it determines which path to send the user down, and then there's the actual workflow, the execution branch of what happens next.
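Here's a rough sketch of what a function-calling router like that can look like, using the OpenAI tools API. The tool names, parameters, and descriptions are hypothetical stand-ins for whatever routes your application actually defines.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical intent "routes" exposed as tools; a real application would have
# its own function names, parameters, and descriptions.
TOOLS = [
    {"type": "function", "function": {
        "name": "product_search",
        "description": "Search the catalog for products matching a query.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "customer_support",
        "description": "Route the user to customer support workflows.",
        "parameters": {"type": "object",
                       "properties": {"issue": {"type": "string"}},
                       "required": ["issue"]}}},
]

def route(user_message: str):
    """Ask the model which function (execution branch) to run for this message."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": user_message}],
        tools=TOOLS,
    )
    # Assumes the model decided to call a tool; production code should handle
    # the case where it answers directly instead.
    call = resp.choices[0].message.tool_calls[0]
    return call.function.name, json.loads(call.function.arguments)

# e.g. route("Hey, I'm looking for a new Kindle")
#   -> ("product_search", {"query": "Kindle"})
```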

This is a really, really simple one. In the real world, the applications get a lot more complex. I see probably a couple of these every week, where basically the user intent gets decided by an LLM call, and it has to get that user intent correct.

This is what they all care about, because otherwise it sends users down the wrong path. If a user is asking, hey, recommend me some product to go buy, but it sends them down something related to customer support, well, the issue you actually want to catch is: did it get the function call that determines user intent correct?

And so when you look at an application like this, where there's a router, and routers are probably the most common agentic type of workflow we're actually seeing in production today: there's a router call, then there are LLM calls, then there are application calls, and then there are maybe even calls in the middle to traditional ML models, where you're actually calling out to search.

How many of you guys have an application that looks like this, or have seen one, or built one internally? Okay, awesome. This is kind of where we're seeing a lot of applications being built. It's not just a simple API call, and here's a response.

It's actually built up in levels. And so as your applications get more complex, your evals are going to get more complex. "There's levels to this" is kind of the theme you'll hear today. If you're evaluating something like this, you want evals at different levels of your application.

There's an eval, most importantly, at the router level to help you figure out the path it should go down: did it go down the right execution branch? And then within each execution branch there are often component-level evals being done. I'm going to give you guys a demo so we can show you a real application and show you where something goes wrong, but just to set some context, we'll actually dive really deep today into this router eval in applications.

The key thing to take away is you'll see questions like this in the demo we're going to give: users ask questions in the application, and typically you want to figure out, well, did it go down the right function call? So in this case, the user asked about details of a product.

Did it go down the product details function call? And then there's another kind of implicit one that's often done, which is: did we extract the right parameters for the function call? Because if you don't give it the right parameters, then it doesn't matter if you picked the right function call, it's still not going to get it right.
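A minimal sketch of those two checks, assuming you have (or hand-label) the expected function call for each example; the field names and example values below are purely illustrative.

```python
def eval_function_call(expected_name: str, expected_args: dict,
                       actual_name: str, actual_args: dict) -> dict:
    """Two evals for a single routed request: right function, right parameters."""
    right_function = actual_name == expected_name
    # Only check the parameters the expected call cares about.
    right_params = all(actual_args.get(k) == v for k, v in expected_args.items())
    return {"correct_function": right_function, "correct_parameters": right_params}

# Example: the user asked about promotions, but the router chose product search.
print(eval_function_call(
    expected_name="promotions_discounts",          # hypothetical function name
    expected_args={"product": "Samsung phone"},    # illustrative value
    actual_name="product_search",
    actual_args={"query": "Samsung phone"},
))  # -> {'correct_function': False, 'correct_parameters': False}
```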

So these are kind of the two ones I'm going to actually walk through and show you guys what it looks like. I'm going to actually show you guys Phoenix today. Phoenix is our open source product. You guys are welcome to try it out and download it. This is actually Phoenix live for my application that I was talking about.

And right now what you're actually looking at is a trace of the application, a very simplified trace. This is what a user asked: could you tell me if there are any current promotions for Samsung, whatever phone? And then this is the output that the application responded with.

Within this, you can go and look through all the different questions that users are asking here, and for each one of these you'll actually see a full stack trace.

What's most important here, and you can kind of see it in the one I clicked on, is that it actually says it got the function call wrong. So I'm going to dive into that one and we can go look at it.

It says the user is actually asking about current promotions for this phone, but the generated function call is for a product search, which may not specifically address promotions. A more appropriate function call might be the one that directly queries promotions or discounts.

We actually do have a function call available within the application for promos and discounts (it might be easier to see that one in the slides), but it didn't call that one; it called the one that's specifically about product search instead. And so it called the wrong function, and this is one where the rest of the entire execution branch is going to be off, because it got that first call wrong.

So if I had to zoom back out: what do you care about in an application like this? You care first about your traces, because you want to see what the heck's happening and where it's going down the flow. You care about evals, because they tell you where in the application it got something wrong. And then you also care about, and we'll go into this, explanations of the evaluation.

So that when it gets it wrong, you get a view of where it actually went wrong, what to go fix, and what to actually go do to iterate and improve the application. Traces, then evaluate, then use that to actually iterate on your application.

That's kind of the loop that people do as they're building these. I can take this example and go add it to my dataset. And then I can say, all right, every single time I get something wrong like this, I'm going to build up my dataset and then use it to eventually run experiments.
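As a rough sketch of that loop, under the assumption that you've collected the judged-incorrect examples from your traces (the function name and example values below are hypothetical):

```python
# Dataset of past failures; in practice these would come from your traces,
# not a hard-coded list.
failing_examples = [
    {"question": "Are there any current promotions for this Samsung phone?",
     "expected_function": "promotions_discounts"},   # hypothetical function name
    # ...append every new failure you catch
]

def run_experiment(examples: list, route_fn) -> float:
    """Score a candidate router (e.g. one with a modified prompt or better
    function descriptions) against the dataset of past failures."""
    correct = 0
    for ex in examples:
        chosen_function, _args = route_fn(ex["question"])
        correct += chosen_function == ex["expected_function"]
    return correct / len(examples)

# baseline_score  = run_experiment(failing_examples, route)           # original router
# candidate_score = run_experiment(failing_examples, improved_route)  # modified prompt
```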

And run experiments where I can track and improve. This is one where I modified the prompt, and I can run these experiments and then continuously iterate. Maybe the function description wasn't right, maybe the call from the LLM wasn't right. And so there's all sorts of things you can actually do to improve, but it really helps when you have evals at different levels of the application, so that you know where to go focus and where to actually go improve.

So with that, I'm going to jump to just some of the best practices we've seen from the ground. So you saw an example of basically a router-based application with function calling evals. There are different levels that we see to applications. How many of you guys have a chatbot with multiple back-and-forth sessions in there?

Well, typically you want evals at different levels of that: at a session level, often at a trace level, often at a span level. So getting this stuff to actually work in the real world isn't just a single eval and we're good. It's often a single eval, plus an explanation to help me understand, plus the ability to drill down to exactly where, to which component.
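As a toy illustration of those levels, here's what eval results attached at the session, trace, and span level might look like; the structure and metric names are made up for the example.

```python
# One conversation (session) containing one user turn (trace) made up of
# several steps (spans), each carrying its own eval.
session_evals = {
    "session_id": "abc-123",
    "session_eval": {"user_frustration": "not_frustrated"},   # whole conversation
    "traces": [
        {
            "trace_id": "t-1",
            "trace_eval": {"answered_question": "correct"},    # one back-and-forth
            "spans": [
                {"span": "router",    "eval": {"function_call": "incorrect"}},
                {"span": "retrieval", "eval": {"relevance": "relevant"}},
                {"span": "response",  "eval": {"hallucination": "factual"}},
            ],
        },
    ],
}
```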

And so these levels really help you do that. And what we see folks do is they actually start to do this in iterative phases. They first start off benchmarking the evals when they're building. And then as they're actually building the application and they're building each of the different components, they're developing those eval templates iteratively along the application.

And then as they move into production, they can actually go monitor it and run it as jobs, but you're doing this as an iterative process as you're building. If there's one thing you take away from my talk today, I hope it's actually this slide. Evals with explanations are by far what we see real people deploying applications find the most useful in production.

A single correct or incorrect label on its own makes it really hard to know what to go fix. But when you have an explanation like the one we were looking at, it makes it easier for teams to go, okay, here's what I go fix, here's what I go dig into. And so run your evals with explanations if you can.
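One way to do that, sketched here with an illustrative template and parser, is to ask the judge to write its explanation first and its label second, then split the two apart:

```python
# Hypothetical eval template: explanation first, label second.
EXPLAIN_THEN_LABEL_TEMPLATE = """You are evaluating whether the generated
function call matches the user's intent.

[User question]: {question}
[Generated function call]: {function_call}

First write EXPLANATION: followed by one or two sentences on whether the call
addresses the question. Then on a new line write LABEL: correct or incorrect.
"""

def parse_judge_output(text: str) -> dict:
    """Split the judge's raw response into an explanation and a label."""
    explanation, _, label_part = text.partition("LABEL:")
    return {
        "explanation": explanation.replace("EXPLANATION:", "").strip(),
        "label": label_part.strip().lower(),
    }
```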

There are different types of outputs you can generate from these evals, actually. If any of you guys are familiar with ML, there are regression-type models, classification-type models, et cetera. Well, there are different types of evals too: numeric score outputs, categorical outputs, multi-output, multi-class.

Can I actually, maybe this is a fun question. How many of you guys use numerical outputs as your LLM evals? Okay. Okay. A few brave folks. How many of you guys use categorical evals? Okay. Both. Okay. Nice. I'm going to actually share, we did a ton of research around this and we've been sharing about this, but if you are using numerical outputs today, I highly recommend you actually don't only rely on them.

Here's a little research we shared, and I'll share some results of it. Numeric scores, just for people who need a refresher, are when you have the output of your LLM as a judge be a single number. And this is a simple example. I have a document: one version where we've corrupted the document with a lot of spelling errors, and one where we've corrupted it with very few spelling errors.

So for one of them the corruption's around 80%, and for the other the corruption's around 11%. And we asked the LLM as a judge, hey, can you evaluate and tell us how bad the spelling errors in this document actually are? And for both of them, it gave an eval score of 10.

And we noticed this was really consistent across all the foundational models. I think Mistral actually did pretty well compared to some of the rest. But across all the foundational models, it was actually pretty binary in how it did the scores. It was either a 0, or a 1, or a 10.

But it was never the linear range of scores you'd expect, where as you increase the density of corruption, the score increases along with it. It was pretty binary, which indicated that if you're using numeric scores, it might not be the right way to evaluate, because you're not going to catch the granularity.
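For reference, here's a minimal sketch of the kind of corruption setup described above; the exact method in the research may differ, this just misspells a target fraction of words.

```python
import random

def corrupt_document(text: str, density: float, seed: int = 0) -> str:
    """Misspell roughly `density` fraction of the words in a document by
    swapping two interior characters."""
    random.seed(seed)
    words = text.split()
    for i, word in enumerate(words):
        if len(word) > 3 and random.random() < density:
            chars = list(word)
            j = random.randrange(1, len(chars) - 2)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
            words[i] = "".join(chars)
    return " ".join(words)

# heavily_corrupted = corrupt_document(doc, density=0.80)
# lightly_corrupted = corrupt_document(doc, density=0.11)
# In the research described above, the judge scored both a 10.
```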

You know, an 80% doesn't really mean anything; it's not going to mean anything different than a 10% evaluation. So just a little best practice from the ground that we've been seeing as we've been running evals with customers. In the last four minutes here, I'll share a couple more.

This is slightly more model evals-related research for folks to kind of see the latest on that front. For folks who have been following the needle in a haystack test, this was a really popular one trending on Twitter recently. Needle in a haystack test was basically we put a needle in a haystack.

We hid a fact in some context window. And the context window size can change. But the key thing we were also trying to figure out is, does placement in the context window matter? So this is an example where the fact was placed within the first 5% of the context window.

This is an example where the fact was placed kind of lower down in the context window, at 90%. And the reason to do this type of research is, well, if you're using RAG, which I'm sure many of you guys in this room are: does it matter if the most important part of what you put in the context window is actually lower in the document? Does that impact the final output the LLM gives?
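Here's a rough sketch of the placement setup, i.e. dropping a known fact at a chosen depth of a long context and then asking the model to retrieve it; the helper below is an illustrative reconstruction, not the exact harness used in the research.

```python
def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Place a known fact (`needle`) roughly `depth_percent` of the way
    through a long filler context (`haystack`)."""
    position = int(len(haystack) * depth_percent / 100)
    # Snap to the next sentence boundary so the needle isn't dropped mid-sentence.
    boundary = haystack.find(". ", position)
    boundary = position if boundary == -1 else boundary + 2
    return haystack[:boundary] + needle + " " + haystack[boundary:]

# context = insert_needle(long_filler_text, "The magic number is 42.", depth_percent=5)
# Then ask the model "What is the magic number?" and check whether it retrieves it
# as you vary both depth_percent and the total context window size.
```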

And it turns out it actually does. So we did a lot of pressure testing against a number of foundational models. We do have the latest from the Opus model; I just don't have it in this deck right now. But these are actually results from Anthropic Claude 2.1 versus GPT-4.

Sorry, it's hard to read, but the x-axis on the bottom is basically the context window size, and the y-axis is basically the depth in the document. And GPT-4 was for sure better at being able to retrieve the fact. But we consistently noticed that if you put the fact earlier in the context window, especially as you increase the context window size, the model actually had a really hard time remembering or retrieving that fact.

And so we repeatedly, as we ran this, saw this kind of -- you know, red's where it gets it wrong, green's where it gets it right. But it consistently has this, like, red block earlier in the context window. So for folks who are actually using RAG, depending on how much information you're putting in the document, it's important to just balance where you place it in the document as well.

Another couple of research results: we also tested not just retrieval, but retrieval with generation. What do I mean by generation? Well, after you do the retrieval, you generate a response on top of it; it's the G in RAG. So some of the common types of generation were things like: from this financial document, round the numbers, or map the dates, or concatenate the strings.

So these are all common types of generation tasks. And again, we stack-ranked two different models against each other. This one's actually super interesting, because GPT-4, which at that point was state-of-the-art, did almost four times worse than Anthropic Claude 2.1. And we were really confused about why it was really great at retrieval but wasn't great at generation.

And we kept going back and trying to understand why this was happening over so many results, and talked to the team. And basically, we modified something in the prompt that made it so much better: we asked it to please explain yourself and then answer the question. If any of you guys have noticed, Anthropic's models are slightly wordier.

And actually, in this scenario, that was more of a feature than a bug. Because it was wordier, it kind of thought through the process, and then answered the question correctly and did the generation at the end, as opposed to GPT-4.
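The tweak was roughly of this shape; the exact wording used in the research isn't in the talk, so the two templates below are illustrative, with a made-up rounding task as the generation step.

```python
# Baseline: ask for the generation task directly.
BASELINE_PROMPT = """{context}

Round every dollar amount in the document above to the nearest whole number
and list them."""

# Modified: ask the model to explain itself first, then answer.
EXPLAIN_FIRST_PROMPT = """{context}

First, please explain your reasoning step by step. Then round every dollar
amount in the document above to the nearest whole number and list them."""
```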

But when we asked GPT-4 to actually explain itself and then answer the question, it was able to get a pretty remarkable jump in performance on generation. So hopefully this was helpful to give you guys a view of different types of task and model evals. If you want to hear more about this, we're actually hosting an event, Arize Observe, on July 11th.

This is my code for a free ticket. If any of you guys want to go, there's all sorts of researchers from OpenAI, Anthropic, Mistral, who are all coming to share model evals as well as builders who are sharing their own task evals. So check it out. Thanks, everyone. Thank you.
