Hey, everyone. My name is Aparna. I'm one of the founders of Arize, and today we're going to talk about agent evaluation. At Arize, we build development tools to help teams build agents and take them all the way to production. We focus on everything from evaluation to observability, monitoring, and tracing your application so you can see every single step that your application took.
Let me tell you a little bit about why we got into this, and then I'll jump into some concrete tips about how we think about evaluating agents. First, building agents is incredibly hard. There is a lot of iteration that goes on at the prompt level, at the model level, and on the different tool call definitions.
For a lot of teams, this is what their experience looks like. They're in an Excel sheet. They're swapping out different prompts. They're trying to figure out whether one prompt gave them a better output than another. Often, a lot of this is just based off of how it felt on a couple of different examples, and then they go live into production with that prompt.
Part of this is that it's pretty difficult to systematically track where a new prompt is doing better than your previous prompts, or where a new model is doing better. And it's hard to include other people, especially your product managers or your SMEs, as part of this iterative, evaluation-driven process for improving your application.
And so it's hard to consistently figure out what makes your agent better. And it doesn't get easier once you actually deploy into production. It's pretty hard to understand where the bottlenecks in my application are. Is there a specific sub-agent or tool call that is consistently creating poor responses?
How do I actually identify those bottlenecks? And then what do I need to do to go fix them? So today, I'm going to dive into the different components of how I think about agent evaluations. We're going to talk about some of the most common components: evaluating at the tool call level, then taking that one step further to the trajectory, looking across an entire trace at whether, for example, the agent called all the tools in the right order.
We're going to then look not only across a single trace, but across multi-turn conversations. Because these interactions are no longer just a single-turn experience. They're often multi-turn, keeping track of what happened in the previous interaction and keeping that in mind as context for the next turn of the conversation.
So we're going to talk a little bit about that. And then I'm going to jump into an approach that we've been really excited about, which is: how do we get these agents to self-improve? And that starts with not just thinking about the agent improving, but also your evals consistently improving.
So with that, let's jump in. I'm going to do a little bit of slides and then I'm going to jump into a real example, so you can actually see it in practice. First, we're going to talk a little bit about tool calling evals. Anyone who's building agents is probably writing a lot of different tools and making sure that your agent has access to call all of them depending on the action it needs to take.
And pretty consistently, your agent needs to make the decision of what's the right tool to call in a specific scenario. It potentially has context from a previous part of the conversation, or from previous actions it's taken, and it has to figure out what to call in order to continue whatever's happening in that interaction.
So not only does it have to pick the right tool, it also has to figure out, from that conversation or context, the right arguments to pass into that tool call. And so it's pretty important to evaluate: did it call the right tool, and did it pass the right arguments into that tool call?
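To make that concrete, here's a minimal sketch of what a tool-calling eval can look like as an LLM-as-a-judge check. This is not Arize's exact implementation: the prompt wording, the model name, and the helper name `evaluate_tool_call` are all illustrative assumptions.

```python
# Sketch of a tool-calling eval: grade tool selection and argument
# correctness separately with an LLM as a judge. The prompt wording,
# model name, and helper name are illustrative, not Arize's implementation.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a single tool call made by an AI agent.

Available tools (JSON schemas):
{tool_definitions}

Conversation context the agent saw:
{context}

The agent called the tool "{tool_name}" with arguments:
{arguments}

Grade two things:
1. tool_selection: "correct" or "incorrect" -- was this the right tool to call here?
2. arguments: "correct" or "incorrect" -- do the arguments match the schema and the context?

Respond as JSON: {{"tool_selection": "...", "arguments": "...", "explanation": "..."}}"""


def evaluate_tool_call(context, tool_name, arguments, tool_definitions,
                       model="gpt-4o-mini"):
    prompt = JUDGE_PROMPT.format(
        tool_definitions=json.dumps(tool_definitions, indent=2),
        context=context,
        tool_name=tool_name,
        arguments=json.dumps(arguments, indent=2),
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
```

The key idea is that the judge sees the same context the agent saw, plus the tool schemas, so it can grade separately whether the right tool was picked and whether the arguments were filled in correctly.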
And I'm going to go into a little bit of depth on this and show you how we think about evals from a product perspective. This is the Arize product. You can see here, I'm looking at the traces of our own copilot. This is our own copilot, our own agent. Think of it almost like an insights tool, where teams can come in and ask all sorts of questions about where their application is doing well or not doing well, and use it to troubleshoot their application and suggest improvements.
And like any great product, we dogfood our own tooling. So these are the traces of real questions that users have asked us, and we evaluate these interactions so we can understand where our copilot is doing really well and where it's not. One thing we like to look at a lot is not just the individual traces, but a higher-level view where we can look across all the different paths, all the different trajectories, that our agent can go down.
So in our case, this is the architecture of our agent. You can see here, we follow a bit of an orchestrator-worker type pattern, where at the very top level there's a planner that decides, based off of the information, which path to go down, and there are all sorts of different tools that it can then call.
And sometimes, depending on the output of those tools, it might need to call another router or orchestrator to figure out what tool to call next. So there are multiple levels to this to make sure that ultimately we respond to the user well.
With this specific agent of ours, what I think about a lot is, at the planner level, at the very beginning: across all the different paths that this agent could go down, where is it doing really well?
And are there any bottlenecks in performance? As I look through some of this, there are evals around generic questions that users are asking, and it looks like we're actually not doing so well on search. I can see here we're getting it correct about as often as we're getting it incorrect, roughly half and half, which is not that great.
So this is probably an area I would dive into, to look at the bottlenecks and where we're not doing so well when the user asks search questions. And it looks like there are other kinds of questions we're doing pretty well on. So this type of high-level view gives me a picture of all the different paths my agent can go down and pinpoints what I should go focus on specifically.
So now when I go look inside my traces, I can start with something like Q&A correctness. In this case, what I care about specifically is search Q&A correctness, so what I should go look at is the cases where search Q&A correctness is marked incorrect.
Let me go take a look at some examples of that and try to understand what I'm doing wrong here. As I look at these, I can now drill into specific traces. And at this level, I have evals across the entire trace.
I have evals on the individual tool calls. And at the tool call level, I have not only whether it called the right tool, where it says the function call is correct, but also evals on whether it passed all the right arguments. In this case, it looks like that's where it's going wrong.
It says, "therefore, the arguments do not align with the required parameters, leading to the conclusion that the arguments are incorrect." So I probably have an issue here where, even though I'm calling what looks like the right tool out of all the different ones I have, I'm not passing the right arguments into that tool call based on the context of the conversation.
So that's something I should go fix. That's the first big one we think a lot about. The next one that's pretty interesting is that, for a lot of these, it's not just a single tool call that's made. It's many, actually.
And you can see that in the way our architecture is built. Even within Q&A correctness, when we go down the search correctness path, there are a lot of different sub-tools that get called. So it's important to check not only whether it's calling an individual tool correctly, but also whether it's getting the order of the tools it's supposed to call correct.
And that's really what trajectory evals start to be about: is it calling tools in the right order? Across the series of steps needed for the agent to complete a task, is it consistently calling and executing them in the same set of steps, eventually converging on some number of steps to complete the task? Or does it sometimes veer off and call them in a different order, and therefore, A, require me to spend a lot more tokens to do the same task, and B, end up producing wrong outputs because of that? So we recommend teams drill in and look not just at an individual tool call, but across an entire trace, and look at the order of the tool calls to see if that's actually done well.
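As a rough illustration, here's a small sketch of a trajectory check that compares the tools an agent actually called on a trace against a reference trajectory. The tool names and the two comparison functions are hypothetical; in practice you might also layer an LLM-as-a-judge over the whole trace.

```python
# Sketch of a trajectory eval: compare the tools an agent actually called on a
# trace against a reference trajectory. Tool names here are hypothetical.
from typing import List


def trajectory_exact_match(actual: List[str], expected: List[str]) -> bool:
    """Strict check: exactly the expected tools, in exactly the expected order."""
    return actual == expected


def trajectory_in_order(actual: List[str], expected: List[str]) -> bool:
    """Looser check: the expected tools appear in order, extra steps allowed.
    Extra steps are where wasted tokens tend to show up."""
    it = iter(actual)
    return all(step in it for step in expected)


# Example: a search question that should plan, search, then summarize.
expected = ["plan_query", "run_search", "summarize_results"]
actual = ["plan_query", "summarize_results", "run_search"]  # summarized too early

print(trajectory_exact_match(actual, expected))  # False
print(trajectory_in_order(actual, expected))     # False: order not preserved
```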
And then evaluating whether, overall, it's consistently getting the tool calling order correct; in this case, it's marked incorrect. The next step is that you can look across a single trace or interaction, but a lot of the interactions we're seeing with agents actually end up being multi-turn.
So in this case, I have about three back-and-forths between a human and an agent. And there are a lot of interesting questions you can ask at this stage. You can ask questions like: is the agent consistent in tone? Is the agent asking the same question again and again?
In which case, it's not really learning anything from the previous interactions it's had with the human. And part of that is really: does it keep track of context from the previous n minus one turns in order to answer the nth turn of that conversation really well?
These are all the types of questions to think about when you have something that is multi-turn. I'll show you an example from another project where I do have some of that multi-turn interaction, one where there's this back and forth with an agent.
What I care about across this entire conversation is: did the agent correctly answer all the questions? Did it keep context so that it wasn't missing something that was established earlier in the conversation? So I deeply recommend folks include session evaluations as part of evaluating their agents.
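Here's a minimal sketch of what a session-level eval could look like, again using an LLM as a judge, but over the whole conversation rather than a single trace. The rubric mirrors the questions above (consistent tone, repeated questions, context retention); the prompt wording and model name are illustrative assumptions, not a fixed template.

```python
# Sketch of a session-level eval: judge the whole multi-turn conversation on
# tone consistency, repeated questions, and context retention. Prompt wording
# and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

SESSION_PROMPT = """You are reviewing a multi-turn conversation between a user and an AI agent.

Conversation:
{transcript}

Grade the agent on each criterion as "pass" or "fail":
- consistent_tone: the agent keeps a consistent tone across turns
- no_repeated_questions: the agent does not ask for information the user already gave
- context_retention: later answers correctly use context from earlier turns

Respond as JSON: {{"consistent_tone": "...", "no_repeated_questions": "...", "context_retention": "...", "explanation": "..."}}"""


def evaluate_session(turns, model="gpt-4o-mini"):
    # turns: list of {"role": "user" | "assistant", "content": str}
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": SESSION_PROMPT.format(transcript=transcript)}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
```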
And then lastly, I think this is a good spot for a deeper dive. We've spent a lot of time just now, on tool calling, trajectories, and sessions, talking about how to evaluate the agent, the application prompt itself.
And this is really important. We could spend a whole deep dive on this alone, but it's really important to evaluate correctly and identify where it goes wrong, so that you can annotate or refine those outputs and use them to improve your existing prompt.
And I think a lot of teams totally get this and are doing it all the time to improve the agent prompt. But what's really important is that the evals you're using are the crux of how you identify those failure cases; they end up being crucial to calling out what you need to improve.
And you don't want those evals, the prompts you're using for LLM-as-a-judge, to remain static. So there's really another loop going on here, which is about improving the evals and the eval prompts. Part of this is consistently checking the eval outcomes as well, to make sure it wasn't just the application that got it wrong; it could have been the eval that incorrectly labeled it as wrong.
Start to identify where the eval itself might need some improvements and, similar to the process you followed for the agent application, set up a workflow where you're iterating on the eval prompts, building up a golden dataset, and consistently refining it. There really are two iterative loops going on at the same time: one for your agent application prompts and one for your eval prompts.
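Here's a sketch of that second loop, under the assumption that you've built a small golden dataset of human-labeled examples: score the judge against the human labels, then only keep a new eval prompt when agreement goes up. `run_judge` is a placeholder for whichever LLM-as-a-judge function you're iterating on, such as the sketches above.

```python
# Sketch of the second loop: score the eval itself against a human-labeled
# golden dataset, so a judge-prompt change only ships when agreement improves.
# `run_judge` is a placeholder for whatever LLM-as-a-judge function you use.

def score_judge_prompt(golden_examples, run_judge):
    """golden_examples: list of {"input": ..., "human_label": "correct" | "incorrect"}.
    Returns overall agreement plus false positives/negatives to guide prompt edits."""
    agree = false_pos = false_neg = 0
    for ex in golden_examples:
        judge_label = run_judge(ex["input"])  # returns "correct" or "incorrect"
        if judge_label == ex["human_label"]:
            agree += 1
        elif judge_label == "correct":        # judge passed it, human failed it
            false_pos += 1
        else:                                 # judge failed it, human passed it
            false_neg += 1
    n = len(golden_examples)
    return {
        "agreement": agree / n,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }

# Loop: edit the judge prompt, re-run score_judge_prompt on the same golden set,
# and only adopt the new prompt when agreement goes up.
```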
And as you think about this, both loops go hand in hand to create a really good product experience for teams. There's a lot more here we could dive into, but hopefully this gives you a bit of a primer on how to think about agent evaluations. And check out Arize Phoenix.
It's a completely open-source product that you can use to work through a lot of what we just covered and test it out in your own applications. You can also check out Arize AX if you want to run a lot of these evaluations on your own data.
So feel free to check it out, and hopefully you got something out of this. Thanks, everyone, for your time.