
Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran


Chapters

0:00 Introduction
3:04 Tool Calling
8:27 Tool Calling Order
10:15 MultiTurn
11:40 Evaluations

Whisper Transcript

00:00:00.000 | Hey, everyone. My name is Aparna. I'm one of the founders of Arize, and today we're going to talk
00:00:06.680 | about agent evaluation. At Arize, we build development tools to help teams build agents
00:00:12.520 | and take them all the way to production. We focus on everything from evaluation to
00:00:18.360 | observability, monitoring, and tracing your application so you can see every single step
00:00:24.240 | that your application took. Let me tell you a little bit about why we got into this, and then
00:00:29.140 | I'll jump into some concrete tips about how we think about evaluating agents. First, building
00:00:35.640 | agents is incredibly hard. There is a lot of iteration that goes on at the prompt level, at
00:00:40.580 | the model level, and on the different tool call definitions. For a lot of teams, this is what
00:00:47.460 | their experience looks like. They're in an Excel sheet. They're swapping out different prompts.
00:00:52.980 | They're trying to figure out, did one prompt give them a better output
00:00:58.280 | than another prompt. Often, a lot of this is just based off of how it felt on a couple different
00:01:04.120 | examples, and then they go live into production with that prompt. Part of this is that it's
00:01:10.600 | pretty difficult to actually systematically track where is this new prompt doing better than your
00:01:17.900 | previous prompts? Where is this model doing better? And it's hard to actually include other people,
00:01:23.320 | especially your product managers or your SMEs, as part of this iterative,
00:01:28.580 | evaluation-driven process for how you actually want to think about improving your application.
00:01:34.760 | And so it's hard to consistently figure out what makes your agent better. And it doesn't get easier
00:01:42.840 | once you actually deploy into production. It's pretty hard to understand, well, where are the bottlenecks in my
00:01:47.880 | application? Is there a specific sub-agent or tool call that is kind of consistently creating
00:01:56.680 | poor responses? How do I want to actually identify those bottlenecks? And then what do I actually need
00:02:02.120 | to do to go fix it? And so today, I'm going to be diving into a little bit of the different components
00:02:07.800 | of how I think about agent evaluations. We're going to talk about some of the most common components,
00:02:12.760 | which is evaluating at the tool call level, taking that one step further, going all the way to the
00:02:19.000 | trajectory and looking at, across an entire trace, did it actually call,
00:02:26.600 | for example, all the tool calls in the right order? We're going to then not only look across the single
00:02:31.080 | trace, but also across multi-turn conversations. Because these interactions are no longer just
00:02:37.800 | a single-turn experience. They're often multi-turn, keeping track of what happened in the previous
00:02:42.280 | interaction and keeping that in mind as context for the next turn of the conversation. So we're
00:02:49.320 | going to talk a little bit about that. And then I'm going to jump into an approach
00:02:54.120 | that we've been really excited about, which is how do we get these agents to self-improve? And that
00:02:58.600 | starts with not just thinking about the agent improving, but also your evals consistently improving.
00:03:04.680 | So with that, let's jump in. I'm going to do a little bit of slides and then I'm going to
00:03:07.960 | jump into actually a real example, so you guys can actually see it in practice. So first,
00:03:13.720 | we're going to talk a little bit about tool calling evals. Anyone who's building agents is probably
00:03:18.760 | writing a lot of different tools and making sure that your agent has access to call all these different
00:03:24.440 | tools depending on the action it needs to take. And pretty consistently, your agent probably needs to
00:03:30.920 | make the decision of what's the right tool to call in this specific scenario.
00:03:35.880 | I have, potentially, context from a previous part of a conversation or previous actions it's
00:03:40.760 | taken. And what do I actually need to call in order to continue whatever's happening in that
00:03:47.320 | interaction? So not only do you have to pick the right tool call, but you also have to figure out
00:03:53.000 | from that conversation or context what's the right arguments to pass into that tool call.
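To make that concrete, here is a minimal sketch of a tool-calling eval in Python. It is not Arize's implementation; the judge prompt, model name, and helper function are illustrative assumptions. It checks tool selection with an exact match and uses an LLM as a judge for argument correctness.

```python
import json
from openai import OpenAI  # any LLM client works; OpenAI is just an example

client = OpenAI()

# Hypothetical judge prompt for argument correctness (not an Arize template).
ARG_JUDGE_PROMPT = """You are evaluating a tool call made by an AI agent.
Conversation context:
{context}

Tool called: {tool_name}
Tool schema: {tool_schema}
Arguments passed: {arguments}

Did the agent pass arguments that are correct and consistent with the context
and the tool's required parameters? Answer with exactly one word, correct or
incorrect, followed by a short explanation."""


def eval_tool_call(context: str, expected_tool: str, actual_tool: str,
                   tool_schema: dict, arguments: dict) -> dict:
    # 1) Tool selection: a simple exact match against the expected tool.
    tool_correct = (actual_tool == expected_tool)

    # 2) Argument correctness: use an LLM as a judge over the conversation context.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model
        messages=[{
            "role": "user",
            "content": ARG_JUDGE_PROMPT.format(
                context=context,
                tool_name=actual_tool,
                tool_schema=json.dumps(tool_schema),
                arguments=json.dumps(arguments),
            ),
        }],
        temperature=0,
    )
    return {
        "tool_selection_correct": tool_correct,
        "arguments_verdict": response.choices[0].message.content.strip(),
    }
```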
00:03:58.600 | And so it's pretty important to actually evaluate, did it call the right tool and did it pass the right
00:04:05.720 | arguments into that tool call itself? And I'm going to go into a little bit of depth on this and show you
00:04:10.600 | how we think about evals actually from a product perspective. This is the Arize product. You can see
00:04:17.480 | here, I'm actually tracing and looking at the traces of our own co-pilot here. So this is our own
00:04:24.120 | co-pilot, our own agent. Think of it almost like an insights tool where teams can come
00:04:33.000 | in and ask all sorts of different questions about where their application is doing well or not doing well,
00:04:37.640 | and use this to actually troubleshoot their application and suggest improvements. And like any great product,
00:04:45.480 | we actually dog food our own tooling. And so these are actually the traces of different questions that users
00:04:50.520 | have asked us. And we actually evaluate these interactions so we can understand where our co-pilot's doing really
00:04:57.320 | well and where it's actually not doing well. One thing that we actually like to look a lot at is not just kind of the
00:05:05.720 | individual traces, but actually starting at a little bit more of a higher level view where we can look
00:05:11.320 | across all the different paths, all the different trajectories, that our agent actually
00:05:17.560 | can go down. So in our case, this is actually the architecture of our agent. You can see here,
00:05:24.360 | we follow a little bit of an orchestration-worker type pattern, where, at the very high level, there's
00:05:30.600 | a planner that decides, based off of the information, what path to go down, and there are all sorts of different
00:05:37.240 | tools that it can then call. And sometimes, depending on the output of those tools, it might need to call
00:05:42.200 | even another router or orchestrator to figure out, you know, what next tool call to actually call.
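As a rough sketch of that orchestrator-worker shape (not the actual co-pilot code; the planner callable, tool registry, and decision format here are all assumptions):

```python
# Minimal orchestrator-worker loop: a planner picks the next tool (or decides to answer),
# workers execute the tool, and the result is fed back as context for the next decision.
def run_agent(user_question: str, planner, tools: dict) -> str:
    context = [{"role": "user", "content": user_question}]
    while True:
        # Hypothetical planner output: {"tool": "search", "args": {...}} or {"answer": "..."}
        decision = planner(context)
        if "answer" in decision:
            return decision["answer"]
        tool_output = tools[decision["tool"]](**decision["args"])
        # Depending on the tool output, the planner (or a sub-router) decides the next call.
        context.append({"role": "tool", "content": str(tool_output)})
```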
00:05:49.720 | And so there's kind of multiple levels to this to actually make sure that ultimately we respond to the
00:05:55.320 | user in a good way. With this specific agent of ours, you can see here that, for me,
00:06:04.920 | what I really think a lot about is, well, at the planner level, at the very beginning,
00:06:11.960 | across all the different paths that this agent could go, where is it doing really well? And are there
00:06:17.320 | any bottlenecks in performance? And as I look through some of this, there's
00:06:21.880 | evals around questions that are just related to generic questions that the users are asking.
00:06:27.560 | It looks like we're actually not doing so well on search. I can see here, we're pretty
00:06:33.480 | consistently getting it correct about half the time and incorrect about half the time,
00:06:38.920 | which is not that great. So this is probably an area that I would dive into and
00:06:43.480 | look at the bottlenecks of where we're not doing so well when the user's
00:06:48.520 | asking search questions. And it looks like there's other questions that we're actually doing
00:06:53.480 | pretty well on. So this type of high-level view first is just giving me a view of all the different
00:06:59.240 | paths that my agent can go down and really kind of pinpointing to me what I should go focus on
00:07:04.680 | specifically. So now when I go look inside of my traces, I can actually start with something like
00:07:10.200 | the Q&A correctness and look at, well, what I should probably care about, in this case
00:07:19.640 | specifically the search Q&A correctness. So what I should probably go look at is search
00:07:25.960 | Q&A correctness when it's incorrect. Let me go take a look at some of the examples of that and try
00:07:30.760 | to understand what I'm doing wrong here. And so in this case, when I'm looking at these, I can
00:07:36.520 | actually now drill in and go into specific traces. And at this level, I have evals across the entire trace.
00:07:44.920 | I have evals on the individual tool calls. And at the tool call level, I have
00:07:52.200 | not only an eval on did it call the right tool call (it says the function call is correct here),
00:07:57.080 | but also evals on did it actually pass all the right arguments. In this case, it looks like
00:08:02.920 | that's where it's going wrong. It says, "therefore, the arguments do not align with the required parameters,
00:08:07.400 | leading to the conclusion that the arguments are incorrect." So I probably have an issue here where,
00:08:11.960 | even though it looks like I'm calling the right tool out of all the different ones that I have,
00:08:17.080 | maybe I'm not passing in the right arguments into my tool call based on the
00:08:22.920 | context of the conversation. So that's something that I should go fix. So this is kind of the first
00:08:29.160 | big one that we think a lot about. The next one that is pretty interesting is also, you know, for a lot of
00:08:35.400 | of these, it's not just a single tool call that's made. It's many, actually. And you can kind of see
00:08:40.840 | that when I'm showing you the way that our architecture is built. Even within the Q&A correctness,
00:08:45.960 | when we go down the search correctness path, there's actually a lot
00:08:51.720 | of different sub-tools that are called here. So it's pretty important to check
00:08:57.880 | not only whether it's individually calling a single tool correctly, but also whether it's getting the order of
00:09:03.080 | the tools that it's supposed to call correct. And that's really what trajectory evals
00:09:08.040 | start to become about: is it calling tool calls in the right order?
00:09:15.160 | Across the series of steps needed for an agent to complete a task, is it consistently calling
00:09:24.280 | and executing them in the same set of steps and eventually converging on, you know,
00:09:30.440 | X number of steps to complete the task, or does it sometimes veer off and call them in a different
00:09:36.360 | order, and therefore require me to, A, spend a lot of tokens in order to do the same
00:09:42.200 | ask, and, B, end up providing wrong outputs because of that?
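One simple way to check this, sketched below under the assumption that you already log the sequence of tool calls per trace, is a code-based trajectory eval that compares the observed order against an expected one (the function name and example trajectories are hypothetical):

```python
from typing import List


def eval_tool_call_order(actual_calls: List[str],
                         expected_calls: List[str],
                         allow_extra_calls: bool = False) -> dict:
    """Check whether an agent called its tools in the expected order across a trace."""
    extra_calls = max(0, len(actual_calls) - len(expected_calls))

    if not allow_extra_calls:
        # Strict: the agent must call exactly these tools, in exactly this order.
        return {"order_correct": actual_calls == expected_calls,
                "extra_calls": extra_calls}

    # Lenient: the expected tools must appear in order, but detours and retries
    # are tolerated and surfaced as extra (token-wasting) calls.
    remaining = iter(actual_calls)
    in_order = all(step in remaining for step in expected_calls)
    return {"order_correct": in_order, "extra_calls": extra_calls}


# Hypothetical example: the agent repeated a search before summarizing.
print(eval_tool_call_order(
    actual_calls=["plan", "search", "search", "summarize", "format"],
    expected_calls=["plan", "search", "summarize", "format"],
    allow_extra_calls=True,
))  # -> {'order_correct': True, 'extra_calls': 1}
```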
00:09:48.280 | And so we recommend teams actually drill in and look not just at an individual
00:09:55.000 | tool call, but across an entire trace, looking at the order of the
00:10:01.320 | tool calls to see if that's actually done well, and then evaluating overall, in this case,
00:10:07.080 | where I have it marked incorrect. And so in this case, is it actually consistently getting the tool calling
00:10:12.200 | order correct? The next step here is, well, you can look across a single
00:10:19.080 | individual trace or interaction, but a lot of these interactions we're seeing with agents actually end
00:10:23.800 | up being multi-turn. So in this case, I have three back-and-forths between a human and an agent.
00:10:29.800 | And there's a lot of interesting questions you can actually ask at this stage. You can ask questions like,
00:10:34.680 | is the agent consistent in tone? Is the agent maybe asking the same question again and again,
00:10:42.120 | in which case it's not really learning anything from the previous interactions that it's had with
00:10:46.920 | the human? And part of that is really, does it keep track of context from the previous
00:10:53.160 | n minus one turns in order to be able to answer the nth turn of that conversation really well? And so these
00:11:00.120 | are all the types of questions to think about when you have something that is multi-turn.
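As an illustration, a session-level eval can hand the judge the entire conversation rather than a single turn. This is a minimal sketch; the prompt wording and model choice are assumptions, not a specific Arize template:

```python
from typing import Dict, List

from openai import OpenAI

client = OpenAI()

SESSION_JUDGE_PROMPT = """You are evaluating a multi-turn conversation between a user and an AI agent.

Conversation:
{conversation}

Answer each question with yes or no, followed by a brief explanation:
1. Is the agent consistent in tone across turns?
2. Does the agent avoid re-asking questions the user already answered?
3. Does the agent use context from earlier turns when answering later turns?"""


def eval_session(turns: List[Dict[str, str]]) -> str:
    # turns: [{"role": "user" | "assistant", "content": "..."}, ...] for the whole session
    conversation = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any judge model
        messages=[{"role": "user",
                   "content": SESSION_JUDGE_PROMPT.format(conversation=conversation)}],
        temperature=0,
    )
    return response.choices[0].message.content
```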
00:11:04.360 | And I'll actually show you an example from another project here where I do have kind of some of that
00:11:11.240 | multi-turn interaction. This may be one where I have this kind of back and forth with an agent.
00:11:18.680 | And what I care about here, across this entire conversation, is: did I actually correctly
00:11:25.400 | answer all the questions? Did I actually make sure I kept context so that I wasn't missing
00:11:30.440 | context that was given earlier in the conversation? So I deeply recommend folks actually think
00:11:35.720 | about session evaluations as part of evaluating their agents. And then lastly,
00:11:41.560 | I'll go through this, and I think this will be a good spot for us to deep dive into,
00:11:48.760 | which is: we spent a lot of time, even just now on tool calling, trajectories, and sessions,
00:11:54.760 | talking about how to think about evaluating the agent or the application prompt. And this is
00:12:01.800 | really important. I mean, we could spend a whole deep dive on this itself, but it's really important
00:12:08.200 | to evaluate it correctly and identify where it goes wrong, so that you can annotate or
00:12:15.640 | refine those outputs and use those to actually improve your existing prompt. And I think a lot
00:12:21.480 | of teams actually totally get this and are doing this all the time to improve the agent prompt.
00:12:26.200 | But what's really important is that the evals you're actually using, the crux of
00:12:32.600 | how you identify those failure cases, end up becoming crucial to calling out what you need to improve.
00:12:39.640 | And you don't want those evals, the prompts for those evals when you're using LLM as a judge, to remain static.
00:12:46.680 | And so there really is another loop going on here, which is about improving the evals and the
00:12:53.640 | eval prompts. And part of this is consistently checking the eval outcomes as
00:13:01.560 | well, to make sure it's not just the application that got it wrong; it could have been the eval that
00:13:06.680 | incorrectly labeled it as wrong. And you start to identify where the eval itself might need
00:13:14.760 | some improvements. And similar to the process you did for the agent application, do a workflow where
00:13:20.520 | you're iterating on the eval prompts, building up a golden dataset, and consistently refining it.
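To sketch what that second loop can look like in code (the golden dataset format and the judge callable here are assumptions), you can score each candidate eval prompt by how often it agrees with your human labels and keep the best one:

```python
from typing import Callable, Dict, List

# A "golden dataset" here is assumed to be a list of examples your SMEs have
# hand-labeled, e.g. {"input": ..., "output": ..., "human_label": "correct" | "incorrect"}.
GoldenExample = Dict[str, str]


def eval_prompt_accuracy(judge: Callable[[str, GoldenExample], str],
                         eval_prompt: str,
                         golden_dataset: List[GoldenExample]) -> float:
    """Fraction of golden examples where the LLM judge agrees with the human label."""
    agreements = sum(
        judge(eval_prompt, example) == example["human_label"]
        for example in golden_dataset
    )
    return agreements / len(golden_dataset)


def pick_best_eval_prompt(judge: Callable[[str, GoldenExample], str],
                          candidate_prompts: List[str],
                          golden_dataset: List[GoldenExample]) -> str:
    """Iterate on eval prompts the same way you iterate on agent prompts:
    keep the candidate whose labels best match your human annotations."""
    return max(candidate_prompts,
               key=lambda p: eval_prompt_accuracy(judge, p, golden_dataset))
```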
00:13:27.960 | And there really are two iterative loops going on at the same time:
00:13:34.120 | one for your agent application prompts, and one for your eval prompts.
00:13:39.960 | And as you think about this, both of them kind of go hand in hand to actually create
00:13:44.920 | a really good product experience for teams. There's a lot more in here that I think
00:13:51.880 | we can dive into, but hopefully this gives you a little bit of a primer about how to think about
00:13:57.880 | agent evaluations. And check out Arize Phoenix. It's a completely open-source product that you can
00:14:04.760 | use to learn a lot about what we just went through and test it out in your own applications.
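For reference, a minimal way to try the open-source package locally looks roughly like this (based on the standard Phoenix quickstart; check the current docs for exact instrumentation and eval steps):

```python
# pip install arize-phoenix
import phoenix as px

# Launch the local Phoenix UI; traces from an instrumented application show up here,
# and you can run LLM-as-a-judge evals over them.
session = px.launch_app()
print(session.url)  # open this in a browser to view traces and evals
```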
00:14:09.560 | You can check out Arize AX if you want to think about how to run a lot of these evaluations
00:14:16.600 | on your own data. So feel free to check it out, and hopefully you guys got something out of this.
00:14:22.760 | Thanks everyone for the time.