Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo

Chapters
0:00 Introduction
0:27 Who uses AI
4:30 Set a thief
6:50 Evaluations
9:38 Example
12:13 Insights
14:24 Recap
00:00:07.000 |
So, I'm here to talk about taming rogue AI agents. 00:00:18.380 |
Essentially, I want to talk about evaluation-driven development, 00:00:21.240 |
observability-driven development, but really why we need observability. 00:00:28.480 |
Is that Jim's most stupid question of the day? Probably. 00:00:36.500 |
Right, if you'd like to meet me afterwards, I've got some snake oil you might be interested in buying. 00:00:49.040 |
If you want some recommendations for books, the Chicago Sun-Times recently published this list 00:00:55.140 |
of books that you could enjoy over the summer. 00:00:58.800 |
Atonement by Ian McEwan, great book, a fantastic movie with Keira Knightley. 00:01:03.300 |
And then, oh, The Last Algorithm by Andy Weir, that sounds fun. 00:01:08.380 |
Yeah, but do you want to read Andy Weir's new book, The Last Algorithm? 00:01:18.060 |
And they had an outside contractor generate this summer reading list, and this contractor used AI, and it hallucinated. 00:01:31.720 |
Now, they actually had to publish an article saying, "Sorry, we mucked up," but this happens. 00:01:40.880 |
You know, I'm sure we can all have opinions on that, but we're generally supposed to trust the news. 00:01:45.380 |
But yet, we can't if it's using AI to generate this kind of content. 00:01:49.380 |
Now, am I worried that the Chicago Sun-Times is going to sue me for saying that they made this stuff up? 00:02:00.480 |
This is a recent case where Butler Snow cited false case law defending the Alabama prison system. 00:02:12.180 |
They're from pretty much the same week a couple of weeks ago. 00:02:14.820 |
But we've all seen these examples, haven't we? 00:02:16.740 |
We've all seen Air Canada's chatbot saying you can get a refund, and then they're legally obliged to honour it, things like that. 00:02:23.180 |
And so we understand that AI has this problem that it makes stuff up. 00:02:28.140 |
Almost like every day in the news, it's another story about how AI has broken something. 00:02:33.180 |
And the problem we have is that detecting problems with AI is hard. 00:02:44.940 |
Is that the same number of hands? I'm not sure it is. 00:02:48.380 |
But yes, a unit test is kind of easy to write. 00:02:51.380 |
Yeah, I have an add function. I can say add two and two, do I get four? Add three and three, do I get six? 00:03:00.260 |
I can't say if I put this input into my AI system, will it give this output? 00:03:06.260 |
At the most basic, if I ask a single question, I can possibly look for keywords. 00:03:11.260 |
But if I've got a complex agentic workflow, I have an application and an input comes in, it calls an LLM. 00:03:17.700 |
That LLM gets data, makes a decision, calls an agent, gets data, makes a decision, calls a tool, gets data, and so on and so on and so on. 00:03:24.700 |
That is really, really hard for me to actually evaluate. It's really hard for me to say, did it work? 00:03:29.940 |
Because partly, what does 'work' even mean? Especially with something like a chatbot, where we're having a human conversation, how do we define what 'work' means? 00:03:38.940 |
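To make that gap concrete, here's a minimal sketch. The add function is trivially testable; the chatbot is not, because there is no single correct string to assert against. ask_chatbot is a hypothetical wrapper around whatever model the app uses, not a real API:

```python
# Deterministic code: an exact assertion works every time.
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    assert add(2, 2) == 4
    assert add(3, 3) == 6

# Non-deterministic LLM output: the wording (and sometimes the behaviour)
# changes run to run, so an exact assertion is flaky by design.
def test_balance_question():
    answer = ask_chatbot("What is my account balance?")  # hypothetical helper
    assert answer == "Your checking account balance is $1,234.56"  # no single "right" string exists
```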
And this is the problem that we face. So how do we do it? There's an old, I believe it was a British expression. 00:03:46.180 |
Any Brits in the room other than me? Yay, lots of cool people in the room. We like it. 00:03:50.220 |
There's an old British expression called set a thief to catch a thief. And the idea with that expression is if you want to know how a thief works, you set a thief to do it. 00:03:59.100 |
The thief understands the thief, so knows how to catch it. And we can kind of apply that logic to AI. 00:04:04.740 |
We can set an AI to verify an AI. We can actually ask a non-deterministic system like an AI to evaluate an AI for us. 00:04:13.740 |
And it turns out, AIs are not bad at this. They're about as good as a human is at determining whether an AI actually worked. 00:04:21.300 |
And that opens up this whole new world of things we can do in that we can use AI to evaluate, is our AI application actually working? 00:04:29.060 |
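A rough sketch of that LLM-as-judge idea, as a generic pattern rather than Galileo's actual API. The call_llm helper and the judge prompt are assumptions for illustration:

```python
import json

# Ask a (stronger) model to score another model's answer.
JUDGE_PROMPT = """You are evaluating an AI assistant.
User question: {question}
Assistant answer: {answer}
Did the answer correctly and helpfully address the question?
Reply as JSON: {{"score": <number between 0 and 1>, "reason": "<one sentence>"}}"""

def judge(question: str, answer: str) -> dict:
    # call_llm is a hypothetical wrapper around whichever evaluation model you choose.
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)
```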
So I've got a demo here. I'm not going to do this demo live because conference Wi-Fi? Have you all had fun with Wi-Fi? 00:04:36.700 |
Yes. The gentleman in the back there very kindly managed to get me connected to an actual physical cable. So things are great. 00:04:45.340 |
The concept you've just said, set a thief to catch a thief, using that in AI, but isn't that building a not-so-trustworthy loop because... 00:04:54.980 |
Great question. We will be getting to that. Good question though. 00:04:58.620 |
So here's an example here. This is not live because, well, one, the Wi-Fi, and two, this is an AI application. 00:05:06.620 |
There's no guarantee it's actually going to break the way I want it to break when I'm demoing it to you because it's non-deterministic. 00:05:10.940 |
But this is a basic chatbot conversation. I've actually got this demo on my laptop. 00:05:15.740 |
If you want to come and see this in action, come to the Galileo booth and I'll demo it. 00:05:18.620 |
But I'm basically asking, what is my account balance? Think about kind of a fintech chatbot. What's my account balance? 00:05:23.900 |
And the response is, I don't have access to account information. 00:05:26.700 |
It's not very helpful. It's kind of true. I don't have access. Not very helpful. 00:05:32.300 |
You know, ideally I want to say, yes, you've got a million dollars or whatever it is. 00:05:36.260 |
I don't. If anyone wants to donate, I would appreciate that. 00:05:39.500 |
But I'm going to add a follow-up question. What is the balance of my checking account? 00:05:44.700 |
And I was hoping when I did this demo, it would come back and say, 00:05:48.060 |
you've got X amount of money. Instead, it came back to say, 00:05:51.340 |
please could you let me know the name of your checking account? 00:05:53.340 |
So I didn't even know what my bot was going to do as I was working through it. 00:05:56.780 |
It's asking me questions. And I responded to say, it's called checking account. 00:06:01.100 |
You know, four hard things in computer science, 00:06:04.540 |
naming things, cache invalidation, off by one errors. Yeah. 00:06:07.980 |
Someone got the joke. Cool. But yeah, so I called it checking account. 00:06:12.540 |
And now it was able to call a tool and go and look at the checking account. Now, 00:06:17.500 |
did this AI work? Did it work? What do we think? Who thinks, 00:06:21.020 |
hands up if you think the whole AI chatbot worked? 00:06:23.660 |
I mean, yes, you're right. It did work because within a few steps, I got my account balance. 00:06:32.060 |
More hands. Yes, you're right. It didn't work because 00:06:36.780 |
it took me three steps to get my account balance. 00:06:38.620 |
So it's not a good thing. So I think about, how can I evaluate this? And this is where evaluations come in. 00:06:44.780 |
So I want to look at all the different steps in a flow and look at different metrics to measure 00:06:49.260 |
how well this did. So when we think about these kind of evaluations, essentially what we have 00:06:55.820 |
to do is we have to take a lot of data, we have to take everything that's coming in, 00:06:58.700 |
and we have to define at all the different steps in the process what things we want to look for. 00:07:04.540 |
Did it successfully call tools? Is it retrieving the right information from a RAG system? 00:07:09.180 |
Is it actually giving an answer that makes sense? Is it hallucinating? There's a lot of different 00:07:13.660 |
metrics that you can define that evaluates whether or not the whole thing was successful. 00:07:18.700 |
And ideally, you want to break that down by all the steps in your flow. I have a multi-agent app. 00:07:25.020 |
When I call my app to get my account balance, there's an agent that orchestrates it that calls 00:07:29.820 |
another agent that calls a tool. And I need to look at that breakdown by all the individual steps and 00:07:35.100 |
measure where the failures happen. I need to be able to do these evaluations at every single 00:07:40.140 |
component. It's not just that binary did my agent work, yes or no question. It's at what step in the 00:07:47.340 |
process did my agent fail. So I have to get this level of granularity. That is really, really important 00:07:52.860 |
that we have granularity when we look at these things. And then the way we work out these numbers, 00:07:57.580 |
as I said, we set the thief to catch the thief. We use an LLM, or usually multiple calls to an LLM, 00:08:04.620 |
to evaluate the metric. We say to an LLM, with this input, and this information from a RAG system, 00:08:11.820 |
this is the output that came out, score it. And the idea is you use a better LLM to score than the LLM 00:08:19.500 |
you use in your application. In your main application, you want the cheapest LLM possible, 00:08:23.820 |
because we all like making money. If you don't like making money, send it to me. But we want 00:08:28.540 |
the cheapest LLM possible, but we want to use the best LLM possible to do the evaluations. Going back to your 00:08:32.700 |
question there. Ideally, you want to use a better LLM to actually do these evaluations. You want to say, 00:08:37.820 |
you'll get a million traces a day. We're going to test, say, 10,000 of them using an expensive LLM 00:08:44.860 |
to prove that it works. Ideally, you want to use a custom-trained LLM. Something Galileo offers is we 00:08:50.460 |
have a custom-trained LLM that's a small language model that's designed to be really, really good at 00:08:54.460 |
evaluations. But the idea is you use this LLM to do it with a well-defined set of prompts to extract 00:09:00.220 |
this information. And then you bake this into your workflows. And you do this right from day one. 00:09:04.860 |
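A minimal sketch of that sampling idea, assuming the judge helper from the earlier sketch: the application keeps using the cheap model, and only a sample of traces is scored by the expensive evaluation model. All names here are hypothetical:

```python
import random

def evaluate_sample(traces: list[dict], sample_size: int = 10_000) -> list[dict]:
    """Score a sample of traces with the stronger, more expensive judge model."""
    sample = random.sample(traces, min(sample_size, len(traces)))
    return [
        # Each trace is assumed to hold the original input and the cheap model's output.
        {**trace, "evaluation": judge(trace["input"], trace["output"])}
        for trace in sample
    ]
```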
So who is just starting building apps? Anyone who's just started building apps? A few hands. Who's got 00:09:10.380 |
an app in production? Okay. All of you need evaluations like now. The best time to put evaluations in is as 00:09:18.380 |
you're doing prompt engineering and model selection. The second best time is now. So you want to think 00:09:22.860 |
about this right from the get-go. As you're building an application, you want to start adding those 00:09:27.100 |
evaluations when you're doing your initial prompt engineering, when you're doing your model selection. 00:09:30.940 |
You want to keep those in your dev cycle and your CI/CD pipelines. And then you want to observe 00:09:34.540 |
these in production as users start throwing garbage at your system. So let's look at a couple. 00:09:39.820 |
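Before those traces, one sketch of what keeping evaluations in a CI/CD pipeline could look like, pytest-style. The golden prompts, the threshold, and the run_agent and judge helpers are all hypothetical:

```python
# Fail the build if agent quality regresses on a small golden set of prompts.
GOLDEN_PROMPTS = [
    "What is my account balance?",
    "What is the balance of my checking account?",
]

def test_agent_quality_gate():
    scores = []
    for prompt in GOLDEN_PROMPTS:
        answer = run_agent(prompt)                     # hypothetical call into the agent under test
        scores.append(judge(prompt, answer)["score"])  # LLM-as-judge helper from the earlier sketch
    average = sum(scores) / len(scores)
    assert average >= 0.8, f"Evaluation score dropped to {average:.2f}"
```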
Here's a whole load of traces from that chatbot with some nice red and green numbers. And I want to 00:09:44.620 |
highlight these three rows. And these rows match what I was trying to do with the chatbot. Okay. So the 00:09:51.340 |
first row here, we have got, what is my account balance? And the response, I don't have access to account 00:09:56.860 |
information. And I've got two metrics here. Action completion and action advancement. Action completion 00:10:03.900 |
is did it actually do the thing it was asked to do? So it measures across the whole flow from the 00:10:09.340 |
input to the output. Did it actually complete the task that it was asked to do? Action advancement 00:10:15.340 |
is did it move forward towards the end goal? And they're two very subtly distinct metrics. Now in the 00:10:22.700 |
case of the first one, what's my account balance? I don't know. Don't know anything. Didn't complete, 00:10:28.860 |
didn't advance. So we know there's a problem with that one. Second one, what is the balance of my 00:10:34.060 |
checking account? Didn't complete. I don't have a balance, but it advanced. 00:10:39.180 |
So I can see that yes, it realized that it needs to know the name of the account. So it advanced one 00:10:43.740 |
step further. So I can say actually, yes, with this kind of prompt, it advances. And then finally, 00:10:48.700 |
when I say yes, my checking account is called checking account, it completed, gave me the results, 00:10:53.740 |
and it advanced all the way through. So I can see from these metrics which prompts worked, 00:10:57.580 |
which prompts didn't work. And I can use this to continue to improve what I'm doing. Now, 00:11:03.420 |
obviously, these numbers are kind of a whole overarching number across the whole thing. 00:11:08.300 |
Obviously, I kind of need to have some form of breakdown. So that's what I've got here. This is the 00:11:12.940 |
individual trace that comes in. It calls an LLM. The LLM decides to call a tool, pulls data, decides to call the 00:11:19.340 |
LLM to process that data and show it out the other side. And that's showing those steps. 00:11:24.220 |
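As a sketch of what that per-step breakdown might look like as data: one session-level score plus a score for each span in the trace. The structure, metric names, and numbers are illustrative assumptions, not Galileo's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agentic trace: an LLM call, a tool call, a retrieval, and so on."""
    name: str
    metrics: dict[str, float] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

# Illustrative scores loosely mirroring the demo conversation.
trace = Span(
    name="what is the balance of my checking account?",
    metrics={"action_completion": 0.0, "action_advancement": 1.0},
    children=[
        Span("llm: decide next step", {"instruction_adherence": 0.9}),
        Span("tool: get_balance", {"tool_selection_quality": 0.2}),
        Span("llm: format response", {"completeness": 0.7}),
    ],
)

def worst_step(span: Span, metric: str):
    """Find the child span with the lowest score for a given metric, if any."""
    scored = [s for s in span.children if metric in s.metrics]
    return min(scored, key=lambda s: s.metrics[metric], default=None)
```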
And at each level, I can get whatever metrics are relevant. So I can look at the overarching. 00:11:30.860 |
It's red. It's a bad thing because it's red. And then I can dive into each individual step and see why 00:11:37.180 |
it's red and look at all those different layers. And that's really, really important. You have to have this 00:11:40.860 |
understanding of the architecture of your agentic systems so that you can do this analysis at each 00:11:46.780 |
individual level. And then depending on what's happening, you can then farm out the fixing of the 00:11:52.780 |
problem to the relevant team. Maybe your RAG application is terrible. Maybe you need to tune 00:11:58.140 |
one of your prompts. But by having this level of granularity, you can make those smart decisions 00:12:03.420 |
around it. Now, what's also cool is this is a lot of unstructured data. What do we know that is good for 00:12:10.860 |
working with unstructured data? AI. Yes. And so what's cool as well is when you start putting an LLM over the top of 00:12:18.860 |
this, you can get some really smart insights coming out. So this is some insights that I 00:12:24.220 |
generated. Basically, an AI will go over all the data and say, this metric is low. How can I make it better? 00:12:31.580 |
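A rough sketch of how that kind of insight generation might work under the hood, before coming back to the insight the demo actually produced. This is a generic pattern, not Galileo's implementation; call_llm and the prompt are hypothetical:

```python
import json

INSIGHT_PROMPT = """Here are {n} conversation traces where the metric '{metric}' scored low:
{traces}

Describe the most common failure pattern and suggest one concrete change
(for example to the system message or the tool definitions) that would improve it."""

def generate_insight(low_scoring_traces: list[dict], metric: str) -> str:
    # call_llm is the same hypothetical helper as in the earlier sketches.
    return call_llm(INSIGHT_PROMPT.format(
        n=len(low_scoring_traces),
        metric=metric,
        traces=json.dumps(low_scoring_traces, indent=2),
    ))
```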
And this is really cool. And this is saying, yeah, the LLM occasionally fails to use the get balance tool when 00:12:36.860 |
asked about account balances. And that's basically the fundamental problem. When I say, what is the 00:12:41.580 |
balance of my account? What would you expect to happen in a chatbot? What's the balance of my account? What would you expect? 00:12:46.620 |
Anyone? Shout out. To get your balance. Exactly. Yeah. And probably if you have multiple accounts, 00:12:53.980 |
you would get the balance of all your accounts. Yeah. This is your checking, savings, credit card, 401k, 00:12:59.180 |
whatever. And so it kind of makes sense that the way to improve the effectiveness, to get us closer to where we all 00:13:05.020 |
have a consensus that this agent is working, would be that if I say, give me the account balance, 00:13:11.660 |
it goes to all the accounts and it shows me all the balances. So the suggested action here is adding 00:13:16.700 |
explicit instructions to my system message. So not only have I identified this problem through my 00:13:22.700 |
evaluations, but I've got a suggestion for fixing it. Now, it's not automatically going to fix it for me, 00:13:27.740 |
because there be dragons in that: what if that mucks up, and then I need to evaluate my automatic 00:13:33.020 |
fixes in my evaluations, and the snake swallows its tail? But this has given me suggestions. So the human 00:13:38.540 |
in the loop, and that's really important, as a human, I can look at this and say, yeah, 00:13:42.460 |
this is the fix that I want to make. Now, I do want to emphasize that whole human in the loop thing is 00:13:49.260 |
really, really important. So when you're generating metrics, there's no guarantee the metrics you 00:13:54.860 |
generate are actually going to be correct. Because the AI, going back to your question over there, 00:13:59.420 |
the AI could get it wrong. And so one thing you want to do is make sure that you're using a system that 00:14:04.060 |
has human feedback, like CLHF, continuous learning by human feedback. You want humans to 00:14:09.980 |
evaluate the numbers and say, okay, this is actually working. The metric was low. Here's the reason. 00:14:15.820 |
Retune and have that continuous training of your metrics because your metrics will never be perfect 00:14:21.100 |
out the box. You need this continuous level of training. 00:14:25.180 |
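A minimal sketch of what recording that human feedback could look like, assuming the judge scores from the earlier sketches. The shape of the data and the tolerance are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class MetricFeedback:
    trace_id: str
    metric: str
    model_score: float   # what the judge LLM said
    human_score: float   # what the human reviewer said
    comment: str         # the reviewer's reason

def disagreements(feedback: list[MetricFeedback], tolerance: float = 0.3) -> list[MetricFeedback]:
    """Cases where the human disagreed with the judge. These become the examples
    you feed back in (for instance as few-shot examples) to retune the evaluation prompts."""
    return [f for f in feedback if abs(f.model_score - f.human_score) > tolerance]
```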
So to get this all right, what do we have to do? Step one, add evaluations to your agent. As I said, 00:14:32.620 |
the best time to do it is before you even start. The second best time is now. If you don't have 00:14:36.700 |
evaluations, get them in now so you can make sure your agent is not making stuff up. You do not want 00:14:41.580 |
to be the next Chicago Sun-Times. Then you need to measure precisely what you need. Different tools, 00:14:48.780 |
different applications have different measurements of what they need. Do I need to measure whether the 00:14:53.660 |
inputs and outputs are toxic? Do I need to measure for hallucinations? Do I need to measure 00:15:00.140 |
for a comprehensible output? Do I need to measure for RAG? Do I have some kind of custom measurement that 00:15:05.500 |
only I know about that's specific to my use case? You want to be defining those measurements and those 00:15:10.220 |
metrics up front as you are thinking about your prompts, your structure of your app, your agents. 00:15:15.580 |
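One way to pin that down at design time is a per-application list of the metrics you intend to run, plus any custom ones. A hedged sketch, with made-up metric names rather than any fixed taxonomy:

```python
# Decide up front which metrics each application needs, including custom ones.
EVALUATION_CONFIG = {
    "fintech-chatbot": {
        "metrics": [
            "action_completion",
            "action_advancement",
            "toxicity",
            "hallucination",
            "rag_context_relevance",
        ],
        "custom_metrics": ["mentions_unsupported_account_types"],  # use-case specific
        "alert_threshold": 0.7,  # page someone if a score drops below this in production
    },
}
```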
Right at design time, you think about exactly what we need to measure. And then as you build it, 00:15:21.500 |
keep that going all through your production. This is not just a test in dev. Because let's be honest, 00:15:26.300 |
when users get hold of your system, they do stuff you don't expect. How many times have you tested 00:15:31.020 |
something to the nth degree and it breaks the second a user gets on it? Damn those users. But they do things you 00:15:36.860 |
don't expect. And so you have to have this in production as well to make sure that you've got 00:15:41.340 |
everything in place. And then you want to have this real-time prevention. You want to have alerting when 00:15:45.740 |
it goes wrong. If your AI agent goes rogue, maybe you need to be woken up. So that is how you can tame 00:15:50.940 |
AI agents with evaluations. I'm Jim Bennett. I'm a principal developer advocate at Galileo. Come and talk to 00:15:55.340 |
us at the booth in the expo if you want to learn more. Scan that if you want to sign up for Galileo. 00:16:00.780 |
We have a free offering. But if you want to learn more about it, come and meet me at the booth. With that, 00:16:05.420 |
thank you very much. And I will take some questions, I believe.