
Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo


Chapters

0:00 Introduction
0:27 Who uses AI
4:30 Set a thief
6:50 Evaluations
9:38 Example
12:13 Insights
14:24 Recap

Transcript

00:00:07.000 | So, I'm here to talk about taming rogue AI agents.
00:00:18.380 | Essentially, I want to talk about evaluation-driven development,
00:00:21.240 | observability-driven development, but really why we need observability.
00:00:25.640 | So, who uses AI?
00:00:28.480 | Is that Jim's most stupid question of the day? Probably.
00:00:31.940 | Who trusts AI?
00:00:36.500 | Right, if you'd like to meet me afterwards, I've got some snake oil you might be interested in buying.
00:00:41.100 | Yeah, we do not trust AI in the slightest.
00:00:45.480 | Now, different question, who reads books?
00:00:48.160 | Let's read the books.
00:00:49.040 | If you want some recommendations for books, the Chicago Sun-Times recently published this list
00:00:55.140 | of books that you could enjoy over the summer.
00:00:58.800 | Atonement by Ian McEwan, great book, a fantastic movie with Keira Knightley.
00:01:03.300 | And then, oh, The Last Algorithm by Andy Weir, that sounds fun.
00:01:06.640 | Who watched The Martian?
00:01:08.380 | Yeah, but do you want to read Andy Weir's new book, The Last Algorithm?
00:01:11.900 | Well, you can't. It doesn't exist.
00:01:14.400 | So, this is a news site, newspaper.
00:01:18.060 | And they had an outside contractor generate this summer reading list, and this contractor used AI, and it hallucinated.
00:01:26.220 | Worse than a 1970s hippie music festival.
00:01:30.060 | A lot of hallucinations going on there.
00:01:31.720 | Now, they actually had to publish an article saying, "Sorry, we mucked up," but this happens.
00:01:38.380 | Now, we're supposed to trust the news.
00:01:40.880 | You know, I'm sure we can all have opinions on that, but we're generally supposed to trust the news.
00:01:45.380 | But yet, we can't if it's using AI to generate this kind of content.
00:01:49.380 | Now, am I worried that the Chicago Sun-Times is going to sue me for saying that they made this stuff up?
00:01:57.040 | Because lawyers are using AI for case law.
00:02:00.480 | This is a recent case where Butler Snow cited false case law defending the Alabama prison system.
00:02:08.180 | Score one against the prisons.
00:02:10.140 | Now, I picked these two examples.
00:02:12.180 | They're from pretty much the same week a couple of weeks ago.
00:02:14.820 | But we've all seen these examples, haven't we?
00:02:16.740 | We've all seen Air Canada's chatbot say you can get a refund, and then they're legally obliged to honor it, things like that.
00:02:23.180 | And so we understand that AI has this problem that it makes stuff up.
00:02:28.140 | Almost like every day in the news, it's another story about how AI has broken something.
00:02:33.180 | And the problem we have is detecting problems with AI is hard.
00:02:37.180 | It is a non-deterministic problem.
00:02:39.900 | Right, who's a coder? Who writes code?
00:02:42.180 | Okay, who writes unit tests?
00:02:44.940 | Is that the same number of hands? I'm not sure it is.
00:02:46.700 | You're bad, people!
00:02:48.380 | But yes, a unit test is kind of easy to write.
00:02:51.380 | Yeah, I have an add function. I can say add two and two, do I get four? Add three and three, do I get six?
00:02:57.820 | But I can't do that for an AI.
00:03:00.260 | I can't say if I put this input into my AI system, will it give this output?
00:03:06.260 | At the most basic, if I ask a single question, I can possibly look for keywords.
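
As a rough illustration of the contrast he is drawing, here is a minimal sketch. The add() function, the test names, and the model_answer string are all hypothetical: the deterministic test checks exact outputs, while the best a naive test of an AI answer can do is scan for keywords.

    def add(a, b):
        return a + b

    def test_add():
        # Deterministic: the same inputs always give the same outputs.
        assert add(2, 2) == 4
        assert add(3, 3) == 6

    def test_balance_answer(model_answer: str):
        # Non-deterministic: the wording changes on every run, so all a naive
        # test can do is look for keywords, which says very little about whether
        # the answer was actually correct or helpful.
        assert "balance" in model_answer.lower()
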
00:03:11.260 | But if I've got a complex agentic workflow, I have an application and an input comes in, it calls an LLM.
00:03:17.700 | That LLM gets data, makes a decision, calls an agent, gets data, makes a decision, calls a tool, gets data, and so on and so on and so on.
00:03:24.700 | That is really, really hard for me to actually evaluate. It's really hard for me to say, did it work?
00:03:29.940 | Because partly, what does "work" even mean? Especially with something like a chatbot, where we're having a human conversation, how do we define what "work" means?
00:03:38.940 | And this is the problem that we face. So how do we do it? There's an old, I believe it was a British expression.
00:03:46.180 | Any Brits in the room other than me? Yay, lots of cool people in the room. We like it.
00:03:50.220 | There's an old British expression called set a thief to catch a thief. And the idea with that expression is if you want to know how a thief works, you set a thief to do it.
00:03:59.100 | The thief understands the thief, so knows how to catch it. And we can kind of apply that logic to AI.
00:04:04.740 | We can set an AI to verify an AI. We can actually ask a non-deterministic system like an AI to evaluate an AI for us.
00:04:13.740 | And it turns out, AIs are not bad at this. They're about as good as a human is at determining whether an AI actually worked.
00:04:21.300 | And that opens up this whole new world of things we can do in that we can use AI to evaluate, is our AI application actually working?
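
A minimal sketch of that idea, where call_llm is a placeholder for whichever model client you actually use, and the prompt and JSON shape are illustrative rather than any particular product's API:

    import json

    def call_llm(prompt: str) -> str:
        """Placeholder for a real call to whatever LLM client you use."""
        raise NotImplementedError

    def judge_answer(question: str, answer: str) -> dict:
        # Ask a second model to decide whether the first model's answer worked.
        prompt = (
            "You are evaluating an AI assistant.\n"
            f"User question: {question}\n"
            f"Assistant answer: {answer}\n"
            "Did the answer address the question? "
            'Reply with JSON only: {"worked": true or false, "reason": "..."}'
        )
        return json.loads(call_llm(prompt))
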
00:04:29.060 | So I've got a demo here. I'm not going to do this demo live because conference Wi-Fi? Have you all had fun with Wi-Fi?
00:04:36.700 | Yes. The gentleman in the back there very kindly managed to get me connected to an actual physical cable. So things are great.
00:04:42.700 | So I've got this chatbot.
00:04:45.340 | The concept you've just said, set a thief to catch a thief, and you're using that in AI, but isn't that building a not-so-trustworthy loop because...
00:04:54.980 | Great question. We will be getting to that. Good question though.
00:04:58.620 | So here's an example here. This is not live because, well, one, the Wi-Fi, and two, this is an AI application.
00:05:06.620 | There's no guarantee it's actually going to break the way I want it to break when I'm demoing it to you because it's non-deterministic.
00:05:10.940 | But this is a basic chatbot conversation. I've actually got this demo on my laptop.
00:05:15.740 | If you want to come and see this in action, come to the Galileo booth and I'll demo it.
00:05:18.620 | But I'm basically asking, what is my account balance? Think about kind of a fintech chatbot. What's my account balance?
00:05:23.900 | And the response is, I don't have access to account information.
00:05:26.700 | It's not very helpful. It's kind of true. I don't have access. Not very helpful.
00:05:32.300 | You know, ideally I want to say, yes, you've got a million dollars or whatever it is.
00:05:36.260 | I don't. If anyone wants to donate, I would appreciate that.
00:05:39.500 | But I'm going to add a follow-up question. What is the balance of my checking account?
00:05:43.420 | I'm now giving it more information.
00:05:44.700 | And I was hoping when I did this demo, it would come back and say,
00:05:48.060 | you've got X amount of money. Instead, it came back to say,
00:05:51.340 | please could you let me know the name of your checking account?
00:05:53.340 | So I didn't even know what my bot was going to do as I was working through it.
00:05:56.780 | It's asking me questions. And I responded to say, it's called checking account.
00:06:01.100 | You know, the four hard things in computer science:
00:06:04.540 | naming things, cache invalidation, off-by-one errors. Yeah.
00:06:07.980 | Someone got the joke. Cool. But yeah, so I called it checking account.
00:06:12.540 | And now it was able to go and call a tool and look at the checking account. Now,
00:06:17.500 | did this AI work? Did it work? What do we think? Who thinks,
00:06:21.020 | hands up if you think the whole AI chatbot worked?
00:06:23.660 | I mean, yes, you're right. It did work because within a few steps, I got my account balance.
00:06:31.100 | Who thinks it didn't work?
00:06:32.060 | More hands. Yes, you're right. It didn't work because
00:06:36.780 | it took me three steps to get my account balance.
00:06:38.620 | So it's not a good thing. So I think about how I can evaluate this, and this is where evaluations come in.
00:06:44.780 | So I want to look at all the different steps in a flow and look at different metrics to measure
00:06:49.260 | how well this did. So when we think about these kinds of evaluations, essentially what we have
00:06:55.820 | to do is we have to take a lot of data, we have to take everything that's coming in,
00:06:58.700 | and we have to define at all the different steps in the process what things we want to look for.
00:07:04.540 | Did it successfully call tools? Is it retrieving the right information from a RAG system?
00:07:09.180 | Is it actually giving an answer that makes sense? Is it hallucinating? There's a lot of different
00:07:13.660 | metrics that you can define that evaluate whether or not the whole thing was successful.
00:07:18.700 | And ideally, you want to break that down by all the steps in your flow. I have a multi-agent app.
00:07:25.020 | When I call my app to get my account balance, there's an agent that orchestrates it that calls
00:07:29.820 | another agent that calls a tool. And I need to look at that breakdown by all the individual steps and
00:07:35.100 | measure where the failures happen. I need to be able to do these evaluations at every single
00:07:40.140 | component. It's not just that binary "did my agent work, yes or no?" question. It's "at what step in the
00:07:47.340 | process did my agent fail?" So I have to get this level of granularity. That is really, really important
00:07:52.860 | that we have granularity when we look at these things. And then the way we work out these numbers,
00:07:57.580 | as I said, we set the thief to catch the thief. We use an LLM, or usually multiple calls to an LLM,
00:08:04.620 | to evaluate the metric. We say to an LLM: with this input, and this information from a RAG system,
00:08:11.820 | this is the output that came out, score it. And the idea is you use a better LLM to score than the LLM
00:08:19.500 | you use in your application. In your main application, you want the cheapest LLM possible,
00:08:23.820 | because we all like making money. If you don't like making money, send it to me. But we want
00:08:28.540 | the cheapest LLM possible, but we want to use the best LLM possible to do the evaluations. Going back to your
00:08:32.700 | question there. Ideally, you want to use a better LLM to actually do these evaluations. You might say,
00:08:37.820 | we get a million traces a day; we're going to test, say, 10,000 of them using an expensive LLM
00:08:44.860 | to prove that it works. Ideally, you want to use a custom-trained LLM. Something Galileo offers is we
00:08:50.460 | have a custom-trained LLM that's a small language model that's designed to be really, really good at
00:08:54.460 | evaluations. But the idea is you use this LLM to do it with a well-defined set of prompts to extract
00:09:00.220 | this information. And then you bake this into your workflows. And you do this right from day one.
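
A sketch of that cost split, where the model names, the sample size, and score_trace are all assumptions rather than anything Galileo-specific: the live application runs on the cheap model, and a sampled slice of traces is re-scored by the better evaluation model.

    import random

    APP_MODEL = "cheap-fast-model"        # what the live application calls
    JUDGE_MODEL = "best-available-model"  # what the evaluations call

    def sample_traces(traces: list, k: int = 10_000) -> list:
        # e.g. judge ~10,000 traces out of the million you get per day
        return random.sample(traces, min(k, len(traces)))

    def average_eval_score(traces: list, score_trace) -> float:
        # score_trace(trace, model=...) is a placeholder for your judge call
        sampled = sample_traces(traces)
        scores = [score_trace(t, model=JUDGE_MODEL) for t in sampled]
        return sum(scores) / len(scores) if scores else 0.0
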
00:09:04.860 | So who is just starting building apps? Anyone who's just started building apps? A few hands. Who's got
00:09:10.380 | an app in production? Okay. All of you need evaluations, like, now. The best time to put evaluations in is as
00:09:18.380 | you're doing prompt engineering and model selection. The second best time is now. So you want to think
00:09:22.860 | about this right from the get-go. As you're building an application, you want to start adding those
00:09:27.100 | evaluations when you're doing your initial prompt engineering, when you're doing your model selection.
00:09:30.940 | You want to keep those in your dev cycle and your CI/CD pipelines. And then you want to observe
00:09:34.540 | these in production as users start throwing garbage at your system. So let's look at a couple.
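
As a sketch of what keeping evaluations in a CI/CD pipeline might look like in practice, before we get to those traces: run a fixed set of prompts through the agent, judge each response, and fail the build if the average drops. run_agent, score_response, the golden prompts, and the threshold are all assumptions.

    GOLDEN_PROMPTS = [
        "What is my account balance?",
        "What is the balance of my checking account?",
    ]
    THRESHOLD = 0.8  # illustrative pass bar

    def test_eval_gate(run_agent, score_response):
        # Score the agent's answer to each golden prompt; a failing assertion
        # here fails the pipeline before the change reaches production.
        scores = [score_response(p, run_agent(p)) for p in GOLDEN_PROMPTS]
        average = sum(scores) / len(scores)
        assert average >= THRESHOLD, f"Eval score {average:.2f} is below {THRESHOLD}"
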
00:09:39.820 | Here's just a whole load of traces from that chatbot with some nice red and green numbers. And I want to
00:09:44.620 | highlight these three rows. And these rows match what I was trying to do with the chatbot. Okay. So the
00:09:51.340 | first row here, we have got: what is my account balance? And, sure enough, I don't have access to my account
00:09:56.860 | information. And I've got two metrics here. Action completion and action advancement. Action completion
00:10:03.900 | is did it actually do the thing it was asked to do? So it measures across the whole flow from the
00:10:09.340 | input to the output. Did it actually complete the task that it was asked to do? Action advancement
00:10:15.340 | is did it move forward towards the end goal? And they're two very subtly distinct metrics. Now in the
00:10:22.700 | case of the first one, what's my account balance? I don't know. Don't know anything. Didn't complete,
00:10:28.860 | didn't advance. So we know there's a problem with that one. Second one, what is the balance of my
00:10:34.060 | checking account? Didn't complete. I don't have a balance, but it advanced.
00:10:39.180 | So I can see that yes, it realized that it needs to know the name of the account. So it advanced one
00:10:43.740 | step further. So I can say actually, yes, with this kind of prompt, it advances. And then finally,
00:10:48.700 | when I say yes, my checking account is called checking account, it completed, gave me the results,
00:10:53.740 | and it showed it advanced all the way through. So I can see from these metrics which prompts worked,
00:10:57.580 | which prompts didn't work. And I can use this to continue to improve what I'm doing. Now,
00:11:03.420 | obviously, these numbers are kind of a whole overarching number across the whole thing.
00:11:08.300 | Obviously, I kind of need to have some form of breakdown. So that's what I've got here. This is the
00:11:12.940 | individual trace that comes in. There's a call to an LLM. The LLM decides to call a tool, pulls data, decides to call the
00:11:19.340 | LLM to process that data, and shows it out the other side. And that's showing those steps.
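
One rough way to picture that per-step structure, with field names that are illustrative rather than a particular tracing schema: each step of the flow becomes its own span, and metrics can hang off both the spans and the whole trace.

    from dataclasses import dataclass, field

    @dataclass
    class Span:
        name: str                 # e.g. "llm:plan", "tool:get_balance", "llm:answer"
        input: str
        output: str
        metrics: dict = field(default_factory=dict)   # per-step scores

    @dataclass
    class Trace:
        user_input: str
        spans: list = field(default_factory=list)     # one Span per step in the flow
        metrics: dict = field(default_factory=dict)   # trace-level scores, e.g. action completion

    def first_failing_span(trace: Trace, metric: str, threshold: float = 0.5):
        # Walk the steps in order and return the first one whose score is low,
        # so you know where in the flow it went wrong, not just that it did.
        for span in trace.spans:
            if span.metrics.get(metric, 1.0) < threshold:
                return span
        return None
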
00:11:24.220 | And at each level, I can get whatever metrics are relevant. So I can look at the overarching.
00:11:30.860 | It's red. It's a bad thing because it's red. And then I can dive into each individual step and see why
00:11:37.180 | it's red and look at all those different layers. And that's really, really important. You have to have this
00:11:40.860 | understanding of the architecture of your agentic systems so that you can do this analysis at each
00:11:46.780 | individual level. And then depending on what's happening, you can then farm out the fixing of the
00:11:52.780 | problem to the relevant team. Maybe your RAG application is terrible. Maybe you need to tune
00:11:58.140 | one of your prompts. But by having this level of granularity, you can make those smart decisions
00:12:03.420 | around it. Now, what's also cool is this is a lot of unstructured data. What do we know that is good for
00:12:10.860 | working with unstructured data? AI. Yes. And so what's cool as well is when you start putting an LLM over the top of
00:12:18.860 | this, you can get some really smart insights coming out. So this is some insights that I
00:12:24.220 | generated. Basically, an AI will go over all the data and say: this metric is low, how can I make it better?
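
Under the hood, an insight like the one he shows next presumably starts from an aggregation over the failing traces, something like this sketch; the trace shape and metric names are assumptions.

    from collections import Counter

    def most_common_failure(traces, metric="action_completion", threshold=0.5):
        # Each trace is assumed to look like:
        #   {"spans": [{"name": "llm:plan", "metrics": {"action_completion": 0.2}}, ...]}
        failing_steps = Counter()
        for trace in traces:
            for span in trace["spans"]:
                if span["metrics"].get(metric, 1.0) < threshold:
                    failing_steps[span["name"]] += 1
                    break
        # The top entry is what gets summarized into a human-readable insight,
        # e.g. "the LLM occasionally fails to use the get balance tool".
        return failing_steps.most_common(1)
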
00:12:31.580 | And this is really cool. And this is saying, yeah, the LLM occasionally fails to use the get balance tool when
00:12:36.860 | asked about account balances. And that's basically the fundamental problem. When I say, what is the
00:12:41.580 | balance of my account? What would you expect to happen in a chatbot? What's the balance of my account? What would you expect?
00:12:46.620 | Anyone? Shout out. To get your balance. Exactly. Yeah. And probably if you have multiple accounts,
00:12:53.980 | you would get the balance of all your accounts. Yeah. This is your checking, savings, credit card, 401k,
00:12:59.180 | whatever. And so it kind of makes sense that the way to improve the effectiveness, to get us closer to where we all
00:13:05.020 | have a consensus that this agent is working, would be: if I say, give me the account balance,
00:13:11.660 | it goes to all the accounts and it shows me all the balances. So the suggested action here is adding
00:13:16.700 | explicit instructions to my system message. So not only have I identified this problem through my
00:13:22.700 | evaluations, but I've got a suggestion for fixing it. Now, it's not automatically going to fix it for me
00:13:27.740 | because there'd be dragons in that: what if that mucks up, and then I need to evaluate my automatic
00:13:33.020 | fixes in my evaluations, and the snake swallows its tail? But this has given me suggestions. So the human
00:13:38.540 | in the loop, and that's really important, as a human, I can look at this and say, yeah,
00:13:42.460 | this is the fix that I want to make. Now, I do want to emphasize that whole human in the loop thing is
00:13:49.260 | really, really important. So when you're generating metrics, there's no guarantee the metrics you
00:13:54.860 | generate are actually going to be correct. Because the AI, going back to your question over there,
00:13:59.420 | the AI could get it wrong. And so one thing you want to do is make sure that you're using a system that
00:14:04.060 | has human feedback, like CLHF, continuous learning via human feedback. You want humans to
00:14:09.980 | evaluate the numbers and say, okay, this is actually working. The metric was low. Here's the reason.
00:14:15.820 | Retune and have that continuous training of your metrics because your metrics will never be perfect
00:14:21.100 | out the box. You need this continuous level of training.
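
A sketch of what recording that human feedback on the judge's scores might look like, with the record shape and the retuning rule both assumptions:

    from dataclasses import dataclass

    @dataclass
    class MetricFeedback:
        trace_id: str
        metric: str           # e.g. "action_completion"
        judged_score: float   # what the LLM judge said
        human_score: float    # what the human reviewer said it should be
        reason: str           # why the judge was wrong

    def needs_retuning(feedback: list, tolerance: float = 0.2) -> bool:
        # If the judge and the humans disagree on more than 10% of reviewed
        # traces, the metric is not trustworthy out of the box and should be
        # retuned against the human labels.
        if not feedback:
            return False
        disagreements = [f for f in feedback
                         if abs(f.judged_score - f.human_score) > tolerance]
        return len(disagreements) > 0.1 * len(feedback)
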
00:14:25.180 | So to get this all right, what do we have to do? Step one, add evaluations to your agent. As I said,
00:14:32.620 | the best time to do it is before you even start. The second best time is now. If you don't have
00:14:36.700 | evaluations, get them in now so you can make sure your agent is not making stuff up. You do not want
00:14:41.580 | to be the next Chicago Sun-Times. Then you need to measure precisely what you need. Different tools,
00:14:48.780 | different applications have different measurements of what they need. Do I need to measure whether the
00:14:53.660 | input and outputs are being toxic? Do I need to measure for the hallucinations? Do I need to measure
00:15:00.140 | for a comprehensible output? Do I need to measure for RAG? Do I have some kind of custom measurement that
00:15:05.500 | only I know about that's specific to my use case? You want to be defining those measurements and those
00:15:10.220 | metrics up front as you are thinking about your prompts, your structure of your app, your agents.
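
That up-front definition could be as simple as a config mapping each part of the app to the metrics it needs; the names and shape here are illustrative only.

    EVAL_CONFIG = {
        "chat_input":  ["toxicity"],                           # what users send in
        "retriever":   ["context_relevance"],                  # is RAG pulling the right data?
        "llm_answer":  ["hallucination", "comprehensibility"], # is the answer grounded and readable?
        "whole_trace": ["action_completion", "action_advancement",
                        "my_custom_fintech_check"],            # a use-case-specific metric
    }
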
00:15:15.580 | Right at design time, you think about exactly what we need to measure. And then as you build it,
00:15:21.500 | keep that going all through your production. This is not just a test in dev. Because let's be honest,
00:15:26.300 | when users get hold of your system, they do stuff you don't expect. How many times have you tested
00:15:31.020 | something to the nth degree and it breaks the second a user gets on it? Damn those users. But they do things you
00:15:36.860 | don't expect. And so you have to have this in production as well to make sure that you've got
00:15:41.340 | everything in place. And then you want to have this real-time prevention. You want to have alerting when
00:15:45.740 | it goes wrong. If your AI agent goes rogue, maybe you need to be woken up. So that is how you can tame
00:15:50.940 | AI agents with evaluations. I'm Jim Bennett. I'm a principal developer advocate at Galileo. Come and talk to
00:15:55.340 | us at the booth in the expo if you want to learn more. Scan that if you want to sign up for Galileo.
00:16:00.780 | We have a free offering. But if you want to learn more about it, come meet me at the booth. With that,
00:16:05.420 | thank you very much. And I will take some questions, I believe.