
Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft


Chapters

0:00 Introduction
0:30 Overview
1:23 AI agents are all the rage
2:12 How do you evaluate AI agents
2:40 When to evaluate AI agents
3:26 How to evaluate AI agents
4:15 Manual model evaluation
7:10 Evaluating the whole system
14:33 Scale
16:59 Multimodal Models
18:46 Outro

Whisper Transcript

00:00:00.000 | Well, welcome everyone, I'm happy to be here today, I'm very excited, it's a very hot topic.
00:00:22.280 | So, I am Cedric Vidal, Principal AI Advocate at Microsoft, and today we are going to talk
00:00:30.520 | about how to evaluate agents.
00:00:33.000 | So for those of you who were in this very room for the session just before, my colleagues
00:00:39.520 | presented red teaming, which is how you create data that tries to put your AI in a bad situation
00:00:48.920 | and generate bad content, and verify that it behaves correctly.
00:00:53.480 | Today, in this session, we are going to look at more traditional, normal types of evaluations
00:01:02.280 | when you have a dataset that you want to evaluate your AI agents on.
00:01:07.920 | So, we're going to look at a bunch of things on how to make sure your AIs are safe.
00:01:14.840 | So, I see that people are still coming in the room, it's okay, please come in, don't be afraid.
00:01:21.400 | So, we don't want that, right?
00:01:29.960 | AI agents are all the rage.
00:01:31.960 | To be honest, every single day, even as I was preparing this very presentation and I was trying
00:01:37.960 | the latest models and the latest SDKs, I'm always amazed at the progress that those agents are making.
00:01:47.240 | But of course, the more agency we give to them, the more independent they become,
00:01:54.760 | and the more the risk of creating havoc increases.
00:01:59.640 | So, let's see how we can make sure that your AI agents behave correctly and do not create that kind of mess.
00:02:09.800 | So, how do you go about evaluating your AI agents?
00:02:19.080 | Do you submit a couple of prompts to validate that the models respond correctly and go, "Yeah, well, that checks out,
00:02:26.040 | that should go," and put it in production?
00:02:28.360 | Or do you take a more methodical approach?
00:02:32.360 | If you are doing the former, then I have some news for you.
00:02:36.200 | You are in the right place, and you need to change something.
00:02:39.720 | It's not going to work.
00:02:41.640 | If you are doing the latter, then today I have some frameworks to show you
00:02:48.760 | which might help you improve your evaluation process.
00:02:53.240 | So, when should you start doing evaluations?
00:02:56.600 | You may be wondering what evaluation is, or when it should occur.
00:03:02.520 | And, I mean, if you have already built an app and you're asking yourself, "Should I evaluate now?"
00:03:09.720 | Well, good news, I mean, or bad news, you're a bit late.
00:03:13.080 | You should have started way earlier.
00:03:15.080 | Evaluation starts at the very beginning of your AI development project.
00:03:20.920 | The sooner, the better.
00:03:24.520 | So, to get a sense of how to approach the subject of AI agent evaluation, we distinguish four layers.
00:03:39.320 | First, you have the model and the safety system, which are platform-level protections.
00:03:45.320 | And these are built into Azure.
00:03:47.080 | You don't have to do anything about them when using models on Azure.
00:03:51.800 | And then you have system message and grounding,
00:03:55.240 | and on top of that, user experience.
00:03:57.160 | For those parts, that's where your app design matters the most.
00:04:01.240 | The key takeaway: the foundation model is just one part.
00:04:06.760 | Real safety comes from layering smart mitigations at the application layer.
00:04:12.520 | And we're going to see how to do that.
00:04:14.360 | The first thing you should do is manual model evaluation.
00:04:19.720 | So, which model do you want to use for your AI agent?
00:04:23.400 | You want to get a clear sense of how different models will respond to a given prompt,
00:04:31.800 | something automatic metrics can sometimes miss.
00:04:35.560 | When you launch a batch of evaluations on a dataset,
00:04:42.440 | sometimes you get a big average score.
00:04:46.440 | And you might be left wondering, okay, but I'm not sure exactly how it behaves
00:04:51.080 | for a very specific example.
00:04:53.160 | Before evaluating at scale, you first need to cherry-pick and look at specific examples.
00:05:01.400 | So, now, I'm going to demo how to do that in VS Code.
00:05:12.280 | So, the first thing here, when I look at my history: in VS Code, there is a
00:05:23.160 | relatively new extension called AI Toolkit, which was released at Build, I believe.
00:05:30.280 | And, oh my god, I love that plugin.
00:05:32.120 | Before, I used to go to different websites all over the web to evaluate models and compare.
00:05:38.280 | I mean, you had GitHub Models, but now you can do it right from your development environment.
00:05:44.280 | And if you're like me, and you like to code, that's where I like to do things.
00:05:47.240 | AI Toolkit, yes.
00:05:50.600 | So, you can ask a question; I did ask that question already:
00:05:57.160 | what's a good panna cotta recipe with salted caramel butter, which is my favorite?
00:06:01.720 | And then you get a pretty good response with 4.1, but what if you want to compare with 4o, for example?
00:06:08.760 | So, what's a good recipe for panna cotta with salted caramel butter?
00:06:24.600 | And then you can see, side by side, how the two models respond: 4.1 on the left and 4o on the right.
00:06:33.480 | And as you can see, 4.1 is a major improvement in terms of throughput. You're going to get the answer much faster.
00:06:40.440 | When it comes to the quality of the answer, so I looked at it ahead of the conference, and to be honest,
00:06:47.240 | I prefer the 4.1 answer. 4o is not too bad, but I mean, 4.1 is so much faster that usually that's what you're going to use.
00:06:54.520 | So, that's for
00:06:57.640 | spot-checking the answer of a foundation model without any customization. We don't have an AI agent yet.
00:07:07.080 | Then,
00:07:09.000 | you want to evaluate the whole system. So, that's where
00:07:14.920 | we are going to actually build an AI agent and evaluate it as a whole, with a systemic approach.
00:07:23.080 | Once you have selected the model, it's time to evaluate it end to end.
00:07:28.040 | And so, let's jump in and let me show you how that works in VS Code. So, same.
00:07:36.840 | That same AI Toolkit extension for VS Code.
00:07:40.360 | Wow. I mean, to be honest, I love it.
00:07:43.560 | Because now you can build an AI agent super fast and evaluate it super fast too. So, here, ahead of time,
00:07:51.240 | I prepared an AI agent to extract agenda and event information from web pages.
00:08:00.360 | For me, as an advocate, I do this kind of talk pretty often, and I need that information;
00:08:04.040 | I basically created an AI agent that helps me easily fetch information from the web and pull
00:08:10.680 | the names, the list of talks and speakers, the number of attendees, that kind of thing.
00:08:17.160 | And it's super easy to do. So, I'm going to show you how to create a new agent real quick.
00:08:23.160 | And you have an example here with a web scraper.
00:08:26.520 | And it automatically generates a system prompt saying, "Hey, you are a web exploration assistant
00:08:33.640 | that can navigate to websites." It's going to configure an MCP server
00:08:39.240 | ready to use. And if I run it, it's going to start the Playwright MCP server.
00:08:52.360 | By default, it uses an example domain.
00:08:57.160 | And it will extract, as you can see in the background, information about the website.
00:09:05.560 | Now, I'm going to switch back to the agent that I created.
00:09:09.000 | Because the one I just showed you is the built-in one.
00:09:11.480 | So, this one I created. And I'm going to use GPT-4.1.
00:09:18.280 | And this one is more focused. What I want is to extract the name, date, location,
00:09:24.520 | and number of attendees in a specific format.
00:09:27.800 | And for that website, which is a Luma event page. So, run.
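
For reference, the kind of instructions behind such an agent might look like the following sketch; this is an illustrative paraphrase, not the exact prompt from the demo:

```python
# An illustrative paraphrase of the kind of system prompt behind the
# customized event-extraction agent (not the one from the demo).
SYSTEM_PROMPT = """\
You are an event information extraction assistant. Given the URL of an
event page, navigate to it with the browser tools and extract the event
name, date, location, and number of registered attendees. If a field is
missing from the page, follow links to related event pages (for example,
a Luma registration page) to find it. Answer in exactly this format:

Name: <event name>
Date: <date>
Location: <location>
Attendees: <number registered>
"""
```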
00:09:36.360 | So, what I did is take the automatically generated sample AI agent that was
00:09:42.040 | created by AI Toolkit and customize it for my use case.
00:09:49.000 | And here, you can see the AI agent working and piloting Playwright, going to the webpage, extracting the
00:10:00.280 | information, and giving me the response. So, the event is AI agents and startup talks at GitHub. Location is
00:10:07.880 | GitHub headquarters in San Francisco on June 11th. And for now, we have 269 people that registered. And I hope
00:10:16.520 | that after doing the demo, we're going to have more. Because that's an event that I co-organized
00:10:21.720 | in San Francisco.
00:10:27.000 | So, now that we have built, customized, and spot-checked what our
00:10:36.280 | AI agent does for a specific input, let's see how we can evaluate it on multiple inputs.
00:10:42.680 | So, you have a tab here called Evaluation, which allows you to take that previously configured AI agent
00:10:55.080 | and execute it on a dataset. So here, I can click Run All.
00:11:06.200 | Okay. And in the background, it's going to run the agent on those inputs and give us the answer in the
00:11:13.400 | response column. As you can see, I had executed it before, so you can see what the previous answer was.
00:11:19.240 | But what's cool here is that you can take that answer and have a look at it. And as you can see,
00:11:26.120 | the information is correctly extracted. What's interesting is that the webpage here, by the way, does not contain the
00:11:33.320 | number of attendees. Still, we have an answer here, and it's interesting because
00:11:39.160 | the agent actually went to the Reactor page, found the link to the Luma page, navigated to the Luma page,
00:11:49.640 | and on the Luma page, we have the number of attendees. So it mixed the information from the Reactor event page
00:11:57.320 | and the Luma page to collect everything I needed in order to get my answer. Okay. So that was a side note.
00:12:06.360 | And I mean, I love it. In both cases, those are good answers. So, we can manually evaluate whether it's a
00:12:13.400 | thumbs-up or a thumbs-down. And then we can do a few things. We can export the dataset to a JSON file.
00:12:21.400 | I'm not going to do it, but it's basically a JSON Lines file with the results of the evaluation that you can
00:12:28.680 | then re-inject into a more automated system.
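
As a rough idea of what you can do with that export, here is a minimal sketch of loading it back in Python. The file name and the "rating" column are assumptions; the actual columns depend on the AI Toolkit version and your dataset.

```python
import json

# Load the evaluation run exported by AI Toolkit (one JSON object per line).
with open("agent_eval_export.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Keep only the runs rated thumbs-up during manual review
# ("rating" is a hypothetical column name; inspect your own export).
good = [r for r in rows if r.get("rating") == "thumbs up"]
print(f"{len(good)}/{len(rows)} responses rated good")
```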
00:12:38.920 | And then, once you have your agent like this, you can click View Code and generate code using whichever
00:12:45.640 | framework you prefer. OpenAI Agents is usually the one people want to use these days. And then you have a fully
00:12:55.640 | configured agent with the MCP server and boilerplate code to run your agents. So let me close that and move on.
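
A minimal sketch of what that generated code might look like, using the OpenAI Agents SDK with the Playwright MCP server; the exact boilerplate AI Toolkit emits will differ, and the agent name and URL below are illustrative:

```python
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio

SYSTEM_PROMPT = "You are an event information extraction assistant..."  # see the prompt sketch above

async def main() -> None:
    # Start the Playwright MCP server so the agent can drive a browser.
    async with MCPServerStdio(
        params={"command": "npx", "args": ["@playwright/mcp@latest"]}
    ) as playwright:
        agent = Agent(
            name="event-scraper",
            model="gpt-4.1",
            instructions=SYSTEM_PROMPT,
            mcp_servers=[playwright],
        )
        result = await Runner.run(
            agent, "Extract the event details from https://lu.ma/<event-id>"
        )
        print(result.final_output)

asyncio.run(main())
```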
00:13:03.240 | Okay. So, we've seen how to build and manually evaluate our AI agents on a spot example, and how to run them
00:13:12.840 | on a small batch of examples locally. Then how do you scale beyond a few samples?
00:13:21.320 | Let me move on to the next slide. And yes, I will share it.
00:13:38.360 | Okay. So, we've seen AI Toolkit. So how do we scale beyond what I just showed? Because,
00:13:47.240 | okay, eyeballing is great to get a sense of how it works, but what you want to do is
00:13:53.160 | go through a more thorough, wider range of checks, and you want to automate this. So,
00:13:59.960 | well, Azure AI Foundry gives you a wide set of built-in evaluators to automate and scale those
00:14:07.160 | evaluations. We've got AI-assisted quality checks, like groundedness, fluency, coherence, perfect for
00:14:13.640 | measuring how well your agent performs in realistic conversations. You also find classic NLP metrics,
00:14:19.640 | F1, BLEU, ROUGE, for benchmark comparisons, as well as a suite of AI-assisted risk and
00:14:29.000 | safety evaluators. And you can also customize and build your own.
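
To illustrate that last point, here is a minimal sketch of a custom evaluator, assuming the azure-ai-evaluation SDK's convention that any callable taking the row's fields as keyword arguments and returning a dict of scores can be plugged in; the metric itself is a toy example:

```python
# Toy custom evaluator: flags answers that are suspiciously short.
# Pass an instance to evaluate() alongside the built-in evaluators,
# e.g. evaluators={"answer_length": AnswerLengthEvaluator(), ...}.
class AnswerLengthEvaluator:
    def __call__(self, *, response: str, **kwargs) -> dict:
        words = len(response.split())
        return {"answer_length": words, "too_short": words < 5}
```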
00:14:38.360 | Once you've spot-checked, the next step is to scale. And for that, you need automated evaluation
00:14:45.560 | to measure quality at scale. You can do it either in the Azure AI Foundry portal or via code,
00:14:51.800 | and I'm going to show you how to do it via code. It's important, because we can define what we
00:14:57.000 | want to measure based on our app's goals. So, now, demo. Crazy how fast 20 minutes goes.
00:15:07.480 | So, here's a notebook. And given the time we have, I'm not going to execute it, because it takes a bit
00:15:13.480 | of time. But here you have the Python code, and I'm going to share the link to the notebook at the end of the
00:15:19.160 | presentation. You have the notebook that allows you to programmatically
00:15:24.760 | connect to an Azure AI Foundry project and run those evaluations. The key
00:15:34.040 | thing here is what you can define. So those are quality evaluators, to evaluate relevance,
00:15:40.360 | coherence, groundedness, fluency, and similarity. And you have an evaluate function that takes
00:15:46.040 | those evaluators, takes the dataset that you want to evaluate,
00:15:54.280 | and bulk-evaluates the AI agent on all those metrics.
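
In outline, the notebook's key call looks something like the following sketch with the azure-ai-evaluation package; the endpoint, deployment, and file names are placeholders, and the actual notebook may differ:

```python
from azure.ai.evaluation import (
    CoherenceEvaluator,
    FluencyEvaluator,
    GroundednessEvaluator,
    RelevanceEvaluator,
    SimilarityEvaluator,
    evaluate,
)

# The judge model used by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4.1",
}

# The dataset is a JSONL file; SimilarityEvaluator also needs a
# ground_truth column, and GroundednessEvaluator a context column.
result = evaluate(
    data="camping_qa.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "groundedness": GroundednessEvaluator(model_config),
        "fluency": FluencyEvaluator(model_config),
        "similarity": SimilarityEvaluator(model_config),
    },
)
print(result["metrics"])  # aggregate scores across the dataset
```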
00:16:05.560 | And the results are here. So, on that dataset here, which is about camping, like what is the capital of
00:16:10.520 | France, which tent is the most waterproof, what camping table, whatever. You can see, for each question
00:16:19.560 | here, the results of the evaluation, for which you can also configure a threshold. So it's going to
00:16:24.440 | give you an answer between one and five, and whether that passes depends on the threshold you configure, because
00:16:29.400 | depending on your application, you might want your AI agent to be more or less strict: whether you're in the
00:16:35.000 | gaming industry, where usually they accept more, like, violent content, or whether
00:16:41.400 | you are doing an application for kids, obviously the threshold is not going to be the same.
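
Continuing from the `result` returned by the evaluate() sketch above, here is a minimal sketch of applying your own pass/fail threshold to the 1-to-5 scores; the row key names are assumptions, so inspect `result["rows"]` for the exact column names in your SDK version:

```python
PASS_THRESHOLD = 4  # stricter for a kids' app, looser for a violent video game

for row in result["rows"]:
    # Column names follow the "inputs.<field>" / "outputs.<evaluator>.<metric>"
    # convention (an assumption; verify against your own output).
    score = row["outputs.relevance.relevance"]
    verdict = "pass" if score >= PASS_THRESHOLD else "fail"
    print(f"{row['inputs.query'][:40]!r}: relevance={score} -> {verdict}")
```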
00:16:47.880 | And so, I'm going to move on to the next one. So, in this case, this was passing. I just want
00:16:55.000 | to show the next evaluators that we have. Also very cool: now you can evaluate
00:17:04.920 | multimodal models, mixing text and images. This is very important, and it works
00:17:12.840 | for multi-turn conversations too. So here, I have an image, on purpose. I tried to find a violent image,
00:17:22.440 | and it's hard to find something violent that you can show at a conference publicly, right? So,
00:17:28.600 | I did what I could, and I spent a lot of time looking, and believe me, when you search for something
00:17:32.520 | violent on the web, you see things you don't want to see. And so I found that one, and let's go
00:17:39.160 | straight to the end and see what it tells us. So, the system's response, blah, blah, blah: the image
00:17:46.600 | actually depicts a character with numerous pins or nails protruding from the head, which is a graphic
00:17:52.680 | and violent depiction. But what's interesting: the score is four. It's not, like, five. It's not the max.
00:17:58.440 | So it's failing. But if you were doing, like I said, a video game
00:18:05.640 | with violent content, you could increase the threshold to four and say, hey, I have a four, I'm fine with it,
00:18:10.520 | in order to be able to generate that kind of image.
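
For reference, here is a sketch of what running a safety evaluator over a mixed text-and-image conversation can look like; this assumes a recent azure-ai-evaluation version whose safety evaluators accept image content in conversations, and all project values and URLs are placeholders:

```python
from azure.ai.evaluation import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# The Azure AI Foundry project that backs the safety evaluation service.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<foundry-project>",
}

violence = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

# A multi-turn, multimodal conversation: the user sends an image,
# the assistant describes it.
conversation = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/image.png"}},
        ]},
        {"role": "assistant", "content": [
            {"type": "text",
             "text": "A character with numerous pins protruding from the head."},
        ]},
    ]
}

# Returns a severity label, a numeric score, and the evaluator's reasoning.
print(violence(conversation=conversation))
```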
00:18:19.400 | And at the end, what's interesting, I was going to show you on another one. Okay, I'm going to move on.
00:18:26.040 | I showed you that already. Okay, I don't have it;
00:18:32.920 | I wanted to show you something else.
00:18:40.680 | Okay. You also have an evaluator to... oh, I think I'm out of time, sadly. Okay. So here are
00:18:48.120 | some links to more information. We have, on GitHub, Azure AI Foundry Discussions,
00:18:56.200 | where you can come and ask questions about that evaluation SDK, how to build AI agents, and how
00:19:02.760 | to evaluate them. You have the Azure AI Foundry Discord too, where you can come and discuss if you
00:19:08.840 | prefer Discord. And then, at the very end, you have my contact information
00:19:14.040 | if you want to reach out with more questions. So yeah, very packed, sorry, a lot to say
00:19:21.000 | in very little time. So thank you very much. I'm here if you have more questions.
00:19:25.400 | How are you sharing the slides?
00:19:31.160 | That's a good question. I'm going to put them on the Discord.
00:19:37.080 | And in the middle, you have our Discord server. So you can come on
00:19:48.920 | the Discord server, and I will post it there. Thank you very much.