
Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft


Chapters

0:00 Introduction
0:30 Overview
1:23 AI agents are all the rage
2:12 How do you evaluate AI agents
2:40 When to evaluate AI agents
3:26 How to evaluate AI agents
4:15 Manual model evaluation
7:10 Evaluating the whole system
14:33 Scale
16:59 Multimodal Models
18:46 Outro

Whisper Transcript

00:00:00.000 | Well, welcome everyone, I'm happy to be here today, I'm very excited, it's a very hot topic.
00:00:22.280 | So, I am Cedric Vidal, Principal AI Advocate at Microsoft, and today we are going to talk
00:00:30.520 | about how to evaluate agents.
00:00:33.000 | So for those of you who were in this very room for the session just before, my colleagues
00:00:39.520 | presented red teaming, which is how you create data that tries to put your AI in a bad situation
00:00:48.920 | and generate bad content, and verify that it behaves correctly.
00:00:53.480 | Today, in this session, we are going to look at more traditional, normal types of evaluations
00:01:02.280 | when you have a dataset that you want to evaluate your AI agents on.
00:01:07.920 | So, we're going to look at a bunch of things on how to make sure your AIs are safe.
00:01:14.840 | So, I see that people are still coming in the room, it's okay, please come in, don't be afraid.
00:01:21.400 | So, we don't want that, right?
00:01:29.960 | AI agents are all the rage.
00:01:31.960 | To be honest, every single day, even as I was preparing this very presentation and I was trying
00:01:37.960 | the latest models and the latest SDKs, I'm always amazed at the progress that those agents are making.
00:01:47.240 | But of course, the more agency we give to them, the more independent they become,
00:01:54.760 | and the more the risk of creating havoc increases.
00:01:59.640 | So, let's see how we can make sure that your AI agents behave correctly and do not create that kind of mess.
00:02:09.800 | So, how do you go about evaluating your AI agents?
00:02:19.080 | Do you submit a couple of prompts to validate that the models respond correctly and go, "Yeah, well, that checks out,
00:02:26.040 | that should go," and put it in production?
00:02:28.360 | Or do you take a more methodical approach?
00:02:32.360 | If you are doing the former, then I have some news for you.
00:02:36.200 | You are in the right place, and you need to change something.
00:02:39.720 | It's not going to work.
00:02:41.640 | If you are doing the latter, then today I have some frameworks to show you
00:02:48.760 | which might help you improve your evaluation process.
00:02:53.240 | So, when should you start doing evaluations?
00:02:56.600 | You may be wondering what evaluation is, or when it should occur.
00:03:02.520 | And, I mean, if you have already built an app and you're asking yourself, "Should I evaluate now?"
00:03:09.720 | Well, good news, I mean, or bad news, you're a bit late.
00:03:13.080 | You should have started way earlier.
00:03:15.080 | Evaluation starts at the very beginning of your AI development project.
00:03:20.920 | The sooner, the better.
00:03:24.520 | So, to get a sense of how to approach the subject of AI agent evaluation, we distinguish four layers.
00:03:39.320 | First, you have the model and the safety system, which are platform-level protections.
00:03:45.320 | And these are built into Azure.
00:03:47.080 | You don't have to do anything about them when using models on Azure.
00:03:51.800 | And then you have system message and grounding,
00:03:55.240 | and on top of that, user experience.
00:03:57.160 | For those parts, that's where your app design matters the most.
00:04:01.240 | The key takeaway: the foundation model is just one part.
00:04:06.760 | Real safety comes from layering smart mitigations at the application layer.
00:04:12.520 | And we're going to see how to do that.
00:04:14.360 | The first thing you should do is manual model evaluation.
00:04:19.720 | So, which model do you want to use for your AI agent?
00:04:23.400 | You want to get a clear sense of how different models will respond to a given prompt,
00:04:31.800 | something automatic metrics can sometimes miss.
00:04:35.560 | When you launch a batch of evaluations on a dataset,
00:04:42.440 | sometimes you get a big average score.
00:04:46.440 | And you might be left wondering, okay, but I'm not sure exactly how it behaves
00:04:51.080 | for a very specific example.
00:04:53.160 | Before evaluating at scale, you first need to cherry-pick and look at specific examples.
00:05:01.400 | So, now, I'm going to demo how to do that in VS Code.
00:05:12.280 | So, the first thing here, when I look at my history: in VS Code, there is a
00:05:23.160 | relatively new extension called AI Toolkit, which was released at Build, I believe.
00:05:30.280 | And, oh my god, I love that plugin.
00:05:32.120 | Before, I used to go to different websites all over the web to evaluate models and compare.
00:05:38.280 | I mean, you had GitHub Models, but now you can do it right from your development environment.
00:05:44.280 | And if you're like me, and you like to code, that's where I like to do things.
00:05:47.240 | AI Toolkit, yes.
00:05:50.600 | So, you can ask a question; I did ask that question already:
00:05:57.160 | what's a good panna cotta recipe with salted caramel butter, which is my favorite?
00:06:01.720 | And then you get a pretty good response with 4.1, but what if you want to compare with 4o, for example?
00:06:08.760 | So, what's a good recipe for panna cotta with salted caramel butter?
00:06:24.600 | And then you can see, side by side, how the two models respond: 4.1 on the left and 4o on the right.
00:06:33.480 | And as you can see, 4.1 is a major improvement in terms of throughput. You're going to get the answer much faster.
00:06:40.440 | When it comes to the quality of the answer, so I looked at it ahead of the conference, and to be honest,
00:06:47.240 | I prefer the 4.1 answer. 4o is not too bad, but I mean, 4.1 is so much faster that usually that's what you're going to use.
00:06:54.520 | So, that's for
00:06:57.640 | spot-checking the answer of a foundation model without any customization. We don't have an AI agent yet.
00:07:07.080 | Then,
00:07:09.000 | you want to evaluate the whole system. So, that's where
00:07:14.920 | we are going to actually build an AI agent and evaluate it as a whole, with a systemic approach.
00:07:23.080 | Once you have selected the model, it's time to evaluate it end to end.
00:07:28.040 | And so, let's jump in and let me show you how that works in VS Code. So, same.
00:07:36.840 | That same AI Toolkit extension for VS Code.
00:07:40.360 | Wow. I mean, to be honest, I love it.
00:07:43.560 | Because now you can build an AI agent super fast and evaluate it super fast too. So, here, ahead of time,
00:07:51.240 | I prepared an AI agent to extract agenda and event information from web pages.
00:08:00.360 | For me, as an advocate, I do this kind of talk pretty often, and I need that information;
00:08:04.040 | I basically created an AI agent that helps me easily fetch information from the web and pull
00:08:10.680 | the names, the list of talks and speakers, the number of attendees, that kind of thing.
00:08:17.160 | And it's super easy to do. So, I'm going to show you how to create a new agent real quick.
00:08:23.160 | And you have an example here with a web scraper.
00:08:26.520 | And it automatically generates a system prompt saying, "Hey, you are a web exploration assistant
00:08:33.640 | that can navigate to websites." It's going to configure an MCP server
00:08:39.240 | ready to use. And if I run it, it's going to start the Playwright MCP server.
00:08:52.360 | By default, it uses an example domain.
00:08:57.160 | And it will extract, as you can see in the background, information about the website.
00:09:05.560 | Now, I'm going to switch back to the agent that I created.
00:09:09.000 | Because the one I just showed you is the built-in one.
00:09:11.480 | So, this one I created. And I'm going to use GPT-4.1.
00:09:18.280 | And this one is more focused. What I want is to extract the name, date, location,
00:09:24.520 | and number of attendees in a specific format.
00:09:27.800 | And for that website, which is a Luma event page. So, run.
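
For reference, the kind of instructions behind such an agent might look like the following sketch; this is an illustrative paraphrase, not the exact prompt from the demo:

```python
# An illustrative paraphrase of the kind of system prompt behind the
# customized event-extraction agent (not the one from the demo).
SYSTEM_PROMPT = """\
You are an event information extraction assistant. Given the URL of an
event page, navigate to it with the browser tools and extract the event
name, date, location, and number of registered attendees. If a field is
missing from the page, follow links to related event pages (for example,
a Luma registration page) to find it. Answer in exactly this format:

Name: <event name>
Date: <date>
Location: <location>
Attendees: <number registered>
"""
```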
00:09:36.360 | So, what I did is take the automatically generated sample AI agent that was
00:09:42.040 | created by AI Toolkit and customize it for my use case.
00:09:49.000 | And here, you can see the AI agent working and piloting Playwright, going to the webpage, extracting the
00:10:00.280 | information, and giving me the response. So, the event is AI agents and startup talks at GitHub. Location is
00:10:07.880 | GitHub headquarters in San Francisco on June 11th. And for now, we have 269 people that registered. And I hope
00:10:16.520 | that after doing the demo, we're going to have more. Because that's an event that I co-organized
00:10:21.720 | in San Francisco.
00:10:27.000 | So, now that we have built, customized, and spot-checked what our
00:10:36.280 | AI agent does for a specific input, let's see how we can evaluate it on multiple inputs.
00:10:42.680 | So, you have a tab here called Evaluation, which allows you to take that previously configured AI agent
00:10:55.080 | and execute it on a dataset. So here, I can click Run All.
00:11:06.200 | Okay. And in the background, it's going to run the agent on those inputs and give us the answer in the
00:11:13.400 | response column. As you can see, I had executed it before, so you can see what the previous answer was.
00:11:19.240 | But what's cool here is that you can take that answer and have a look at it. And as you can see,
00:11:26.120 | the information is correctly extracted. What's interesting is that the webpage here, by the way, does not contain the
00:11:33.320 | number of attendees. Still, we have an answer here, and it's interesting because
00:11:39.160 | the agent actually went to the Reactor page, found the link to the Luma page, navigated to the Luma page,
00:11:49.640 | and on the Luma page, we have the number of attendees. So it mixed the information from the Reactor event page
00:11:57.320 | and the Luma page to collect everything I needed in order to get my answer. Okay. So that was a side note.
00:12:06.360 | And I mean, I love it. In both cases, those are good answers. So, we can manually evaluate whether it's a
00:12:13.400 | thumbs-up or a thumbs-down. And then we can do a few things. We can export the dataset to a JSON file.
00:12:21.400 | I'm not going to do it, but it's basically a JSON Lines file with the results of the evaluation that you can
00:12:28.680 | then re-inject into a more automated system.
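
As a rough idea of what you can do with that export, here is a minimal sketch of loading it back in Python. The file name and the "rating" column are assumptions; the actual columns depend on the AI Toolkit version and your dataset.

```python
import json

# Load the evaluation run exported by AI Toolkit (one JSON object per line).
with open("agent_eval_export.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Keep only the runs rated thumbs-up during manual review
# ("rating" is a hypothetical column name; inspect your own export).
good = [r for r in rows if r.get("rating") == "thumbs up"]
print(f"{len(good)}/{len(rows)} responses rated good")
```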
00:12:38.920 | And then, once you have your agent like this, you can click View Code and generate code using whichever
00:12:45.640 | framework you prefer. OpenAI Agents is usually the one people want to use these days. And then you have a fully
00:12:55.640 | configured agent with the MCP server and boilerplate code to run your agents. So let me close that and move on.
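
A minimal sketch of what that generated code might look like, using the OpenAI Agents SDK with the Playwright MCP server; the exact boilerplate AI Toolkit emits will differ, and the agent name and URL below are illustrative:

```python
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio

SYSTEM_PROMPT = "You are an event information extraction assistant..."  # see the prompt sketch above

async def main() -> None:
    # Start the Playwright MCP server so the agent can drive a browser.
    async with MCPServerStdio(
        params={"command": "npx", "args": ["@playwright/mcp@latest"]}
    ) as playwright:
        agent = Agent(
            name="event-scraper",
            model="gpt-4.1",
            instructions=SYSTEM_PROMPT,
            mcp_servers=[playwright],
        )
        result = await Runner.run(
            agent, "Extract the event details from https://lu.ma/<event-id>"
        )
        print(result.final_output)

asyncio.run(main())
```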
00:13:03.240 | Okay. So, we've seen how to build and manually evaluate our AI agents on a spot example, and how to run them
00:13:12.840 | on a small batch of examples locally. Then how do you scale beyond a few samples?
00:13:21.320 | Let me move on to the next slide. And yes, I will share it.
00:13:38.360 | Okay. So, we've seen AI Toolkit. So how do we scale beyond what I just showed? Because,
00:13:47.240 | okay, eyeballing is great to get a sense of how it works, but what you want to do is
00:13:53.160 | go through a more thorough, wider range of checks, and you want to automate this. So,
00:13:59.960 | well, Azure AI Foundry gives you a wide set of built-in evaluators to automate and scale those
00:14:07.160 | evaluations. We've got AI-assisted quality checks, like groundedness, fluency, coherence, perfect for
00:14:13.640 | measuring how well your agent performs in realistic conversations. You also find classic NLP metrics,
00:14:19.640 | F1, BLEU, ROUGE, for benchmark comparisons, as well as a suite of AI-assisted risk and
00:14:29.000 | safety evaluators. And you can also customize and build your own.
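
To illustrate that last point, here is a minimal sketch of a custom evaluator, assuming the azure-ai-evaluation SDK's convention that any callable taking the row's fields as keyword arguments and returning a dict of scores can be plugged in; the metric itself is a toy example:

```python
# Toy custom evaluator: flags answers that are suspiciously short.
# Pass an instance to evaluate() alongside the built-in evaluators,
# e.g. evaluators={"answer_length": AnswerLengthEvaluator(), ...}.
class AnswerLengthEvaluator:
    def __call__(self, *, response: str, **kwargs) -> dict:
        words = len(response.split())
        return {"answer_length": words, "too_short": words < 5}
```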
00:14:38.360 | Once you've spot-checked, the next step is to scale. And for that, you need automated evaluation
00:14:45.560 | to measure quality at scale. You can do it either in the Azure AI Foundry portal or via code,
00:14:51.800 | and I'm going to show you how to do it via code. It's important, because we can define what we
00:14:57.000 | want to measure based on our app's goals. So, now, demo. Crazy how fast 20 minutes goes.
00:15:07.480 | So, here's a notebook. And given the time we have, I'm not going to execute it, because it takes a bit
00:15:13.480 | of time. But here you have the Python code, and I'm going to share the link to the notebook at the end of the
00:15:19.160 | presentation. You have the notebook that allows you to programmatically
00:15:24.760 | connect to an Azure AI Foundry project and run those evaluations. The key
00:15:34.040 | thing here is what you can define. So those are quality evaluators, to evaluate relevance,
00:15:40.360 | coherence, groundedness, fluency, and similarity. And you have an evaluate function that takes
00:15:46.040 | those evaluators, takes the dataset that you want to evaluate,
00:15:54.280 | and bulk-evaluates the AI agent on all those metrics.
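
In outline, the notebook's key call looks something like the following sketch with the azure-ai-evaluation package; the endpoint, deployment, and file names are placeholders, and the actual notebook may differ:

```python
from azure.ai.evaluation import (
    CoherenceEvaluator,
    FluencyEvaluator,
    GroundednessEvaluator,
    RelevanceEvaluator,
    SimilarityEvaluator,
    evaluate,
)

# The judge model used by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4.1",
}

# The dataset is a JSONL file; SimilarityEvaluator also needs a
# ground_truth column, and GroundednessEvaluator a context column.
result = evaluate(
    data="camping_qa.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "groundedness": GroundednessEvaluator(model_config),
        "fluency": FluencyEvaluator(model_config),
        "similarity": SimilarityEvaluator(model_config),
    },
)
print(result["metrics"])  # aggregate scores across the dataset
```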
00:16:05.560 | And the results are here. So, on that dataset here, which is about camping, like what is the capital of
00:16:10.520 | France, which tent is the most waterproof, what camping table, whatever. You can see, for each question
00:16:19.560 | here, the results of the evaluation, for which you can also configure a threshold. So it's going to
00:16:24.440 | give you an answer between one and five, and whether that passes depends on the threshold you configure, because
00:16:29.400 | depending on your application, you might want your AI agent to be more or less strict: whether you're in the
00:16:35.000 | gaming industry, where usually they accept more, like, violent content, or whether
00:16:41.400 | you are doing an application for kids, obviously the threshold is not going to be the same.
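
Continuing from the `result` returned by the evaluate() sketch above, here is a minimal sketch of applying your own pass/fail threshold to the 1-to-5 scores; the row key names are assumptions, so inspect `result["rows"]` for the exact column names in your SDK version:

```python
PASS_THRESHOLD = 4  # stricter for a kids' app, looser for a violent video game

for row in result["rows"]:
    # Column names follow the "inputs.<field>" / "outputs.<evaluator>.<metric>"
    # convention (an assumption; verify against your own output).
    score = row["outputs.relevance.relevance"]
    verdict = "pass" if score >= PASS_THRESHOLD else "fail"
    print(f"{row['inputs.query'][:40]!r}: relevance={score} -> {verdict}")
```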
00:16:47.880 | And so, I'm going to move on to the next one. So, in this case, this was passing. I just want
00:16:55.000 | to show the next evaluators that we have. Also very cool: now you can evaluate
00:17:04.920 | multimodal models, mixing text and images. This is very important, and it works
00:17:12.840 | for multi-turn conversations too. So here, I have an image, on purpose. I tried to find a violent image,
00:17:22.440 | and it's hard to find something violent that you can show at a conference publicly, right? So,
00:17:28.600 | I did what I could, and I spent a lot of time looking, and believe me, when you search for something
00:17:32.520 | violent on the web, you see things you don't want to see. And so I found that one, and let's go
00:17:39.160 | straight to the end and see what it tells us. So, the system's response, blah, blah, blah: the image
00:17:46.600 | actually depicts a character with numerous pins or nails protruding from the head, which is a graphic
00:17:52.680 | and violent depiction. But what's interesting: the score is four. It's not, like, five. It's not the max.
00:17:58.440 | So it's failing. But if you were doing, like I said, a video game
00:18:05.640 | with violent content, you could increase the threshold to four and say, hey, I have a four, I'm fine with it,
00:18:10.520 | in order to be able to generate that kind of image.
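
For reference, here is a sketch of what running a safety evaluator over a mixed text-and-image conversation can look like; this assumes a recent azure-ai-evaluation version whose safety evaluators accept image content in conversations, and all project values and URLs are placeholders:

```python
from azure.ai.evaluation import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# The Azure AI Foundry project that backs the safety evaluation service.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<foundry-project>",
}

violence = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

# A multi-turn, multimodal conversation: the user sends an image,
# the assistant describes it.
conversation = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/image.png"}},
        ]},
        {"role": "assistant", "content": [
            {"type": "text",
             "text": "A character with numerous pins protruding from the head."},
        ]},
    ]
}

# Returns a severity label, a numeric score, and the evaluator's reasoning.
print(violence(conversation=conversation))
```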
00:18:19.400 | And at the end, what's interesting, I was going to show you on another one. Okay, I'm going to move on.
00:18:26.040 | I showed you that already. Okay, I don't have it;
00:18:32.920 | I wanted to show you something else.
00:18:40.680 | Okay. You also have an evaluator to... oh, I think I'm out of time, sadly. Okay. So here are
00:18:48.120 | some links to more information. We have, on GitHub, Azure AI Foundry Discussions,
00:18:56.200 | where you can come and ask questions about that evaluation SDK, how to build AI agents, and how
00:19:02.760 | to evaluate them. You have the Azure AI Foundry Discord too, where you can come and discuss if you
00:19:08.840 | prefer Discord. And then, at the very end, you have my contact information
00:19:14.040 | if you want to reach out with more questions. So yeah, very packed, sorry, a lot to say
00:19:21.000 | in very little time. So thank you very much. I'm here if you have more questions.
00:19:25.400 | How are you sharing the slides?
00:19:31.160 | That's a good question. I'm going to put them on the Discord.
00:19:37.080 | And in the middle, you have our Discord server. So you can come on
00:19:48.920 | the Discord server, and I will post it there. Thank you very much.