Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft

Chapters
0:00 Introduction
0:30 Overview
1:23 AI agents are all the rage
2:12 How do you evaluate AI agents
2:40 When to evaluate AI agents
3:26 How to evaluate AI agents
4:15 Manual model evaluation
7:10 Evaluating the whole system
14:33 Scale
16:59 Multimodal Models
18:46 Outro
Well, welcome everyone. I'm happy to be here today, and I'm very excited, because this is a very hot topic. I am Cedric Vidal, Principal AI Advocate at Microsoft, and today we are going to talk about evaluating AI agents.

So for those of you who were in this very room for the session just before, my colleagues presented red teaming, which is how you create data that tries to put your AI in bad situations and generate harmful content, so you can verify that it behaves correctly. Today, in this session, we are going to look at more traditional, normal types of evaluations, where you have a dataset that you want to evaluate your AI agents on. So we're going to look at a bunch of things on how to make sure your AIs are safe.
I see that people are still coming into the room; it's okay, please come in, don't be afraid.

To be honest, every single day, even as I was preparing this very presentation and trying the latest models and the latest SDKs, I was amazed at the progress these agents are making. But of course, the more agency we give them and the more independent they become, the more the risk of creating havoc increases. So let's see how we can make sure that your AI agents behave correctly and do not create that kind of mess.
So, how do you go about evaluating your AI agents? Do you submit a couple of prompts to validate that the models respond correctly and go, "Yeah, well, that checks out"? Or do you take a more methodical approach? If you are doing the former, then I have some news for you: you are in the right place, and you need to change something. If you are doing the latter, then today I have some frameworks to show you which might help you improve your evaluation process.
You may be wondering what evaluation is, or when it occurs. If you have already built an app and you're asking yourself, "Should I evaluate now?", well, good news, or rather bad news: you're a bit late. Evaluation starts at the very beginning of your AI development project.
So, to get a sense of how to approach the subject of AI agent evaluation, we distinguish four layers. First, you have the model and the safety system, which are platform-level protections; you don't have to do anything about those when using models on Azure. Then you have the system message and grounding, and finally the user experience, and those are where your app design matters the most. The key takeaway: the foundation model is just one part. Real safety comes from layering smart mitigations at the application layer.
The first thing you should do is manual model evaluation. Which model do you want to use for your AI agent? You want to get a clear sense of how different models will respond to a given prompt, something automatic metrics can sometimes miss. When you launch a batch of evaluations on a dataset, you get aggregate scores back, and you might be left wondering, okay, but I'm not sure exactly how the model behaves on a particular example. Before evaluating at scale, you first need to cherry-pick and look at specific examples.
So now I'm going to demo how to do that in VS Code. The first thing here, as you can see from my history: in VS Code there is a relatively new extension called AI Toolkit, which was released at Build, I believe. Before, I used to go to different websites all over the web to evaluate and compare models. You had GitHub Models, for example, but now you can do it right from your development environment.
And if you're like me and you like to code, that's where I like to do things. So you can ask a question; I asked this one already: what's a good panna cotta recipe with salted caramel butter, which is my favorite? You get a pretty good response with 4.1, but what if you want to compare with 4o, for example? So: what's a good recipe for panna cotta with salted caramel butter? Then you can see, side by side, how the two models respond, 4.1 on the left and 4o on the right. As you can see, 4.1 is a major improvement in terms of throughput: you're going to get the answer much faster. When it comes to the quality of the answer, I looked at it ahead of the conference, and to be honest, I prefer the 4.1 answer. 4o is not too bad, but 4.1 is so much faster that usually that's what you're going to use.
So far, this is spot-checking the answer of a foundation model without any customization; we don't have an AI agent yet. Next, you want to evaluate the whole system. That's where we are going to actually build an AI agent and evaluate the agent with a systemic approach, as a whole. Once you have selected the model, it's time to evaluate it end to end.
So let's jump in, and let me show you how that works in VS Code, same as before. Now you can build an AI agent super fast and evaluate it super fast too. Here, ahead of the talk, I prepared an AI agent to extract agenda and event information from web pages. As an advocate, I give this kind of talk pretty often, so I created an AI agent that helps me easily fetch information from the web and pull out the event names, the list of talks and speakers, the number of attendees, that kind of thing. And it's super easy to do, so I'm going to show you how to create a new agent real quick.
You have an example here with a web scraper. It automatically generates a system prompt saying, "Hey, you are a web exploration assistant that can navigate to websites." It's going to configure an MCP server ready to use, and if I run it, it starts the Playwright MCP server. And, as you can see in the background, it will extract information about the website.
Now I'm going to switch back to the agent that I created, because the one I just showed you is the built-in example. So this one I created, and I'm going to use GPT-4.1. This one is more focused: what I want is to extract the name, date, location, and number of attendees, in a specific format, from that website, which is a Luma event page. So, run. What I did is take the sample AI agent automatically generated by AI Toolkit and customize it for my use case.
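To give an idea, a customized system prompt for this kind of agent might look something like the sketch below. This is a paraphrase on my part, not the exact prompt from the demo, and the output format is my own invention.

```python
# A paraphrased sketch of a customized system prompt for the event-extraction
# agent; the exact wording and output format used in the demo differ.
SYSTEM_PROMPT = """You are a web exploration assistant that can navigate to websites.
Given an event page URL, extract the event name, date, location, and number of
registered attendees. If a field is missing, follow links on the page (for
example to a Luma registration page) to find it. Answer in exactly this format:
Name: <name> | Date: <date> | Location: <location> | Attendees: <number>"""
```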
And here you can see the AI agent working, piloting Playwright, going to the web page, extracting the information, and giving me the response. The event is AI Agents and Startup Talks at GitHub; the location is GitHub headquarters in San Francisco, on June 11th; and for now, 269 people have registered. I hope that after doing this demo we're going to have more, because that's an event that I co-organize.
So now that we have built, customized, and spot-checked what our AI agent does for a specific input, let's see how we can evaluate it on multiple inputs. You have a tab here called Evaluation, which allows you to take the AI agent you previously configured and execute it on a dataset. So here, I can type Run All.
In the background, it's going to run the agent on those inputs and give us the answer in the response column. As you can see, I had executed it before, so you can see what the previous answer was. What's cool here is that you can take that answer and have a look at it, and we can see the information correctly extracted. What's interesting is that the web page here, by the way, does not contain the number of attendees. Still, we get an answer, and it's an interesting one: the agent actually went to the Reactor page, found the link to the Luma page, navigated to the Luma page, and there found the number of attendees. So it mixed information from the Reactor event page and the Luma page to collect everything I needed for my answer. That was a side note.
And I mean, I love it. In both cases, those are good answers, so we can manually evaluate each one with a thumbs up or a thumbs down. Then we can do a few things. We can export the dataset to a JSON file; I'm not going to do it here, but it's basically a JSON Lines file with the results of the evaluation, which you can then re-inject into a more automated system.
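As a rough illustration of how you might re-inject such an export into an automated pipeline, here is a small sketch. The file name and the field names ("query", "response", "rating") are assumptions on my part, not the documented AI Toolkit export schema.

```python
import json

# Load an AI Toolkit evaluation export (JSON Lines: one record per line).
# Field names here are hypothetical; inspect the actual export for the schema.
with open("agent_eval_export.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# Keep only the rows a human marked thumbs-up, e.g. to seed a regression
# dataset for the automated evaluations shown later in the talk.
approved = [r for r in records if r.get("rating") == "thumbs_up"]
print(f"{len(approved)}/{len(records)} answers approved")
```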
Once you have your agent like this, you can hit View Code and generate it using whichever framework you prefer; the OpenAI Agents SDK is usually the one people want to use these days. You then get a fully configured agent, with the MCP server and boilerplate code to run your agent. So let me close that and move on.
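For reference, here is a minimal sketch of what such generated code might resemble, using the OpenAI Agents SDK with the Playwright MCP server. The agent name, instructions, and URL placeholder are mine; this is not the exact boilerplate AI Toolkit emits.

```python
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio


async def main() -> None:
    # Start the Playwright MCP server over stdio (requires Node.js / npx).
    async with MCPServerStdio(
        params={"command": "npx", "args": ["@playwright/mcp@latest"]}
    ) as playwright:
        agent = Agent(
            name="event-scraper",  # hypothetical name
            instructions=(
                "You extract the event name, date, location, and number of "
                "registered attendees from web pages, in a fixed format."
            ),
            model="gpt-4.1",
            mcp_servers=[playwright],
        )
        result = await Runner.run(
            agent, "Extract the event details from <event page URL>"
        )
        print(result.final_output)


asyncio.run(main())
```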
Okay, so we've seen how to build and manually evaluate our AI agent on a spot example, and how to run it locally on a batch of examples, a small batch. Then how do you scale beyond a few samples?
Let me move on to the next slide. Yes, I will share it. Okay, so we've seen AI Toolkit. How do we scale beyond what I just showed? Eyeballing is great to get a sense of how things work, but what you want is to go through a more thorough, wider range of checks, and you want to automate this. Azure AI Foundry gives you a wide set of built-in evaluators to automate and scale those evaluations. We've got AI-assisted quality checks, like groundedness, fluency, and coherence, perfect for measuring how well your agent performs in realistic conversations. You also find classic NLP metrics, F1, BLEU, ROUGE, for benchmark comparisons, as well as a suite of AI-assisted risk and safety evaluators. And you can also customize and build your own. Once you've spot-checked, the next step is to scale, and for that you need automated evaluation to measure quality at scale. You can do it either in the Azure AI Foundry portal or via code, and I'm going to show you how to do it via code. That's important, because we can define what we want to measure based on our app, using code. So, now a demo. Crazy how fast 20 minutes goes.
So here's a notebook. Given the time we have, I'm not going to execute it, because it takes a bit of time, but here you have the Python code, and I'm going to share the link to the notebook at the end of the presentation. The notebook allows you to programmatically connect to an Azure AI Foundry project and run those evaluations. The key part is what you can define: these are quality evaluators for relevance, coherence, groundedness, fluency, and similarity. And you have an evaluate function that takes those evaluators, takes the dataset you want to evaluate, and bulk-evaluates the AI agent on all those metrics.
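In outline, the code looks something like the following sketch with the azure-ai-evaluation SDK. This is a minimal sketch, not the exact notebook from the talk; the endpoint, key, deployment, and dataset file name are placeholders.

```python
# pip install azure-ai-evaluation
from azure.ai.evaluation import (
    evaluate,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)

# Model used as the "judge" by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-key>",
    "azure_deployment": "gpt-4.1",
}

result = evaluate(
    # JSON Lines file; rows need fields such as "query", "response",
    # "context" (for groundedness), and "ground_truth" (for similarity).
    data="camping_qa.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "groundedness": GroundednessEvaluator(model_config),
        "fluency": FluencyEvaluator(model_config),
        "similarity": SimilarityEvaluator(model_config),
    },
)
print(result["metrics"])  # aggregate scores; result["rows"] has per-question details
```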
And the results are here. On this dataset, which is about camping ("What is the capital of France?", "Which tent is the most waterproof?", "What camping table...", and so on), you can see the results of the evaluation for each question. You can also configure a threshold: the evaluator gives you a score between one and five, and depending on your application, you might want your AI agent to be more or less strict. The threshold is obviously not going to be the same if you're in the gaming industry, where more violent content is usually accepted, as it is if you are building an application for kids.
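As a sketch of that thresholding idea, here is hypothetical post-processing on the result returned by evaluate() above. The output column name is an assumption about the SDK's flattened row format and may differ across versions.

```python
# Apply your own pass/fail threshold to the 1-to-5 relevance scores.
# "outputs.relevance.relevance" is an assumed key; inspect result["rows"]
# to see the actual column names your SDK version produces.
PASS_THRESHOLD = 4  # stricter for a kids' app, perhaps looser for a game

failures = [
    row for row in result["rows"]
    if row.get("outputs.relevance.relevance", 0) < PASS_THRESHOLD
]
print(f"{len(failures)} of {len(result['rows'])} rows fall below the threshold")
```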
And so I'm going to move on to the next one; in this case, this was passing. I just want to show the next evaluators that we have, which are also very cool. Now you can evaluate multimodal models, mixing text and images. This is very important, and it works for multi-turn conversations too. So here I have an image, on purpose: I tried to find a violent image, and it's hard to find something violent that you can show publicly at a conference, right? So I did what I could, and I spent a lot of time looking; believe me, when you search for something violent on the web, you see things you don't want to see. So I found this one. Let's go straight to the end and see what the evaluator tells us. The system's response, in short: the image actually depicts a character with numerous pins or nails protruding from the head, which is a graphic and violent depiction. But what's interesting is that the score is four; it's not five, it's not the max. So it's failing here, but if, like I said, you were doing a video game with violent content, you could raise the threshold to four and say, hey, it scored four, I'm fine with it.
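Here is a minimal sketch of multimodal safety evaluation, assuming the azure-ai-evaluation SDK's violence evaluator and its conversation format for mixed text-and-image content. The project details, image URL, and message texts are placeholders.

```python
from azure.ai.evaluation import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# Safety evaluators are backed by an Azure AI Foundry project.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}
violence_eval = ViolenceEvaluator(
    credential=DefaultAzureCredential(), azure_ai_project=azure_ai_project
)

# A multi-turn, multimodal conversation: text plus an image URL.
conversation = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this image show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "The image depicts a character with pins protruding from the head."}
            ],
        },
    ]
}

result = violence_eval(conversation=conversation)
print(result)  # severity score and label; compare against your own threshold
```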
And there is something else interesting at the end that I wanted to show you on another example, but okay, I'm going to move on.
You also have an evaluator to... oh, I think I'm out of time, sadly. Okay, so here are some links to more information. We have the Azure AI Foundry discussions on GitHub, where you can come and ask questions about the evaluation SDK, how to build AI agents, and how to evaluate them. You have the Azure AI Foundry Discord too, where you can come and discuss if you prefer Discord. And at the very end, you have my contact information if you want to reach out with more questions. So yeah, a very packed session, sorry, a lot to say in very little time. Thank you very much; I'm here if you have more questions.
That's a good question. I'm going to put them on the Discord. In the middle there, you have our Discord server, so come on the Discord server and I will post them there. Thank you very much.