
Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft


Chapters

0:00 Introduction
0:30 Overview
1:23 AI agents are all the rage
2:12 How do you evaluate AI agents
2:40 When to evaluate AI agents
3:26 How to evaluate AI agents
4:15 Manual model evaluation
7:10 Evaluating the whole system
14:33 Scale
16:59 Multimodal Models
18:46 Outro

Transcript

Well, welcome everyone. I'm happy to be here today, I'm very excited, it's a very hot topic. So, I am Cedric Vidal, Principal AI Advocate at Microsoft, and today we are going to talk about how to evaluate agents. For those of you who were in this very room for the session just before, my colleagues presented red teaming, which is how you create data that tries to put your AI in a bad situation and generate bad content, so you can verify that it behaves correctly.

Today, in this session, we are going to look at more traditional, normal types of evaluations, where you have a dataset that you want to run against your AI agents. So, we're going to look at a bunch of ways to make sure your AI agents are safe. I see that people are still coming into the room, it's okay, please come in, don't be afraid.

So, we don't want that, right? AI agents are all the rage. To be honest, every single day, even as I was preparing this very presentation and trying the latest models and the latest SDKs, I was amazed at the progress that those agents are making. But of course, the more agency we give them, the more independent they become, and the more the risk of creating havoc increases.

So, let's see how we can make sure that your AI agents behave correctly and do not create that kind of mess. How do you go about evaluating your AI agents? Do you submit a couple of prompts to validate that the model responds correctly, go, "Yeah, well, that checks out, good to go," and put it in production?

Or do you take a more methodical approach? If you are doing the former, then I have some news for you: you are in the right place, and you need to change something, because it's not going to work. If you are doing the latter, then today I have some frameworks to show you which might help you improve your evaluation process.

So, when should you start doing evaluations? You may be wondering what evaluation is, or when it should happen. And, I mean, if you have already built an app and you're asking yourself, "Should I evaluate now?", well, good news, or bad news: you're a bit late. You should have started way earlier.

Evaluation starts at the very beginning of your AI development project. The sooner, the better. So, to get a sense of how to approach the subject of AI agent evaluation, we distinguish four layers. First, you have the model and the safety system, which are platform-level protections, and these are built into Azure.

You don't have to do anything about those when using models on Azure. Then you have the system message and grounding, and the user experience. For those layers, your app design matters the most. The key takeaway: the foundation model is just one part.

Real safety comes from layering smart mitigations at the application layer. And we're going to see how to do that. The first thing you should do is manual model evaluation. So, which model do you want to use for your AI agent? You want to get a clear sense of how different models will respond to a given prompt.

That's something automatic metrics can sometimes miss. When you launch a batch of evaluations on a dataset, sometimes all you get is a big average score, and you might be left wondering, okay, but how does it actually behave on one very specific example? Before evaluating at scale, you first need to cherry-pick and look at specific examples.

So, now, I'm going to demo how to do that in VS Code. The first thing here: in VS Code, there is a relatively new extension called AI Toolkit, which was released at Build, I believe.

And, oh my god, I love that extension. Before, I used to go to different websites all over the web to evaluate and compare models. I mean, you had GitHub Models, but now you can do it right from your development environment. And if you're like me and you like to code, that's where I like to do things.

AI Toolkit, yes. So, you can ask, and I did ask that question already: what's a good panna cotta recipe with salted caramel butter? Which is my favorite. And you get a pretty good response with 4.1, but what if you want to compare with 4o, for example? So: what's a good recipe for panna cotta with salted caramel butter?

And then you can see, side by side, how the two models respond: 4.1 on the left and 4o on the right. As you can see, 4.1 is a major improvement in terms of throughput; you're going to get the answer much faster. When it comes to the quality of the answer, I looked at it ahead of the conference, and to be honest, I prefer the 4.1 answer.

4o is not too bad, but 4.1 is so much faster that usually that's what you're going to use. So, that's spot-checking the answer of a foundation model without any customization; we don't have an AI agent yet. Next, you want to evaluate the whole system. That's where we are going to actually build an AI agent and evaluate the agent as a whole, with a systemic approach.

Once you have selected the model, it's time to evaluate it end to end. So, let's jump in and let me show you how that works in VS Code. Same thing, that same AI Toolkit extension for VS Code. Wow. I mean, to be honest, I love it, because now you can build an AI agent super fast and evaluate it super fast too.

So, here I prepared an AI agent ahead of time to extract agenda and event information from web pages. As an advocate, I give this kind of talk pretty often, so I basically created an AI agent that helps me easily fetch information from the web and pull the event name, the list of talks and speakers, the number of attendees, that kind of thing.

And it's super easy to do. So, I'm going to show you how to create a new agent real quick. You have an example here with a web scraper, and it automatically generates a system prompt saying, "Hey, you are a web exploration assistant that can navigate to websites." It's going to configure an MCP server ready to use.

And if I run it, it's going to start the Playwright MCP server. By default, it uses an example domain, and as you can see in the background, it extracts information about that website. Now, I'm going to switch back to the agent that I created, because the one I just showed you is the built-in one.

So, this one I created, and I'm going to use GPT-4.1. This one is more focused: what I want is to extract the name, date, location, and number of attendees in a specific format, from that website, which is a Luma event page. So, run. What I did is take the sample AI agent that AI Toolkit automatically generated and customize it for my use case.
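To make that concrete, the customized system prompt looks roughly like the sketch below. The wording is illustrative, not the exact prompt from the demo:

```python
# Illustrative system prompt for the customized agent (hypothetical wording,
# not the exact prompt used in the demo).
SYSTEM_PROMPT = """\
You are a web exploration assistant. Given an event page URL, use the
browser tools available to you to open the page and extract:
- the event name
- the date
- the location
- the number of registered attendees
If a field is missing from the page, follow links to related event pages
(for example a Luma registration page) to find it. Reply with a compact
JSON object using the keys name, date, location, attendees.
"""
```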

And here, you can see the AI agent working, piloting Playwright, going to the webpage, extracting the information, and giving me the response. So, the event is the AI agents and startup talks at GitHub, the location is GitHub headquarters in San Francisco, on June 11th, and for now we have 269 people registered.

And I hope that after this demo, we're going to have more, because that's an event that I co-organized in San Francisco. So now that we have built, customized, and spot-checked what our AI agent does for a specific input, let's see how we can evaluate it on multiple inputs.

So you have a tab here called Evaluation, which allows you to take that previously configured AI agent and execute it on a dataset. So here, I can hit Run All. Okay. And in the background, it's going to run the agent on those inputs and give us the answer in the response column.

As you can see, I had executed it before, so you can see what the previous answer was. But what's cool here is that you can take that answer and have a look at it, and as you can see, the information is correctly extracted. What's interesting is that the webpage here, by the way, does not contain the number of attendees.

Still, we have an answer here that is interesting, because the agent actually went to the Reactor page, found the link to the Luma page, navigated to the Luma page, and on the Luma page found the number of attendees. So it combined information from the Reactor event page and the Luma page to collect everything I needed in order to get my answer.

Okay. So that was a side note, and I mean, I love it. In both cases, those are good answers, so we can manually rate each one with a thumbs up or a thumbs down. And then we can do a few things. We can export the dataset to a JSON file.

So I'm not going to do it, but it's basically a JSON Lines file with the results of the evaluation that you can then re-inject into a more automated system. And then, once you have your agent like this, you can use View Code to generate it with whichever framework you prefer; OpenAI Agents is usually the one people want to use these days.
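To give a sense of that export, each line of the JSON Lines file corresponds to one row of the evaluation grid. The field names below are illustrative, not the exact schema AI Toolkit writes:

```
{"query": "https://example.com/event-a", "response": "{\"name\": \"AI Agents and Startup Talks\", \"date\": \"June 11\", \"location\": \"GitHub HQ, San Francisco\", \"attendees\": 269}", "rating": "thumbs_up"}
{"query": "https://example.com/event-b", "response": "{\"name\": \"Some Other Event\", \"date\": \"unknown\", \"location\": \"unknown\", \"attendees\": null}", "rating": "thumbs_down"}
```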

And then you have an agent all configured with the MCP server and boilerplate code to run your agents. So let me close that and move on. Okay. So we've seen how to build and manually evaluate our AI agents on a spot example, and how to run them on a batch of examples locally.
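As a rough idea of what that generated boilerplate amounts to, here is a minimal sketch of a comparable agent written against the OpenAI Agents SDK for Python with the Playwright MCP server. The class names, parameters, and URL are assumptions based on that SDK, not the exact code AI Toolkit emits:

```python
# Minimal sketch: an event-extraction agent using the OpenAI Agents SDK
# (pip install openai-agents) together with the Playwright MCP server.
# Names and parameters are assumptions, not AI Toolkit's generated code.
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio


async def main() -> None:
    # Launch the Playwright MCP server as a local subprocess.
    async with MCPServerStdio(
        params={"command": "npx", "args": ["@playwright/mcp@latest"]}
    ) as playwright:
        agent = Agent(
            name="Event info extractor",
            instructions=(
                "You are a web exploration assistant. Navigate to the given "
                "event page and return the event name, date, location and "
                "number of registered attendees as JSON."
            ),
            mcp_servers=[playwright],
        )
        result = await Runner.run(
            agent,
            "Extract the event information from https://example.com/event-page",
        )
        print(result.final_output)


asyncio.run(main())
```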

So, a small batch. Then how do you scale beyond a few samples? Let me move on to the next slide. Yes, I will share it. Okay, so we've seen AI Toolkit. So how do we scale beyond what I just showed?

Because eyeballing is great to get a sense of how it works, but what you want is to go through a more thorough, wider range of checks, and you want to automate this. Well, Azure AI Foundry gives you a wide set of built-in evaluators to automate and scale those evaluations.

We've got AI-assisted quality checks, like groundedness, fluency, and coherence, perfect for measuring how well your agent performs in realistic conversations. You also find classic NLP metrics, F1, BLEU, ROUGE, for benchmark comparisons, as well as a suite of AI-assisted risk and safety evaluators. And you can also customize and build your own.
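On that last point, a custom evaluator for the Azure AI Evaluation SDK is essentially a callable that takes the row fields as keyword arguments and returns a dictionary of metrics. This answer-length checker is a hypothetical sketch of the pattern, not a built-in evaluator:

```python
# Sketch of a custom evaluator: any callable that accepts row fields as
# keyword arguments and returns a dict of metrics can be plugged into the
# evaluate() call alongside the built-in evaluators.
class AnswerLengthEvaluator:
    """Hypothetical evaluator that scores a response by its word count."""

    def __call__(self, *, response: str, **kwargs) -> dict:
        words = len(response.split())
        return {
            "answer_length": words,
            # Arbitrary illustrative rule: flag overly terse answers.
            "answer_length_pass": words >= 10,
        }

# Usage: evaluators={"answer_length": AnswerLengthEvaluator(), ...}
```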

Once you've spot-checked, the next step is to scale, and for that you need automated evaluation to measure quality at scale. You can do it either in the Azure AI Foundry portal or via code, and I'm going to show you how to do it via code.

It's important, because we can define what we want to measure based on our application's goals. So, now, a demo. Crazy how fast 20 minutes goes. So here's a notebook, and given the time we have, I'm not going to execute it, because it takes a bit of time.

But here you have the Python code, and I'm going to share the link to the notebook at the end of the presentation. The notebook allows you to programmatically connect to an Azure AI Foundry project and run those evaluations.

The key thing here is what you can define. Those are quality evaluators that evaluate relevance, coherence, groundedness, fluency, and similarity. And you have an evaluate function that takes those evaluators, takes the dataset that you want to evaluate, and bulk-evaluates the AI agent on all those metrics.
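To make that concrete, here is a sketch of what such a notebook cell can look like with the azure-ai-evaluation package. The endpoint, deployment, and file names are placeholders, and the exact parameters may differ from the notebook shown in the talk:

```python
# Sketch of a bulk evaluation run with the Azure AI Evaluation SDK
# (pip install azure-ai-evaluation). Endpoint, key, deployment, and file
# names are placeholders.
from azure.ai.evaluation import (
    evaluate,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)

# Model used as the judge by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4.1",  # hypothetical deployment name
}

evaluators = {
    "relevance": RelevanceEvaluator(model_config=model_config),
    "coherence": CoherenceEvaluator(model_config=model_config),
    "groundedness": GroundednessEvaluator(model_config=model_config),
    "fluency": FluencyEvaluator(model_config=model_config),
    "similarity": SimilarityEvaluator(model_config=model_config),
}

# data.jsonl holds one record per line with query, response, context, and
# ground_truth fields (hypothetical file name and schema).
result = evaluate(
    data="data.jsonl",
    evaluators=evaluators,
    output_path="evaluation_results.json",
)
print(result["metrics"])  # aggregate score per evaluator
```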

And the results are here. So, on that dataset, which is about camping, with questions like what is the capital of France, which tent is the most waterproof, which camping table, whatever, you can see for each question the results of the evaluation, and you can also configure a threshold.

So it's going to give you a score between one and five, and you decide which threshold to configure, because depending on your application, you might want your AI agent to be more or less strict. Whether you're in the gaming industry, where they usually accept more violent content, or you're building an application for kids, obviously the threshold is not going to be the same.
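As a minimal sketch of that thresholding idea, applied to the result object from the evaluate() sketch above; the flattened column names are an assumption about how evaluate() reports per-row outputs:

```python
# Minimal sketch: apply your own pass/fail threshold to the per-row 1-5
# scores. The "inputs.*" / "outputs.*" column names are assumptions about
# the shape of evaluate()'s row output.
THRESHOLD = 4  # a kids' app might require 4 or more, a game might accept 3

for row in result["rows"]:
    score = row.get("outputs.relevance.relevance")
    verdict = "pass" if score is not None and score >= THRESHOLD else "fail"
    print(f"{row.get('inputs.query', '<no query>')}: relevance={score} -> {verdict}")
```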

And so I'm going to move on to the next one. In this case, this was passing. I just want to show the next evaluators that we have, which are also very cool. Now you can evaluate multimodal models, mixing text and images. This is very important, and it also works for multi-turn conversations.

So here, I have an image on purpose. I tried to find a violent image, and it's hard to find something violent that you can show publicly at a conference, right? So I did what I could, and I spent a lot of time looking, and believe me, when you search for something violent on the web, you see things you don't want to see.

And so I found that one, and let's go straight to the end and see what it tells us. So, the system's response, blah, blah, blah: the image actually depicts a character with numerous pins or nails protruding from the head, which is a graphic and violent depiction.

But what's interesting is that the score is four. It's not five, it's not the max. So it's failing here. But, for example, if you were doing, like I said, a video game with violent content, you could raise the threshold to four and say, hey, it scored four, I'm fine with it.
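Here is a rough sketch of running that kind of image-plus-text safety check programmatically. The evaluator class and the conversation payload shape are assumptions based on the azure-ai-evaluation safety evaluators (the SDK also ships dedicated multimodal variants), so check the documentation for the exact API:

```python
# Rough sketch of a violence check over a text + image exchange.
# Class name and payload shape are assumptions; verify against the
# azure-ai-evaluation documentation before relying on this.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ViolenceEvaluator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<foundry-project>",
}

violence_eval = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

# OpenAI-style multi-part message content, mixing text and an image URL.
conversation = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/suspicious-image.png"}},
            ],
        },
        {
            "role": "assistant",
            "content": [{
                "type": "text",
                "text": "The image depicts a character with pins protruding from its head.",
            }],
        },
    ]
}

result = violence_eval(conversation=conversation)
print(result)  # severity label, score, and the evaluator's reasoning
```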

I'm fine with it. Uh, and so in order to be able to generate that kind of image, uh, and at the end, uh, what's interesting, uh, and I'm going to show you on another. Uh, okay. I'm going to move on. Um, I showed you that. Um, okay. I don't have, I wanted to show you something else.

Okay. You also have an evaluator to... oh, I think I'm out of time, sadly. Okay. So here are some links to more information. We have the Azure AI Foundry discussions on GitHub, where you can come and ask questions about the evaluation SDK, how to build AI agents, and how to evaluate them.

You also have the Azure AI Foundry Discord, where you can come and discuss if you prefer Discord. And then at the very end, you have my contact information if you want to reach out with more questions. So yeah, very packed, sorry, a lot to say in very little time.

So thank you very much. I'm here if you have more questions. "How are you sharing the slides?" That's a good question. I'm going to put them on the Discord. In the middle of the slide, you have our Discord server, so come to the Discord server and I will post them there.

Thank you very much.