Hey, thank you and welcome, everybody. I'm Keiji Kanazawa from Microsoft. I work in product on Azure AI Foundry. If you were at the keynote session from Asha, she's my CVP, so I work in that org, and Nagkumar is an engineer on the team. He's going to be showing some code in a little bit, which I'm sure most of you are excited to see.
So here we are at the AI Engineer conference. I'm sure you're learning about all kinds of stuff: reinforcement learning, agents, SWE agents, evals. And we're all really excited, at least I am, to get AI into the hands of people and help people, whether they're your end users or your internal users.
And so AI is obviously where it's at, or I think it's where it's at. But it comes with a bunch of headlines that you've probably seen. Some of them, again, if you were at the keynote this morning with Simon, he showed some examples.
It can be very easy to trick chatbots into saying things you don't necessarily want them to say. You can actually trick them into giving you information that you don't want leaking out. And AI engineering is built on a whole ecosystem of different things, including Python packages, npm packages, and other hosted services that you may be using.
And of course, San Francisco is the home of self-driving cars. This picture is a frame from a video clip where a self-driving car is driving happily right past a school bus with its stop sign out. And you might think, hey, what does that have to do with me?
Well, here's one test you can apply, whether it's to something you've built or something you're thinking of building. It's really easy to get around some of the defenses of AI models. For example, with the prompt on the left, if you ask how to loot a bank, a lot of models will refuse to answer.
Right? They'll say, oh no, I can't help you with that. But, and some of you saw examples of this, how many of you were at Sander's workshop yesterday on prompt engineering and red teaming? Yeah. If you preface that question with, say, a bit of your life story, you can sometimes convince the AI to tell you something it's not supposed to.
And there are also other tricks. On the right-hand side, the prompt is "how to loot a bank" spelled backwards, right to left. It turns out that's one of the patterns that can trick an AI model into giving you the answer. And especially, we touched on it this morning as well, of course it's all agents, agents, agents, 2025 the year of agents.
And there are a lot of concerns, if you talk to businesses, about how in this world of agents AI can be tricked into different kinds of risks and malfunctions. But we're here at the AI Engineer conference, and what I want to convey is that we as engineers know how to do this stuff, right?
Engineers build bridges and dams that people trust. We build trucks and trains. AI engineering is early, so we've got a lot of work to do to get to the point where people trust AI as much as they trust bridges. But this is something we know how to do as engineers: we build something, we check it, we test it, and we continue to iterate.
And that's what we're here to show you how to do. As engineers, we also rely on not just ourselves but other people. So what we like to say is that trust is a team sport. When we're looking to build trustworthy AI systems, we depend on other people.
Engineers need to depend on people who have deep expertise in areas like security and AI risk. At Microsoft, we have a team called the Microsoft AI Red Team, which I've been working with for a few years. They were one of the pioneers in identifying risks in AI in general, as well as in LLMs.
Two or three years ago, they were already saying, hey, these GPT-3 and GPT-4 models, you can get them to do things that you really don't want them to do. So in Azure AI Foundry we partnered with the AI Red Team to offer a solution that makes it easy for you, AI engineers, to basically have a teammate that can help you with AI red teaming.
The AI Red Team put out a Python package called PyRIT, P-Y-R-I-T. What we offer is a hosted version of it, wrapped in an easy-to-use SDK, along with a hosted dashboard to show you the evals that come out of it. And here to show you how it works is Nagkumar.
Awesome. Thank you, KG. Here we go. So hello. This is the sample project that I'm going to run for you all. It's a simple RAG-on-PostgreSQL app from Azure-Samples; I'll have the QR code up again later. It's running locally right now, and you can ask simple questions like this.
And it's talking to a locally running model via Ollama. Well, the tool call didn't work. Live demos, right? Anyway, there are logs for everything in this pane. What we're trying to showcase is a Semantic Kernel agent, and here's some code for it. It takes in Azure chat completions.
And our red team plugin is something our SDK exposes. It has all the functions an agent needs in order to call into a red team agent and help someone with their red teaming process. And then it's a simple chat completion agent after that. So for now, I'm going to start running this.
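For reference, a minimal sketch of what that Semantic Kernel setup might look like. The RedTeamPlugin import path and its constructor arguments are assumptions based on the description in the talk, not a confirmed API; check the Foundry docs and the sample repo for the exact names.

```python
# Sketch of a Semantic Kernel agent backed by Azure chat completions, with a
# red-teaming plugin attached. RedTeamPlugin and its arguments are assumptions.
import asyncio

from semantic_kernel import Kernel
from semantic_kernel.agents import ChatCompletionAgent
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

from azure.ai.evaluation.red_team import RedTeamPlugin  # assumed import path


def call_rag_app(query: str) -> str:
    """Stand-in target; the real one forwards the query to the local RAG app."""
    return "placeholder response from the RAG app"


kernel = Kernel()
# Azure chat completions backs the agent (endpoint/key read from env vars).
kernel.add_service(AzureChatCompletion(deployment_name="gpt-4o"))
# The plugin contributes the red-teaming functions the agent can call:
# fetch a harmful prompt, convert it (e.g. Base64), send it to the target.
kernel.add_plugin(RedTeamPlugin(target=call_rag_app), plugin_name="redteam")

agent = ChatCompletionAgent(
    kernel=kernel,
    name="RedTeamCopilot",
    instructions="Help the user red team the target application.",
)


async def main() -> None:
    # Recent Semantic Kernel versions accept a plain string here; automatic
    # function calling lets the agent invoke the plugin functions as needed.
    response = await agent.get_response(
        messages="Get me a harmful prompt in the violence category."
    )
    print(response)


asyncio.run(main())
```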
And when I run this, it will go through a few user inputs that I have hardcoded, and then we can jump into interactive mode. The target for this Semantic Kernel agent is going to be the same RAG app, which is running locally. So once this loads in, the first question... oops.
Live demos. Looks like tool calling isn't working today. But anyway, this call to the RAG app can be swapped for a call to any other application that takes a query as input and responds with a string as output. Internally, we ask you for a callback into your application, and then we can run evaluations on what it returns.
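The callback he describes can be any Python callable that takes a query string and returns the application's answer as a string. A minimal sketch, with a hypothetical localhost endpoint and JSON shape:

```python
# Hypothetical callback target: any function with this shape will do. The URL
# and JSON fields are placeholders, not the sample app's actual API.
import requests


def call_rag_app(query: str) -> str:
    resp = requests.post(
        "http://localhost:8000/chat",   # wherever your app listens
        json={"message": query},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["answer"]        # return plain text for evaluation
```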
So in this agent mode, here's what would usually happen. I can scroll up to a previous output that ran earlier, so you know I'm not lying to you all. These were the strategies that were available, and then I asked it, hey, get me a harmful prompt in the violence category.
And then it gives me some sort of prompt, and I'm like, hey, send it to my target. This is what the target responds with: some details about ski goggles and products, which is the kind of answer that's supposed to come from our database.
And then I say, hey, convert the prompt using Base64, and the agent converts it. Then, hey, now send it to my target, and the target responds with something else. So this is an easy, copilot-style way for anyone to get started with red teaming an application.
Now, we can take it a step further and run the whole scan end to end. This is how you would run the scan; you saw a little bit of this code that KG showed earlier. You usually set up your AI project, throw in the URL, and then initialize it with that URL and your credentials.
You can select risk categories. We have four risk categories right now, and they map to our evaluators. This is how you set them up; if you don't include any, we include all of them by default. And then the number of objectives is the number of questions that will be sent to your application.
And then the scan method looks like this. You give it a scan name; you can give an optional output path, which stores all the results there; and attack strategies takes a list of different attack strategies. I'll pull up a docs page later on, which has all the information about the different attack strategies you can use.
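Putting those pieces together, the setup might look roughly like this. The class, enum, and parameter names follow the azure-ai-evaluation red-teaming preview as best they can be reconstructed from the talk; verify them against the docs page he mentions.

```python
# Sketch of the end-to-end scan setup just described. Names and signatures are
# my reading of the preview SDK, not a verbatim copy of the demo code.
import asyncio

from azure.ai.evaluation.red_team import AttackStrategy, RedTeam, RiskCategory
from azure.identity import DefaultAzureCredential

red_team = RedTeam(
    # Your Foundry project URL plus credentials.
    azure_ai_project="https://<account>.services.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
    # Omit risk_categories to get all four (violence, hate/unfairness,
    # sexual, self-harm) by default.
    risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
    # Number of attack objectives (questions) generated per category.
    num_objectives=5,
)


async def main() -> None:
    await red_team.scan(
        target=call_rag_app,                 # the callback shown earlier
        scan_name="rag-app-scan",
        attack_strategies=[AttackStrategy.EASY, AttackStrategy.Base64],
        output_path="redteam-results.json",  # optional; stores all results
    )


asyncio.run(main())
```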
There are combination strategies like EASY, which includes Flip, the one that reverses the string, and things like that. There's also Morse. These are simple converters that live within PyRIT but are exposed via our SDK, so we can offer them to people to use easily.
You can also compose an attack from two different strategies. So you can take a tense-converter strategy and then apply a URL-style conversion on top of it, so it does both and then sends it out. And then you pass in a target, a target which apparently decided not to work today.
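The composed strategy he describes might be expressed like this; AttackStrategy.Compose and the member spellings (Tense, Url) are assumptions to be checked against the strategy docs.

```python
# Composing two strategies, as described above: rephrase into a different
# tense, then URL-encode the result before sending it to the target.
# Compose and the member names here are assumptions, not verified spellings.
attack_strategies = [
    AttackStrategy.EASY,  # bundle of simple converters (Flip, Morse, ...)
    AttackStrategy.Compose([AttackStrategy.Tense, AttackStrategy.Url]),
]
```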
Again, it calls the same application. Once you run this, you usually see an output that looks like this. Whoops, I'm going to unplug the ethernet cable. Okay, there we go. This is from when I had it running yesterday with GPT-4o as my model, and GPT-4o comes with a lot of security built in within Azure AI Foundry.
So once you have all those guardrails up, it held up pretty well. It was a very small sample size, 160; I think I selected five or so harm types with a few categories, and none of the attacks was able to break into our application. But then I switched it around and used Phi-3.
And with Phi-3 you can see the results show that five out of 40 attacks in the hate-and-unfairness category were successful. So we can take a look at it, filter the data based on what was successful, and then look at the responses that our evaluators determined to be harmful.
And finally, we have one more way of doing this. Initially, a lot of people might just be building models; you don't even have an application yet. You can directly run the scan against an Azure OpenAI config. So if you have models running on Azure, you can set one up as a target, which is just these three things.
And then once you have these three things, you can run the whole scan. This scan runs against the model directly and gives you an output. So I guess this should be able to work, let's see, I can probably run it here. There we go. So, yep, this runs a direct model scan.
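The "three things" aren't spelled out on the recording; a typical Azure OpenAI target configuration would be an endpoint, a deployment name, and a key, so a sketch (field names assumed) might look like this:

```python
# Scanning a model directly instead of an application: pass an Azure OpenAI
# configuration as the target. The three fields below are my reading of the
# "three things" mentioned; the exact field names may differ in the SDK.
azure_openai_target = {
    "azure_endpoint": "https://<account>.openai.azure.com",
    "azure_deployment": "gpt-4.1-nano",
    "api_key": "<key>",  # or use Entra ID auth instead of a key
}

# Same scan call as before, just with the model config as the target:
# await red_team.scan(target=azure_openai_target, scan_name="model-scan")
```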
And here I have some results from a pre-run; I was prepared for this. This is GPT-4.1 with all the guardrails taken off, and here's the result: 25% of the violence-category attacks were successful, and 20% of all the difficult-complexity attacks were successful. And again, you can filter the data and see which ones succeeded.
And, yep, there we go, lots of violence. And then here's one where we can see what strategy it was; I think it was a Caesar-cipher encoding strategy. You can see that the assistant decoded it, and we did not want that to happen.
So that was a successful attack. That's one of the things. And then here's the result when you set up all the guardrails for GPT-4.1 nano, and you can see the difference: we reduced the attack success rate by a bit. So that's an overview of how things go and how this scan runs.
It usually gives you an ETA, six minutes here, and we'll probably be out of time by then. But as soon as it's done, it gives you a URL that takes you directly to this page. So that's safety evals and AI red teaming. I'll be at the Microsoft booth towards the end for questions.
So back to you, KG. Yeah, thanks. So basically, the rest of the talk is about how this fits into an overall strategy. AI red teaming is a really important part of your defenses and of the toolbox you use to develop and deploy trustworthy AI systems.
But really what you want to do is incorporate this within a whole framework, again from the engineering mindset, a process for getting these things out. So first, before you develop a production application that goes to customers, you want to map out what kinds of risks you're anticipating.
Is this an agent? Is it using external data or internal customer data? You want to think about the risks your app is going to have, plan for them, start implementing the guardrails in the first place, and then do the evaluations, of which red teaming is one of the possibilities.
So within Azure AI Foundry, we have a suite of evaluators. There are quality evals, and I think there have been a lot of talks today here at AI Engineer about quality evals, which is something you can do in Foundry. And then there's a whole set of risk and safety evaluators, of which the AI red teaming agent is one, but we also have a lot of different classifiers, for both input and output, because you want to check both.
And then there's a specific set of evaluators we just created for agentic applications as well, like whether the agent is following your instructions, things like that. And you can add your own custom evaluators. Nagkumar also showed you some of the mitigation strategies that you want to apply.
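Before moving on to mitigations, here is a minimal sketch of calling one of those risk-and-safety evaluators directly on a query/response pair; the class name and project-URL form follow the azure-ai-evaluation package as best understood here, and the project URL is a placeholder.

```python
# Sketch of running a built-in risk-and-safety evaluator on a single
# query/response pair; in practice you would run these over a dataset.
from azure.ai.evaluation import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

violence_eval = ViolenceEvaluator(
    azure_ai_project="https://<account>.services.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
)

result = violence_eval(
    query="What are your store hours?",
    response="We are open 9am to 6pm on weekdays.",
)
print(result)  # includes a severity label, a score, and the evaluator's reasoning
```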
So, for example, there are guardrails and controls that you can have in your application. Once you've run the AI red teaming agent and figured out that, say, 20% of attacks get through, what do you do? Well, that's the point at which you apply the guardrails: content filters and other capabilities that, again, we have in Azure AI Foundry and that make it easy to add them.
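Foundry's content filters are configured on the model deployment rather than called from code, but if you want a programmatic guardrail inside your own app, a standalone check with the Azure AI Content Safety SDK might look like this sketch (endpoint, key, and threshold are placeholders):

```python
# Sketch of an output-side content filter using Azure AI Content Safety; a
# similar check can run on user input before it ever reaches the model.
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint="https://<content-safety-resource>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<key>"),
)


def is_safe(text: str, max_severity: int = 2) -> bool:
    """Block the text if any harm category exceeds the chosen severity."""
    analysis = client.analyze_text(AnalyzeTextOptions(text=text))
    return all(
        (item.severity or 0) <= max_severity
        for item in analysis.categories_analysis
    )


model_output = "Here are today's ski goggle recommendations..."
if not is_safe(model_output):
    model_output = "Sorry, I can't share that."
```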
And among others, there's one called Prompt Shields, which guards against prompt-based attacks, especially the kind involved in AI red teaming. And we have time for maybe one or two questions. Yeah, I think we have a minute, maybe.
Yeah, we have two more. So, yeah. Do you know how the guardrails work under the hood? Like, is it filtering after it gets the answer, or is it before the LLM even starts? Yeah, so there are both kinds. You can apply both, right? Yeah, but the guardrail feature.
Yes, how does it work under the hood? Do you know? I mean, I think we have both: filters at the input end as well as filters at the output end. So there are content filters, for example, for when people type in something like CBRN content or a how-to-build-a-bomb kind of thing.
So there are the input guardrails, but then there are also guardrails on the output side, where I want to make sure I'm not outputting sexual content or something like that. So there are guardrails on the output of the model as well. And that's what's happening under the hood with the guardrail feature, or...? Yeah, yeah, with the guardrail feature.
Do we have to implement several features to get that? Oh, no, there are guardrails that AI Foundry offers directly, and that's what's happening under the hood. You also have the ability, for example, to just give the model raw inputs if you turn off all the content filters, and that was some of what Nagkumar was showing.
So the guardrails are not in the model itself. The model is still the raw model, and the guardrails are actually outside it. Does that answer your question? Thank you. Okay, all right, thanks. I think we're out of time. So thank you for coming, and definitely get started with AI red teaming.
If you're not doing it today, definitely get started. Here's a link to the code as well as the docs. And thank you for coming. If you have any questions, we will be at the Microsoft booth at different points today and tomorrow.
So, yeah. Come find us. Thank you.