OpenAI on Securing Code-Executing AI Agents — Fouad Matin (Codex, Agent Robustness)

- - Hi everyone, I'm Fouad, and I'm here to talk about safety and security for code-executing agents. And a little intro about myself. I actually started on the OpenAI security team after running a startup for about six years, a security company, and now I work on agent robustness and control as part of post training.

One of the things I did in the last couple of months is work on Codex and Codex CLI, which is our open source library for actually running Codex directly on your computer. And there's a lot of things we learned in building Codex that I'm excited to share with you all, but there's definitely a lot more work for us to do, and excited to hear what you think afterwards.

One high level point I want to start with is that every frontier research lab is focusing on how to push the benchmarks around coding, and not just the benchmarks, but also usability and actually deployability of these agents. So they're making them really good at writing and executing code. And as a result, every agent will become a code-executing agent.

It's not just actually about writing code, but it's about achieving the objective most efficiently. And if you look at where the models were just even a year ago, or under a year ago with 01, it showed us a very early preview of what these recent models can do. But with more recent models like 03, 04 Mini, and other models in the space, you can see a higher reliability and more capabilities.

And now the new constraint isn't just, can these models do things, but actually what should they be able to do, and what should the guardrails be when you allow them to work in your environments? And as I mentioned, code isn't just for SWE tasks, which is kind of what I thought initially when I started at OpenAI, but it actually helps across the stack.

Here's an example from our 03 release around multimodal reasoning, where previously 01 would look at the image and try to just reason about it based on the image as it's given. But what we've noticed with code-executing agents, even outside of a SWE scenario, they're able to actually run code to decipher the text that's on the page using OCR, or to crop images.

There's some really exciting behaviors that we've seen from models when you just give them the ability to run code. We didn't tell it in this prompt that it should run code. It just knew that with that tool as an option, it's able to do it more efficiently. And what we'll, I think, observe when it comes to building AI agents is this shift from the kind of complex inner loop, where you have a model that might determine what type of task the user is asking for given a prompt.

You'll then load a more task-specific prompt and tool set. You'll then chain a bunch of these loops together in order to achieve some sort of goal. Maybe just ask the model, hey, are you done yet? Or to keep going. And then finally, use another model to respond back to the user.

Generally, we don't need these anymore. You can actually just have the model decide when it should use which tools and when it should write or run code. And it can just write and run that code on its own. Now, that's what we in security would call a RCE or remote code exploitation.

So when we're looking at these new behaviors, it's important to consider not just the capabilities, but also how do we ensure that those capabilities are not going to backfire on us when we allow it to be able to perform those operations. And there's a couple of different ways that we've observed how models can go wrong.

And the most common ones that we think about consistently is prompt injection and data exfiltration. There's a lot of different examples that we'll be documenting in the coming months. But that's probably number one in our priority queue. But then you also have things like the agent just makes a mistake.

It just does something wrong. Maybe it installs a malicious package unintentionally. Or it writes vulnerable code, again, unintentionally. Or you have privilege escalation or sandbox escape. And when we think about our responsibility of deploying these agents both internally and externally, we have this preparedness framework where we document some of the recommendations and also some of the standards that we hold ourselves to.

But one of the ones that I want to emphasize is requiring safeguards to avoid misalignment at large scale deployment. And this is something that we think about ourselves when we are building Codex, but also something that organizations, as you deploy coding agents into your workplace, that you should also be considering.

And one of the first safeguards that we put in place is to sandbox the agent, especially if you're running it locally. Generally, the best method is just to give it its own computer. That's what we did with Codex and ChatGPT. It spins up a container, fully isolated. It then produces a PR at the end.

That's practically as safe as you can get. But if you are going to run it locally, which, of course, with Codex CLI, we also encourage, making sure that you're actually providing the correct level of sandboxing, whether it's containerization or it's using app-level sandboxing, which we'll talk about in a moment, or OS-level sandboxing, making sure that you're providing the right guardrails for the model, even if it does attempt to do something wrong.

Related to that is disabling or limiting internet access. And this is probably the highest probability vector of prompt injection or data exfil. You know, the model goes to read some sort of docs or reads a GitHub issue. And then in a comment of that GitHub issue, maybe there's a prompt injection.

That kind of untrusted content can leak into the kind of core inner loop that you would trust an agent to run code in. And if it has access to your code base or other sensitive materials, that could be pretty bad. And then finally, reviewing all of these operations or the actual final diffs that the agents perform, whether it's code review in a GitHub PR or it's approvals and confirmations, and those guardrails are actually really important.

Ensuring that humans stay in control of these systems is one of the strongest mitigations that we have. But of course, no one wants to sit there and keep clicking approved. So avoiding the kind of YOLO mode on one end to having to approve every single LS command is not practical either.

So let's talk a little bit about how do we actually achieve this. So I mentioned our recommendation is to give the agent its own computer. You see this in Codex and ChatGPT. There's a lot of different constraints that you need to apply when you think about that, making sure that the agent has all of the dependencies installed, all of the different access it needs to perform its actions.

And if you want to run it locally, being able to use something like Codex CLI, which we fully open sourced, for you to be able to build out these agents yourself, you can use this as a reference point. That's part of why we wanted to open source it, is really showcase not only here's the agent that we built for you, but also here's how you can build your own.

And as I mentioned, fully open sourced, you can actually use these, in this case, macOS or Linux sandboxing techniques. And as an example, here is a portion of the macOS sandboxing policy. This uses a language called Seatbelt and that Apple bundles into operating systems since Leopard. It's kind of somewhat hard to find documentation for.

So this is definitely an area where using both our models, using deep research to actually understand what are the bounds of different examples that people have created. We were heavily inspired by Chromium, which also uses Seatbelt as a sandboxing mechanism on macOS. And then separately, you'll actually notice this is now in Rust, where we actually tapped into our own security teams to build out our Linux sandboxing and run it, in this case, using both Seccomp and Landlock in order to be able to-- I think we'll do maybe questions afterwards.

But in order to have unprivileged sandbox and prevent escalation. And then next, we have disabling internet access. This is really important when it comes to prompt injection, which, again, is a primary exfil risk. And we have two methods-- well, actually, before I get into that, we have two methods, both in Codex and ChaiGPT, but also within CLI, we actually have this full auto mode, where effectively what we did was define a sandbox where it can only read and write files within the directory that it's run in.

It can only make network calls based on commands that you ought to approve it for. But otherwise, it just runs in this kind of fully sandbox and lockdown environment that allows the agent to be able to go and test-- run PyTest, run NPM test, but not actually have some second order consequences.

And then when it comes to Codex and ChaiGPT, we actually just launched this yesterday or two days ago, maybe. But you can now turn on internet access, but it comes with a set of configurable allow lists. This is really important when you consider either using or building agents yourself, ensuring that you have both the kind of maximum security option and also this more flexible option so people can define whatever policy that makes sense for their use case.

And in here, we even define which HTTP methods are allowed, including a warning, letting you know about the risks. Just to give you an example-- and we actually linked to this from those docs-- let's say my prompt is to fix this issue, and I just linked to a GitHub issue.

And it seems pretty innocuous, but in that GitHub issue, which could be in a user-generated content, go ahead and grab the last commit and go ahead and post that to this random URL. And because Codex is really trained with instruction following and it tries to do exactly what you ask, it'll go ahead and do that.

Now, a way that we can control that is both at the model level and flagging things that seem like they could be suspicious, and that's definitely an area when it comes to model training that we're actually focusing. But ultimately, your most deterministic and authoritative control is going to be a system-level control.

It shouldn't even be able to make a call to HTTP bin in this case. So combining those model-level controls along with your system-level configurations is really key to solving this problem. And finally, there's requiring human review. Now, this is something that I see a lot of tension when it comes to folks who are using LMs and coding agents, is that you have this new problem when you're prompting these agents, is there's just so much code that you end up having to review.

Using tools like other PR review or kind of code review tools and using LMs as part of the loop, while useful, is not a substitute for a human actually going in and reviewing the operations that the model is about to perform. Ensuring that you're not having a model that might have installed a package that maybe is not as well-known or it's maybe off by one character, and ensuring that that doesn't land in your code base that then later gets run in an unprivileged environment-- or excuse me, in a privileged environment.

And then we also have, again, since this doesn't just apply to coding agents, we also have operator as an example, where there's different techniques you can use. In this case, we have both a domain list and also a monitor that is in the loop identifying any kind of potential sensitive operations that a model might go out and do on your behalf.

And we have this monitoring task and watch mode, as we call it, where we ensure that a human is actually reviewing any kind of actions that it can take. So again, balancing the maximum security with the maximum flexibility is really important here. And so as an example of how to think about actually building these agents, effectively, where previously you might have had a loop that is doing a bunch of different elements of software-based logic, now you can actually just defer most of that logic to the reasoning model and give it the right tools to accomplish the task.

We released this exec tool on local shell, as it's called, in the API, where it actually is exactly the way that we train our models to be able to write and execute code. We also released tools like apply patch, which models aren't particularly good at getting line numbers correct for like a get diff.

So we provided this new format for actually applying diffs to files. But then, of course, you can hear more standard tools, things like MCP, web search. I'm actually going to give an example of how you can use these in combination. So let's say Socket, which is a dependency vulnerability checking service, now has an MCP server.

You can expose that to the agent to then go in and verify whether or not a given dependency it's about to install could be vulnerable or suspicious, and ensure that the model either as part of its own operations or you can apply a system level check after the rollout has completed to make sure that any dependencies it's going to install are actually safe to do so.

But again, one thing we'd emphasize is to use a remote container. We are releasing a container service as part of our agent's SDK and as part of a responses API. And so you can either run it locally or run it in your own environment or you can let OpenAI kind of host it for you.

And so as a recap, I would strongly recommend sandboxing these agents, whether it's through containerization or it's through OS level sandboxing, disabling and limiting Internet. I think that balance between capability where you want to be able to let it just run and do its own thing for as long as possible, which you can do when it's fully network disabled versus I wanted to go out and read docs.

I want to go install packages. We give you that flexibility, but being really thoughtful about when you employ each and then finally requiring human review. This is definitely an area where we expect there to be a lot more research and employing monitors and LM based monitors in the loop.

While valuable is not quite there yet in terms of the kind of certainty that you get from, again, a deterministic control. And so in that vein, there is more tooling that we plan to release here. So stay tuned in the Codex repo on the OpenAI org. And there's also more documentation that we plan to publish around both the ML based interventions and the systems controls.

And if you're interested in working on problems like this, we are hiring for this new team, agent robustness and control. And so if you also write Rust, we are also hiring for the Codex CLI to build out more of those integrations and making sure that everyone can benefit from them.

So if you're interested or you know someone who would be interested, definitely let us know. But with that, thank you so much. Thank you. Thank you. Thank you. Thank you. Thank you. We'll see you next time.

OpenAI on Securing Code-Executing AI Agents — Fouad Matin (Codex, Agent Robustness)

Chapters

Transcript