OpenAI on Securing Code-Executing AI Agents — Fouad Matin (Codex, Agent Robustness)

Chapters
0:00 Introduction to Code-Executing Agents
2:29 Shifting Paradigm in AI Agent Building
3:07 Security Concerns with Code Execution
4:25 Safety Safeguards: Sandboxing
5:02 Safety Safeguards: Disabling/Limiting Internet Access
9:44 Safety Safeguards: Human Review
11:19 Building Agents and Future Work
I'm here to talk about safety and security. I actually started on the OpenAI security team, coming from a security company, and now I work on agent robustness. One of the things I worked on in the last couple of months is the Codex CLI, for actually running Codex directly on your computer. There's a lot of things we learned in building Codex, but there's definitely a lot more work for us to do, and I'm excited to hear what you think afterwards.
One high-level point I want to start with is that the models are getting meaningfully better at writing and running code, and not just on the benchmarks, but also in usability. It's not just about writing code; it's about achieving the objective most efficiently. If you look at where the models were just under a year ago with o1, it showed us a very early preview of these capabilities. But with more recent models like o3 and o4-mini, you can see higher reliability and more capabilities. So what can these models do when you allow them to work in your environments?
And as I mentioned, code isn't just for SWE tasks. What we've noticed with code-executing agents is that, when you just give them the ability to run code, they'll do things like write a quick script to decipher the text that's on the page using OCR. We didn't tell it in this prompt that it should run code. It just knew that, with that tool as an option, code was the most efficient way to get the job done.

The other shift we're seeing is away from the kind of complex inner loop you used to build by hand: first classify what type of task the user is asking for given a prompt, then load a more task-specific prompt and tool set, then chain a bunch of these loops together, and maybe just ask the model, hey, are you done yet? Now you can hand the model an objective and a code-execution tool, and it can just write and run that code on its own.
Now, that's what we in security would call an RCE, or remote code execution. So when we're looking at these new behaviors, it's important to consider not just the capabilities, but also how we ensure that those capabilities are not going to backfire on us when we allow the agent to perform actions in our environments.

There's a couple of different ways that we've observed this go wrong. The most common one that we think about consistently is prompt injection, which we'll be documenting more about in the coming months; that's probably number one in our priority queue. But then you also have things like the agent just making mistakes. Maybe it installs a malicious package unintentionally. Or it writes vulnerable code, again, unintentionally. Or you have privilege escalation or sandbox escape.
When we think about our responsibility in deploying these agents, we document some of the recommendations and also some of the standards that we hold ourselves to. One of those standards is requiring safeguards to avoid misalignment. This is something we think about ourselves when we're building Codex, but it's also something that organizations, as you deploy coding agents into your workplace, should be considering.
One of the first safeguards that we put in place is to sandbox the agent, especially if you're running it in your own environment. Generally, the best method is just to give it its own computer. But if you are going to run it locally, which of course many people want to do, make sure that you're actually providing the correct level of isolation, whether that's containerization, which we'll talk about in a moment, or OS-level sandboxing, so that the right guardrails are there for the model even if it does attempt to do something wrong.
Related to that is disabling or limiting internet access. This is probably the highest-probability vector for prompt injection: the model goes to read some sort of docs page, and that kind of untrusted content can leak into the core inner loop that you would trust an agent to run code in. If it also has access to your code base or other sensitive data, that's a real risk.

And then finally, there's reviewing all of these operations, or the actual final diffs that the agents produce. Those guardrails are really important. Ensuring that humans stay in control of these systems is one of the strongest mitigations that we have.
Giving the agent its own computer does come with setup work: making sure that the agent has all of the dependencies installed and all of the different access it needs. We want you to be able to build out these agents yourself, and that's part of why we wanted to open source the CLI.

As an example, here is a portion of the macOS sandboxing policy we use. It's based on Seatbelt, which Apple bundles into its operating systems and which is somewhat hard to find documentation for, so this is definitely an area where leaning on our own models and the sparse docs that do exist helps. We were heavily inspired by Chromium, which also uses Seatbelt as a sandboxing mechanism on macOS. For Linux, and this is now in Rust, we actually tapped into our own security teams to build out our Linux sandboxing, in this case using both seccomp and Landlock to restrict what the agent can do.
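To make the OS-level idea concrete, here is a minimal sketch, not the Codex CLI's actual Rust implementation, of wrapping an agent-issued command so it cannot reach the network: on macOS via a tiny Seatbelt profile passed to sandbox-exec, and on Linux via an unshared network namespace as a simplified stand-in for the seccomp-plus-Landlock setup described above. The profile string and the unshare flags are illustrative assumptions.

```python
# Illustrative sketch only: the Codex CLI implements this in Rust with
# Seatbelt on macOS and seccomp + Landlock on Linux. Here we approximate
# the same "no network" idea with sandbox-exec and a network namespace.
import platform
import subprocess

# Minimal Seatbelt (SBPL) profile: allow everything except network access.
SEATBELT_NO_NET = "(version 1) (allow default) (deny network*)"

def run_without_network(cmd: list[str]) -> subprocess.CompletedProcess:
    system = platform.system()
    if system == "Darwin":
        # sandbox-exec ships with macOS (deprecated, but still present).
        wrapped = ["sandbox-exec", "-p", SEATBELT_NO_NET, *cmd]
    elif system == "Linux":
        # unshare -rn gives the command a fresh network namespace with no
        # route to the outside world (needs unprivileged user namespaces).
        wrapped = ["unshare", "-rn", *cmd]
    else:
        raise RuntimeError(f"no sandbox wrapper configured for {system}")
    return subprocess.run(wrapped, capture_output=True, text=True)

if __name__ == "__main__":
    result = run_without_network(["curl", "-sS", "https://example.com"])
    print(result.returncode)  # non-zero: the network call was blocked
```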
Next, we have disabling internet access. This is really important when it comes to prompt injection. We have two approaches: Codex in ChatGPT, and the CLI. In the CLI we have this full-auto mode, where effectively what we did was define a sandbox where the agent can only read and write files within the directory that it's run in, and it can only make network calls for commands that you authorize. Otherwise, it just runs in this fully sandboxed and locked-down environment that allows the agent to go and test, run pytest, run npm test, but not actually have second-order effects.
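As a rough sketch of the filesystem side of that policy (the function name here is hypothetical, not the Codex CLI's actual implementation), the check is essentially: resolve whatever path the agent asks for and refuse anything that lands outside the launch directory.

```python
# Sketch: confine agent reads/writes to the directory it was launched in.
from pathlib import Path

WORKSPACE = Path.cwd().resolve()

def path_allowed(requested: str) -> bool:
    """True only if `requested` resolves to a location inside WORKSPACE."""
    resolved = Path(requested).resolve()
    return resolved == WORKSPACE or WORKSPACE in resolved.parents

assert path_allowed("./src/main.py")
assert not path_allowed("/etc/passwd")
assert not path_allowed("../other-project/.env")  # traversal is blocked
```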
Then when it comes to Codex in ChatGPT, we actually just launched this yesterday, or two days ago maybe: internet access is off by default, but it comes with a set of configurable allowlists. This is really important when you consider either using or building agents yourself: you want both the maximum-security option and also a more flexible option, so people can define whatever policy makes sense for their use case. In there, we even define which HTTP methods are allowed, including a warning letting you know about the risks.
Just to give you an example, and we actually linked to this from those docs: let's say my prompt is to fix this GitHub issue. It seems pretty innocuous, but inside that issue someone has added instructions to go grab the last commit and post it to an external URL. And because Codex is really trained for instruction following, it might just go and do that. One way we can control that is at the model level, flagging things that seem like they could be suspicious, and that's definitely an area we're investing in when it comes to model training. But ultimately, your most deterministic and authoritative control is going to be a system-level control: it shouldn't even be able to make a call to httpbin in this case.
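A system-level egress control of that kind can be sketched as follows. The policy fields and domain list are assumptions for illustration, not the actual Codex configuration schema; the point is that the injected POST to httpbin is rejected regardless of what the model decides.

```python
# Sketch: deterministic egress check applied outside the model.
from urllib.parse import urlparse

POLICY = {
    "allowed_domains": {"pypi.org", "files.pythonhosted.org", "github.com"},
    "allowed_methods": {"GET", "HEAD"},  # write methods carry exfiltration risk
}

def egress_allowed(method: str, url: str) -> bool:
    host = urlparse(url).hostname or ""
    domain_ok = any(host == d or host.endswith("." + d)
                    for d in POLICY["allowed_domains"])
    return domain_ok and method.upper() in POLICY["allowed_methods"]

print(egress_allowed("GET", "https://pypi.org/simple/requests/"))  # True
print(egress_allowed("POST", "https://httpbin.org/post"))          # False
```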
Now, this is something where I see a lot of tension among folks who are using LLMs and coding agents: you have this new problem when you're prompting these agents, which is that there's just so much code that you end up having to review. Using PR-review or code-review tools, and using LLMs as part of the loop, while useful, is not a substitute for a human actually going in and reviewing the operations that the model is about to perform. The agent might have installed a package that maybe is not as well-known, and you want to make sure that doesn't land in your code base and then later get run in a more privileged environment. This is an area where there are different techniques you can use.
In this case, we have both a domain list and also a monitor for potentially sensitive operations that a model might go out and perform. And we have this monitoring task, or watch mode as we call it, where we ensure that a human is actually reviewing any of those kinds of actions before they're taken. Giving people that control alongside the maximum flexibility is really important here.
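Here is a minimal sketch of that watch-mode idea: flag commands matching a small list of sensitive patterns and require an explicit human yes before running them. The patterns and prompt text are illustrative, not OpenAI's actual monitor.

```python
# Sketch: pause for human review on potentially sensitive agent actions.
import re

SENSITIVE_PATTERNS = [
    r"\b(pip|npm|cargo)\s+install\b",  # pulling in new dependencies
    r"\bcurl\b|\bwget\b",              # arbitrary network fetches
    r"\brm\s+-rf\b",                   # destructive deletes
    r"\bgit\s+push\b",                 # publishing changes
]

def needs_review(command: str) -> bool:
    return any(re.search(p, command) for p in SENSITIVE_PATTERNS)

def approved_by_human(command: str) -> bool:
    if not needs_review(command):
        return True
    answer = input(f"Agent wants to run: {command!r}. Allow? [y/N] ")
    return answer.strip().lower() == "y"
```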
A quick note about actually building these agents. Effectively, where previously you might have had a loop doing a bunch of different pieces of software-based logic, now you can just defer most of that logic to the reasoning model and give it the right tools. We released this exec tool, or local shell as it's called in the API, which is exactly the way that we train our models to write and execute code.
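A skeleton of that simplified loop might look like the following, where a stubbed `model_step` stands in for a call to a reasoning model with a shell tool enabled; the action schema here is an assumption for illustration, not the actual local shell API.

```python
# Sketch: the agent loop reduced to "ask the model, run what it proposes".
import subprocess

def model_step(transcript: list[dict]) -> dict:
    # Placeholder: a real implementation would call the model API here and
    # return {"type": "run", "command": [...]} or {"type": "done", "summary": ...}.
    return {"type": "done", "summary": "stub model, nothing to do"}

def agent_loop(task: str, max_steps: int = 20) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model_step(transcript)
        if action["type"] == "done":
            return action["summary"]
        # In practice this runs inside the sandbox and behind the
        # human-review checks described earlier.
        result = subprocess.run(action["command"], capture_output=True,
                                text=True, timeout=120)
        transcript.append({"role": "tool",
                           "content": result.stdout + result.stderr})
    return "step limit reached"

print(agent_loop("run the test suite and fix any failures"))
```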
We also released tools like apply_patch. Models aren't particularly good at getting line numbers right for something like a git diff, so we provided a new format for actually applying diffs to files.
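To illustrate why a format without line numbers is easier for models, here is a toy patch applier that locates the old text itself rather than trusting the model's line arithmetic. This is a simplification of the idea, not the actual apply_patch format.

```python
# Sketch: apply a patch given as (old text, new text) instead of line numbers.
from pathlib import Path

def apply_simple_patch(path: str, old: str, new: str) -> None:
    text = Path(path).read_text()
    if text.count(old) != 1:
        raise ValueError("patch context must match exactly once")
    Path(path).write_text(text.replace(old, new, 1))
```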
Then, of course, you have the more standard tools here, things like MCP and web search. I'll give an example of how you can use these in combination. Take Socket, which is a dependency vulnerability checking service. You can expose that to the agent to go verify whether a given dependency it's about to install could be vulnerable or suspicious, either as part of the model's own operations or as a system-level check you apply after the rollout has completed, to make sure that any dependencies it's going to install are actually safe.
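A post-rollout dependency gate of that kind might be sketched like this, with a naive requirements.txt diff parser and a hypothetical `check_advisories` function standing in for a service such as Socket.

```python
# Sketch: block a rollout if it adds dependencies that fail an advisory check.
import re

VETTED = {"requests", "numpy", "pytest"}  # placeholder for a real data source

def new_dependencies(diff_text: str) -> set[str]:
    added = set()
    for line in diff_text.splitlines():
        match = re.match(r"\+([A-Za-z0-9_.-]+)==", line)  # added pins
        if match:
            added.add(match.group(1).lower())
    return added

def check_advisories(package: str) -> bool:
    # Hypothetical stand-in for querying a dependency-intelligence service.
    return package in VETTED

def blocked_packages(diff_text: str) -> list[str]:
    return [pkg for pkg in new_dependencies(diff_text)
            if not check_advisories(pkg)]

diff = "+requests==2.32.0\n+totally-legit-pkg==0.0.1\n"
print(blocked_packages(diff))  # ['totally-legit-pkg'] -> hold for human review
```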
But again, one thing we'd emphasize is to use a remote container. We are releasing a container service as part of our Agents SDK and as part of the Responses API, so you can run it locally, run it in your own environment, or let OpenAI host it for you.
So as a recap, I would strongly recommend sandboxing these agents, whether through containerization or through OS-level sandboxing, and disabling or limiting internet access. There's a balance between capability, where you want to let the agent run on its own for as long as possible, which you can do when it's fully network-disabled, and use cases like "I want it to go out and read the docs." We give you that flexibility, but be really thoughtful about when you employ each. And finally, require human review. This is definitely an area where we expect a lot more research; employing monitors and LLM-based monitors in the loop, while valuable, is not quite there yet in terms of the certainty that you get from, again, a deterministic control.
In that vein, there is more tooling that we plan to release here, so stay tuned in the Codex repo on the OpenAI org. There's also more documentation that we plan to publish around both the ML-based interventions and the system-level controls. And if you're interested in working on problems like this, we are hiring for this new team, Agent Robustness and Control. If you also write Rust, we're hiring for the Codex CLI as well, to build out more of those integrations and make sure everyone can benefit from them. So if you're interested, or you know someone who would be, definitely let us know.