AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs


Transcript

Three developments in the last 48 hours show how we are moving into an era in which AI models can walk the walk, not just talk the talk. Whether the developments quite meet the hype attached to them is another question. I've read and analysed in full the three relevant papers and associated posts to find out more.

We'll first explore Devin, the AI system your boss told you not to worry about. Then Google DeepMind's SIMA, which spends most of its time playing video games. And then Figure 01, the humanoid robot which likes to talk while doing the dishes. But the TL;DW is this: these three systems are each a long way from human performance in their domains, but think of them more as containers or shells for the vision-language models powering them.

So when the GPT-4 that's behind most of them is swapped out for GPT-5 or Gemini 2, all these systems are going to see big and hard-to-predict upgrades overnight. And that's a point that seems especially relevant on this, the one-year anniversary of the release of GPT-4. But let's start, of course, with Devin, billed as the first AI software engineer.

Now Devin isn't a model, it's a system that's likely based on GPT-4. It's equipped with a code editor, shell, and browser. So of course it can not only understand your prompt but also look up and read documentation. A bit like AutoGPT, it's designed to come up with plans first and then execute them, but it does so much better than AutoGPT did.
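
As an illustration only, and definitely not Cognition's actual code, here's a minimal sketch of what a plan-then-execute agent loop like that might look like. Every function name here is hypothetical:

```python
def parse_action(decision: str) -> tuple[str, str]:
    # Expect "tool: argument" from the model (a simplifying assumption).
    name, _, arg = decision.partition(":")
    return name.strip(), arg.strip()

def run_agent(task: str, llm, tools: dict, max_steps: int = 50) -> str:
    # Step 1: ask the model for a plan before touching any tools.
    plan = llm(f"Break this task into concrete steps:\n{task}")
    history = [f"TASK: {task}", f"PLAN: {plan}"]
    for _ in range(max_steps):
        # Step 2: the model picks its next action given everything so far.
        decision = llm(
            "Tools: shell, edit, browse. Reply 'tool: argument', "
            "or 'finish: answer' when done.\n" + "\n".join(history)
        )
        name, arg = parse_action(decision)
        if name == "finish":
            return arg
        # Step 3: execute the chosen tool and feed the result back in,
        # so the model can read documentation, run code, and revise its plan.
        result = tools[name](arg)
        history.append(f"{name}: {arg} -> {result}")
    return "ran out of steps"
```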

But before we get to the benchmark that everyone's talking about, let me show you a 30-second demonstration of Devin in action. All I had to do was send this blog post and a message to Devin. From there, Devin actually does all the work for me, starting with reading this blog post and figuring out how to run the code.

In a couple of minutes, Devin's actually made a lot of progress. And if we jump to the middle here, you can see that Devin's been able to find and fix some edge cases and bugs that the blog post did not cover for me. And if we jump to the end, we can see that Devin sends me the final result, which I love.

I also got two bonus images here and here. So let me know if you guys see anything hidden in these. It can also fine-tune a model autonomously. And if you're not familiar, think of that as refining a model rather than training it from scratch. That makes me wonder about a future where, if a model can't succeed at a task, it fine-tunes another model or itself until it can.
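
That's pure speculation on my part, but here's a toy sketch of what such a loop could look like. None of this is Devin's actual mechanism; solve, check, make_training_examples, and fine_tune are all hypothetical stand-ins:

```python
# Speculative "fine-tune until it succeeds" loop (my illustration, not Devin's).
def improve_until_solved(task, model, max_rounds: int = 3):
    for _ in range(max_rounds):
        attempt = model.solve(task)
        if task.check(attempt):  # e.g. run the task's test suite
            return attempt
        # Build a small dataset from the failure: the task, the bad attempt,
        # and whatever error output the checker produced.
        examples = make_training_examples(task, attempt)  # hypothetical helper
        model = model.fine_tune(examples)  # refine weights, not train from scratch
    return None  # gave up after max_rounds
```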

Anyway, this is the benchmark that everyone's talking about: SWE-bench, the software engineering benchmark. Devin got almost 14%, and in this chart crushes Claude 2 and GPT-4, the latter of which got just 1.7%. They say Devin was unassisted, whereas all other models were assisted, meaning the model was told exactly which files needed to be edited.

Before we get too much further though, what the hell is this benchmark? Well, unlike many benchmarks, they drew from real-world professional problems: 2,294 software engineering problems that people had, and their corresponding solutions. Resolving these issues requires understanding and coordinating changes across multiple functions, classes, and files simultaneously. The code involved might require the model to process extremely long contexts and perform, they say, complex reasoning.

These aren't just fill-in-the-blank or multiple-choice questions. The model has to understand the issue, read through the relevant parts of the code base, remove lines, and add lines. Fixing a bug might involve navigating a large repo, understanding the interplay between functions in different files, or spotting a small error in convoluted code.
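
To give a feel for one of these tasks, here's roughly the shape of a single SWE-bench record. The field names follow my reading of the public dataset and should be treated as illustrative; the ID and snippets are invented:

```python
# Rough shape of one SWE-bench task instance (field names as I understand the
# public dataset; the values below are made up for illustration).
example_instance = {
    "repo": "astropy/astropy",                # a real open-source project
    "instance_id": "astropy__astropy-12345",  # hypothetical ID
    "base_commit": "abc123",                  # repo state the model starts from
    "problem_statement": "Units lose precision when ...",  # the GitHub issue text
    "patch": "diff --git a/astropy/units/core.py ...",     # gold solution (hidden)
    "test_patch": "diff --git a/astropy/units/tests/ ...", # new tests that must pass
}
# The model sees the issue and the repository, and must produce its own patch;
# success means the newly introduced tests pass once the patch is applied.
```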

On average, a model might need to edit almost two files, three functions, and about 33 lines of code. But one point to make clear is that Devin was only tested on a subset of this benchmark, and the tasks in the benchmark were only a tiny subset of GitHub issues.

And even all of those issues represent just a subset of the skills of software engineering. So when you see all-caps videos saying this is AGI, you've got to put it in some context. Here's just one example of what I mean: they selected only pull requests (which are like proposed solutions) that were merged or accepted, that solved the issue, and that introduced new tests.

Would that not slightly bias the dataset toward problems that are easier to detect, report, and fix? In other words, complex issues might not be adequately represented if they're less likely to have straightforward solutions. And narrowing down the proposed solutions to only those that introduce new tests could bias towards bugs or features that are easier to write tests for.

That is to say that highly complex issues, where writing a clear test is difficult, may be underrepresented. Now, having said all of that, I might shock you by saying I think that there will be rapid improvement in the performance on this benchmark. When Devin is equipped with GPT-5, I could see it easily exceeding 50%.

Here are just a few reasons why. First, some of these problems contained images, and therefore the more multimodal these language models get, the better they'll get. Second, and more importantly, a large context window is particularly crucial for this task. When the benchmark came out, they said models are simply ineffective at localizing problematic code in a sea of tokens.

They get distracted by additional context. I don't think that will be true for much longer as we've already seen with Gemini 1.5. Third reason, models, they say, are often trained using standard code files and likely rarely see patch files. I would bet that GPT-5 would have seen everything. Fourth, language models will be augmented, they predict, with program analysis and software engineering tools.

And it's almost like they could see six months in the future because they said, "To this end, we are particularly excited about agent-based approaches like Devin for identifying relevant context from a code base." I could go on, but hopefully that background on the benchmark allows you to put the rest of what I'm going to say in a bit more context.

And yes, of course, I saw how Devin was able to complete a real job on Upwork. Honestly, I could see these kinds of tasks going the way of copywriting tasks on Upwork. Here's some more context though. We don't know the actual cost of running Devin for so long. It actually takes quite a while for it to execute on its task.

We're talking 15, 20, 30 minutes, even 60 minutes sometimes. As Bindu Reddy points out, it can get even more expensive than a human, although costs are, of course, falling. Devin, she says, will not be replacing any software engineer in the near term. And noted deep learning author François Chollet predicted this:

There will be more software engineers, the kind that write code, in five years than there are today. And newly unemployed Andrej Karpathy says that software engineering is on track to change substantially, with humans more supervising the automation, pitching in high-level commands, ideas, or progression strategies in English. I would say, with the way things are going, they could pitch it in any language and the model will understand.

Frankly, with vision models the way they are, you could practically mime your code idea and it would understand what to do. And while Devin likely relies on GPT-4, other competitors are training their own frontier scale models. Indeed, the startup Magic, which aims to build a co-worker, not just a co-pilot for developers, is going a step further.

They're not even using transformers. They say transformers aren't the final architecture; we have something with a multi-million-token context window. Super curious, of course, how that performs on SWE-bench. But the thing I want to emphasize again comes from Bloomberg: Cognition AI admit that Devin is very dependent on the underlying models and uses GPT-4 together with reinforcement learning techniques.

Obviously, that's pretty vague, but imagine when GPT-5 comes out. With scale, you get so many things, not just better coding ability. If you remember, GPT-3 couldn't actually reflect effectively, whereas GPT-4 could. If GPT-5 is twice or 10 times better at reflecting and debugging, that is going to dramatically change the performance of the Devin system overnight.
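
To see why reflection matters so much for a coding agent, here's a toy reflect-and-retry loop, the generic pattern rather than Cognition's actual pipeline: generate code, run it, feed the traceback back in, and try again. A model that reflects well turns each failed run into a better next attempt:

```python
# Toy reflect-and-retry loop (the generic pattern, not Cognition's pipeline).
import subprocess
import tempfile

def solve_with_reflection(task: str, llm, attempts: int = 5):
    feedback = ""
    for _ in range(attempts):
        code = llm(task + feedback)
        # Write the candidate program to a temp file and actually run it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=30
        )
        if result.returncode == 0:
            return code  # ran cleanly; in practice you'd also run tests
        # The reflection step: show the model its own traceback and retry.
        feedback = f"\n\nYour last attempt failed with:\n{result.stderr}\nFix it."
    return None
```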

Just delete the GPT-4 API and put in the GPT-5 API. And wait, Jeff Clune, who I was going to talk about later in this video, has just retweeted one of my own videos. I literally just saw this two seconds ago when it came up as a notification on my Twitter account.

This was not at all supposed to be part of this video, but I am very much honored by that. And actually, I'm going to be talking about Jeff Clune later in this video. Chances are he's going to see this video, so this is getting very Inception-like. He was key to SIMA, which I'm going to talk about next.

The simulation hypothesis just got 10% more likely. I'm going to recover from that distraction and get back to this video, because there's one more thing to mention about Devin. The reaction to that system has been unlike almost anything I've seen. People are genuinely in some distress about the implications for jobs.

And while I've given the context of what the benchmark does mean and doesn't mean, I can't deny that the job landscape is incredibly unpredictable at the moment. Indeed, I can't see it ever not being unpredictable. I actually still have a lot of optimism about there still being a human economy in the future, but maybe that's a topic for another video.

I just want to acknowledge that people are scared and these companies should start addressing those fears. And I know many of you are getting ready to comment that we want all jobs to go, but you might be, I guess, disappointed by the fact that Cognition AI are asking for people to apply to join them.

So obviously they don't anticipate Devin automating everything just yet. But it's time now to talk about Google DeepMind's SIMA, which is all about scaling up agents that you can instruct with natural language: essentially a Scalable Instructable Multiworld Agent. The goal of SIMA is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment.

Their agent uses a mouse and keyboard and takes pixels as input. But if you think about it, that's almost everything you do on a computer. Yes, this paper is about playing games, but couldn't you apply this technique to, say, video editing, or anything you can do on your phone?
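
To spell out just how general that interface is, here's a minimal sketch of the contract in my own framing, not DeepMind's code; the types and field names are invented for illustration:

```python
# Sketch of a SIMA-style agent contract (my framing, not DeepMind's code):
# pixels and a text instruction in, keyboard and mouse events out.
from dataclasses import dataclass

@dataclass
class Observation:
    screen_pixels: bytes  # the raw frame, exactly what a human player sees
    instruction: str      # e.g. "chop down the tree", in natural language

@dataclass
class Action:
    keys: list[str]   # keyboard presses, e.g. ["w", "space"]
    mouse_dx: float   # relative mouse movement
    mouse_dy: float
    click: bool

def act(obs: Observation) -> Action:
    ...  # the learned policy goes here

# Nothing above is game-specific, which is why the same interface could in
# principle drive video editing software or a phone UI, not just games.
```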

Now, I know I haven't even told you what the SIMA system is, but I'm giving you an idea of the kind of repercussions and implications: if these systems work with games, there's so much else they might soon work with. This was a paper I didn't get a chance to talk about that came out about six weeks ago.

It showed that even current-generation models could handle tasks on a phone, like navigating in Google Maps, downloading apps on Google Play, or, somewhat topically, swiping to a video about a pet cat on TikTok and clicking a like for that video. No, the success rates weren't perfect, but if you look at the averages, and this is for GPT-4 Vision, they are pretty high: 91%, 82%, 82%.

These numbers in the middle, by the way, reflect on the left the number of steps that GPT-4 Vision took, and on the right the number of steps that a human took. And that's just GPT-4 Vision, not a model optimized for agency, which we know OpenAI is working on.

So before we even get to video games, you can imagine an internet where there are models downloading, liking, commenting, doing pull requests, and we wouldn't even know that it's AI. It would be, as far as I can tell, undetectable. Anyway, I'm getting distracted. Back to the SIMA paper.

What is SIMA? In a nutshell, they got a bunch of games, including commercial video games like Valheim, 12 million copies sold at least, and made-up games that Google created themselves. They then paid a bunch of humans to play those games and gathered the data. That's what you can see on the screen: the images and the keyboard and mouse inputs that the humans performed.

They gave all of that training data to some pre-trained models. And at this point, the paper gets quite vague. It doesn't mention parameters or the exact composition of these pre-trained models. But from this, we get the SIMA agent, which then plays these games, or more precisely, tries 10-second tasks within these games.
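
Pairing what the human saw with what the human pressed is, in essence, behavioral cloning: supervised learning from demonstrations. Here's a minimal PyTorch-style sketch of one such training step, under the assumption that something like this is the recipe; the paper itself stays vague on architecture and losses:

```python
# Minimal behavioral-cloning step (an assumption about the recipe; the paper
# doesn't give exact details). Illustrative PyTorch.
import torch.nn.functional as F

def bc_training_step(policy, optimizer, batch):
    # batch: the frames the human saw, the text instruction they were given,
    # and the keyboard/mouse actions they actually took.
    frames, instructions, human_actions = batch
    predicted = policy(frames, instructions)          # logits over the action space
    loss = F.cross_entropy(predicted, human_actions)  # imitate the human's choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```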

This gives you an idea of the range of tasks: everything from taming and hunting to destroying and headbutting. But I don't want to bury the lede. The main takeaway is this: training on more games saw positive transfer when SIMA played a new game. And notice how SIMA, in purple, across all of these games outperforms an environment-specialized agent.

That's one trained for just one game. And there is another gem buried in this graph. I'm colorblind, but I'm pretty sure that's teal or lighter blue: that's zero-shot. What that represents is when the model was trained across all the other games, bar the actual game it was about to be tested in.

And so notice how in some games, like Goat Simulator 3, that outperformed a model that was specialized for just that one game. The transfer effect was so powerful, it outdid the specialized training. Indeed, SIMA's performance is approaching the ballpark of human performance. Now, I know we've seen that already with AlphaStar in StarCraft II and OpenAI Five beating Dota 2, but this would be a model generalizing to almost any video game.

Yes, even Red Dead Redemption 2, which was covered in an entirely separate paper out of Beijing. That paper, they say, was the first to enable language models to follow the main storyline and finish real missions in complex AAA games. This time we're talking about things like protecting a character, buying supplies, equipping shotguns.

Again, what was holding them back was the underlying model, GPT-4V. As I've covered elsewhere on the channel, it's lacking in spatial perception; it's not super accurate with moving the cursor, for example. But visual understanding and performance are getting better fast. Take the challenging benchmark MMMU: it's about answering difficult questions that have a visual component.

The benchmark only came out recently, giving top performance to GPT-4V at 56.8%, but that's already been superseded: take Claude 3 Opus, which gets 59.4%. Yes, there is still a gap with human expert performance, but that gap is narrowing, like we've seen across this video. Just like Devin was solving real-world software engineering challenges, SIMA and other models are solving real-world games.

Walking the walk, not just talking the talk. And again, we can expect better and better results the more games SIMA is trained on. As the paper says, in every case SIMA significantly outperforms the environment-specialized agent, thus demonstrating positive transfer across environments. And this is exactly what we see in robotics as well.

The key take-home from that Google DeepMind paper was that their results suggest that co-training with data from other platforms imbues RT-2-X with additional skills that were not present in the original dataset, enabling it to perform novel tasks. These were tasks and skills developed by other robots that were then transferred to RT-2-X, just like SIMA getting better at one video game by training on others.

But did you notice there that smooth segue I did to robotics? It's the final container that I want to quickly talk about. Why do I call this humanoid robot a container? Because it contains GPT-4 Vision. Yes, of course, its real-time speed and dexterity are very impressive, but that intelligence of recognizing what's on the table and moving it appropriately comes from the underlying model, GPT-4 Vision.

So, of course, I have to make the same point: the underlying model could easily be upgraded to GPT-5 when it comes out. This humanoid would have a much deeper understanding of its environment, and of you as you're talking to it. Figure 01 takes in 10 images per second, and this is not teleoperation. It's an end-to-end neural network; in other words, there's no human behind the scenes controlling this robot.
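
To picture what that might mean in practice, here's a cartoon of how such a stack could be wired. This is only my guess from public statements (roughly 10 images per second into a vision-language model, learned policies, no teleoperation); every name and rate here is an assumption:

```python
# Cartoon of a Figure-01-style control stack; wiring and rates are assumptions.
import time

def control_loop(camera, vlm, policy, send_to_motors, speaker):
    latest_plan = None
    while True:
        frame = camera.read()  # roughly 10 images per second
        # The vision-language model interprets the scene and the conversation;
        # it can produce speech while the robot keeps working.
        plan, speech = vlm.update(frame)
        if speech:
            speaker.say(speech)
        latest_plan = plan or latest_plan
        # A learned low-level policy turns the current plan plus the latest
        # frame into motor commands; on real hardware this part would run at
        # a much higher rate than the vision loop.
        send_to_motors(policy(frame, latest_plan))
        time.sleep(0.1)  # ~10 Hz, matching the image stream
```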

Figure don't release pricing, but the estimate is between $30,000 and $150,000 per robot. Still too pricey for most companies and individuals, but the CEO has a striking vision: he basically wants to completely automate manual labor.

This is the roadmap to a positive future powered by AI. He wants to build the largest company on the planet and eliminate the need for unsafe and undesirable jobs. The obvious question is if it can do those jobs, can't it also do the safe and desirable jobs? I know I'm back to the jobs point again, but all of these questions became a bit more relevant, let's say, in the last 48 hours.

The Figure CEO goes on to predict that everywhere from factories to farmland, the cost of labor will decrease until it becomes equivalent to the price of renting a robot, facilitating a long-term holistic reduction in costs. Over time, humans could leave the loop altogether as robots become capable of building other robots, driving prices down even more.

Manual labor, he says, could become optional. And if that's not a big enough vision for the next two decades, he goes on that the plan is also to use these robots to build new worlds on other planets. Again, though, we get the reassurance that our focus is on providing resources for jobs that humans don't want to perform.

He also excludes military applications. I just feel like his company, and the world, has a bit less control over how the technology is going to be used than he might think it does. Indeed, Jeff Clune, of OpenAI, of Google DeepMind's SIMA, and, earlier on in this video, of retweet fame, reposted this from Edward Harris.

It was a report commissioned by the US government that he worked on, and the TL;DR was that things are worse than we thought and nobody's in control. I definitely feel we're noticeably closer to AGI this week than we were last week. As Jeff Clune put out yesterday, so many pieces of the AGI puzzle are coming together.

And I would also agree that, as of today, no one's really in control. And we're not alone, with Jensen Huang, the CEO of NVIDIA, saying that AI will pass every human test in around five years' time. That, by the way, is a timeline shared by Sam Altman. This is a quote from a book that's coming out soon.

He was asked about what AGI means for marketers. He said, "Oh, for that, it will mean that 95% of what marketers use agencies, strategists, and creative professionals for today will easily, nearly instantly, and at almost no cost be handled by the AI. And the AI will likely be able to test its creative outputs against real or synthetic customer focus groups for predicting results and optimizing.

Again, all free, instant, and nearly perfect. Images, videos, campaign ideas, no problem." But specifically on timelines, when asked about when AGI will be a reality, he said, "Five years, give or take, maybe slightly longer, but no one knows exactly when or what it will mean for society." And it's not like that timeline is even unrealistic in terms of compute.

Using these estimates from SemiAnalysis, I calculated that just between the first quarter of 2024 and the fourth quarter of 2025, there will be a 14x increase in compute. Then, if you factor in algorithmic efficiency doubling about every nine months, the effective compute at the end of next year will be almost a hundred times that of right now.
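
Here's that back-of-the-envelope arithmetic, using only the numbers just quoted; the 24-month window (start of Q1 2024 to end of Q4 2025) and the nine-month doubling are the stated assumptions:

```python
# Back-of-the-envelope check of the effective-compute claim.
hardware_growth = 14   # 14x more raw compute by Q4 2025 (SemiAnalysis estimate)
months = 24            # start of Q1 2024 to end of Q4 2025
doubling_period = 9    # algorithmic efficiency doubles roughly every 9 months

efficiency_gain = 2 ** (months / doubling_period)  # ~6.3x from algorithms alone
effective_compute = hardware_growth * efficiency_gain

print(f"{effective_compute:.0f}x")  # ~89x, i.e. "almost a hundred times"
```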

So yes, the world is changing and changing fast and the public really need to start paying attention. But no, Devin is not AGI, no matter how much you put it in all caps. Thank you so much for watching to the end. And of course, I'd love to see you over on AI Insiders on Patreon.

I'd love to see you there, but regardless, thank you so much for watching and as always have a wonderful day.