AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs
00:00:00.000 |
Three developments in the last 48 hours show how we are moving into an era in which AI models 00:00:06.240 |
can walk the walk, not just talk the talk. Whether the developments quite meet the hype 00:00:11.920 |
attached to them is another question. I've read and analysed in full the three relevant papers 00:00:17.360 |
and associated posts to find out more. We'll first explore Devin, the AI system your boss 00:00:22.960 |
told you not to worry about. Then Google DeepMind's SIMA, which spends most of its time 00:00:27.760 |
playing video games. And then Figure 01, the humanoid robot which likes to talk while doing 00:00:33.120 |
the dishes. But the TL;DW is this. These three systems are each a long way from human performance 00:00:40.880 |
in their domains, but think of them more as containers or shells for the vision language 00:00:46.400 |
models powering them. So when the GPT-4 that's behind most of them is swapped out for GPT-5 or 00:00:53.280 |
Gemini 2, all these systems are going to see big and hard to predict upgrades overnight. And that's 00:00:59.840 |
a point that seems especially relevant on this, the one-year anniversary of the release of GPT-4. 00:01:07.040 |
But let's start of course with Devin, billed as the first AI software engineer. Now Devin isn't 00:01:13.440 |
a model, it's a system that's likely based on GPT-4. It's equipped with a code editor, 00:01:19.680 |
shell and browser. So of course it can not just understand your prompt, but look up and read 00:01:26.000 |
documentation. A bit like AutoGPT, it's designed to come up with plans first and then execute them, 00:01:32.560 |
but it does so much better than AutoGPT did. But before we get to the benchmark that everyone's 00:01:38.000 |
talking about, let me show you a 30 second demonstration of Devin in action. All I had 00:01:42.960 |
to do was send this blog post and a message to Devin. From there, Devin actually does all the 00:01:47.760 |
work for me, starting with reading this blog post and figuring out how to run the code. 00:01:52.320 |
In a couple of minutes, Devin's actually made a lot of progress. And if we jump to the middle here, 00:02:00.320 |
you can see that Devin's been able to find and fix some edge cases and bugs that the blog post 00:02:06.080 |
did not cover for me. And if we jump to the end, we can see that Devin sends me the final result, 00:02:12.480 |
which I love. I also got two bonus images here and here. So let me know if you guys see anything 00:02:21.280 |
hidden in these. It can also fine tune a model autonomously. And if you're not familiar, 00:02:26.160 |
think of that as refining a model rather than training it from scratch. That makes me wonder 00:02:31.520 |
about a future where if a model can't succeed at a task, it fine tunes another model or itself 00:02:37.840 |
until it can. Anyway, this is the benchmark that everyone's talking about, SWE-bench, 00:02:43.280 |
Software Engineering Bench. Devin got almost 14% and in this chart crushes Claude 2 and GPT-4, 00:02:50.080 |
which got 1.7%. They say Devin was unassisted, whereas all other models were assisted, 00:02:56.320 |
meaning the model was told exactly which files need to be edited. Before we get too much further 00:03:00.880 |
though, what the hell is this benchmark? Well, unlike many benchmarks, they drew from real world 00:03:06.080 |
professional problems, 2,294 software engineering problems that people had and their corresponding 00:03:13.200 |
solutions. Resolving these issues requires understanding and coordinating changes across 00:03:18.400 |
multiple functions, classes, and files simultaneously. The code involved might 00:03:23.040 |
require the model to process extremely long contexts and perform, they say, complex reasoning. 00:03:29.520 |
These aren't just fill in the blank or multiple choice questions. The model has to understand the 00:03:34.400 |
issue, read through the relevant parts of the code base, remove lines, and add lines. Fixing a bug 00:03:40.480 |
might involve navigating a large repo, understanding the interplay between functions in different files, 00:03:45.920 |
or spotting a small error in convoluted code. On average, a model might need to edit almost 00:03:51.200 |
two files, three functions, and about 33 lines of code. But one point to make clear is that Devin 00:03:56.880 |
was only tested on a subset of this benchmark and the tasks in the benchmark were only a tiny subset 00:04:03.360 |
of GitHub issues. And even all of those issues represent just a subset of the skills of software 00:04:08.960 |
engineering. So when you see all caps videos saying this is AGI, you've got to put it in some context. 00:04:14.800 |
Here's just one example of what I mean. They selected only pull requests, which are like 00:04:19.280 |
proposed solutions, that are merged or accepted, that solve the issue, and that introduce new tests. 00:04:26.000 |
Would that not slightly bias the dataset toward problems that are easier to detect, 00:04:30.640 |
report, and fix? In other words, complex issues might not be adequately represented if they're 00:04:35.680 |
less likely to have straightforward solutions. And narrowing down the proposed solutions to 00:04:40.720 |
only those that introduce new tests could bias towards bugs or features that are easier to write 00:04:45.840 |
tests for. That is to say that highly complex issues, where writing a clear test is difficult, 00:04:51.440 |
may be underrepresented. Now, having said all of that, I might shock you by saying I think that 00:04:56.720 |
there will be rapid improvement in the performance on this benchmark. When Devin is equipped with GPT-5, 00:05:02.480 |
I could see it easily exceeding 50%. Here are just a few reasons why. First, some of these problems 00:05:08.320 |
contained images, and therefore the more multimodal these language models get, the better they'll get. 00:05:13.920 |
Second, and more importantly, a large context window is particularly crucial for this task. 00:05:18.960 |
When the benchmark came out, they said models are simply ineffective at localizing problematic 00:05:24.080 |
code in a sea of tokens. They get distracted by additional context. I don't think that will be 00:05:29.120 |
true for much longer as we've already seen with Gemini 1.5. Third reason, models, they say, 00:05:34.320 |
are often trained using standard code files and likely rarely see patch files. I would bet that 00:05:40.240 |
GPT-5 would have seen everything. Fourth, language models will be augmented, they predict, with 00:05:45.360 |
program analysis and software engineering tools. And it's almost like they could see six months 00:05:50.000 |
in the future because they said, "To this end, we are particularly excited about agent-based 00:05:54.480 |
approaches like Devin for identifying relevant context from a code base." I could go on, but 00:05:59.280 |
hopefully that background on the benchmark allows you to put the rest of what I'm going to say in 00:06:03.840 |
a bit more context. And yes, of course, I saw how Devin was able to complete a real job on Upwork. 00:06:09.760 |
Honestly, I could see these kinds of tasks going the way of copywriting tasks on Upwork. Here's 00:06:14.720 |
some more context though. We don't know the actual cost of running Devin for so long. It actually 00:06:19.040 |
takes quite a while for it to execute on its task. We're talking 15, 20, 30 minutes, even 60 minutes 00:06:25.040 |
sometimes. As Bindu Reddy points out, it can get even more expensive than a human, although costs 00:06:30.240 |
are, of course, falling. Devin, she says, will not be replacing any software engineer in the 00:06:34.640 |
near term. And noted deep learning author François Chollet predicted this. There will be more software 00:06:39.520 |
engineers, the kind that write code, in five years than there are today. And newly unemployed Andrej 00:06:44.800 |
Karpathy says that software engineering is on track to change substantially with humans more 00:06:50.080 |
supervising the automation, pitching in high level commands, ideas, or progression strategies 00:06:55.520 |
in English. I would say with the way things are going, they could pitch it in any language and 00:06:59.840 |
the model will understand. Frankly, with vision models the way they are, you could practically 00:07:04.240 |
mime your code idea and it would understand what to do. And while Devin likely relies on GPT-4, 00:07:10.080 |
other competitors are training their own frontier scale models. Indeed, the startup Magic, which 00:07:16.000 |
aims to build a co-worker, not just a co-pilot for developers, is going a step further. They're not 00:07:21.760 |
even using transformers. They say transformers aren't the final architecture. We have something 00:07:26.000 |
with a multi-million token context window. Super curious, of course, how that performs on SWE bench. 00:07:32.080 |
But the thing I want to emphasize again comes from Bloomberg. Cognition AI admit that Devin 00:07:37.360 |
is very dependent on the underlying models and use GPT-4 together with the reinforcement learning 00:07:43.040 |
techniques. Obviously, that's pretty vague, but imagine when GPT-5 comes out. With scale, 00:07:47.280 |
you get so many things, not just better coding ability. If you remember, GPT-3 couldn't actually 00:07:52.000 |
reflect effectively, whereas GPT-4 could. If GPT-5 is twice or 10 times better at reflecting 00:07:58.800 |
and debugging, that is going to dramatically change the performance of the Devin system 00:08:03.200 |
overnight. Just delete the GPT-4 API and put in the GPT-5 API. And wait, Jeff Clune, 00:08:09.440 |
who I was going to talk about later in this video, has just retweeted one of my own videos. I 00:08:15.120 |
literally just saw this two seconds ago when it came up as a notification on my Twitter account. 00:08:20.320 |
This was not at all supposed to be part of this video, but I am very much honored by that. And 00:08:25.120 |
actually, I'm going to be talking about Jeff Clune later in this video. Chances are he's going to see 00:08:29.040 |
this video, so this is getting very Inception-like. He was key to SIMA, which I'm going to talk about 00:08:33.920 |
next. The simulation hypothesis just got 10% more likely. I'm going to recover from that 00:08:39.600 |
distraction and get back to this video, because there's one more thing to mention about Devin. 00:08:43.920 |
The reaction to that model has been unlike almost anything I've seen. People are genuinely in some 00:08:50.000 |
distress about the implications for jobs. And while I've given the context of what the benchmark does 00:08:55.120 |
mean and doesn't mean, I can't deny that the job landscape is incredibly unpredictable at the 00:09:00.880 |
moment. Indeed, I can't see it ever not being unpredictable. I actually still have a lot of 00:09:05.520 |
optimism about there still being a human economy in the future, but maybe that's a topic for another 00:09:10.800 |
video. I just want to acknowledge that people are scared and these companies should start addressing 00:09:16.160 |
those fears. And I know many of you are getting ready to comment that we want all jobs to go, 00:09:20.800 |
but you might be, I guess, disappointed by the fact that Cognition AI are asking for people to 00:09:27.360 |
apply to join them. So obviously they don't anticipate Devin automating everything just yet. 00:09:32.080 |
But it's time now to talk about Google DeepMind's SIMA, which is all about scaling up agents that 00:09:37.680 |
you can instruct with natural language. Essentially a scalable, instructable, 00:09:43.280 |
multi-world agent. The goal of SIMA being to develop an instructable agent that can 00:09:49.200 |
accomplish anything a human can do in any simulated 3D environment. Their agent uses a mouse and 00:09:56.240 |
keyboard and takes pixels as input. But if you think about it, that's almost everything you do 00:10:01.600 |
on a computer. Yes, this paper is about playing games, but couldn't you apply this technique to, 00:10:06.000 |
say, video editing, or anything you can do on your phone? Now, I know I haven't even told you 00:10:10.480 |
what the SIMA system is, but I'm giving you an idea of the kind of repercussions and implications. 00:10:15.440 |
If these systems work with games, there's so much else they might soon work with. 00:10:19.760 |
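The control loop being described is conceptually tiny: pixels come in, a vision-language model picks a keyboard-and-mouse action, the action is executed, repeat. Here is a minimal sketch of that loop; the `capture_screen`, `policy`, and `execute` hooks are hypothetical stand-ins, since SIMA's actual model and OS bindings aren't public.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    keys: list    # keys to press this step, e.g. ["w"]
    mouse: tuple  # where to move the cursor, as (x, y) pixels

def agent_loop(instruction: str,
               capture_screen: Callable[[], bytes],
               policy: Callable[[str, bytes], Optional[Action]],
               execute: Callable[[Action], None],
               max_steps: int = 600) -> int:
    """Pixels-in, keyboard-and-mouse-out loop of the kind SIMA runs.
    All three hooks are hypothetical placeholders for this sketch."""
    steps = 0
    for _ in range(max_steps):
        frame = capture_screen()             # raw pixels: the only observation
        action = policy(instruction, frame)  # the model picks what a player would do
        if action is None:                   # model judges the task complete
            break
        execute(action)                      # emit the keypress / mouse move
        steps += 1
    return steps
```

Nothing in the loop is game-specific, which is the point: swap the game screen for a phone or a desktop app and the same interface still applies.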
This was a paper I didn't get a chance to talk about that came out about six weeks ago. 00:10:23.680 |
It showed that even current generation models could handle tasks on a phone, 00:10:27.680 |
like navigating on Google Maps, downloading apps on Google Play or somewhat topically with TikTok, 00:10:33.920 |
swiping a video about a pet cat in TikTok and clicking a like for that video. 00:10:38.640 |
No, the success rates weren't perfect, but if you look at the averages and this is for GPT-4 00:10:43.280 |
Vision, they are pretty high, 91%, 82%, 82%. These numbers in the middle, by the way, on the left, 00:10:49.040 |
reflect the number of steps that GPT-4 Vision took and on the right, the number of steps that 00:10:53.360 |
a human took. And that's just GPT-4 Vision, not a model optimized for agency, which we know 00:10:58.960 |
that OpenAI is working on. So before we even get to video games, you can imagine an internet where 00:11:04.480 |
there are models that are downloading, liking, commenting, doing pull requests, and we wouldn't 00:11:10.480 |
even know that it's AI. It would be, as far as I can tell, undetectable. Anyway, I'm getting 00:11:14.880 |
distracted. Back to the SIMA paper. What is SIMA? In a nutshell, they got a bunch of games, including 00:11:20.640 |
commercial video games like Valheim, 12 million copies sold at least, and their own made up games 00:11:26.560 |
that Google created. They then paid a bunch of humans to play those games and gathered the data. 00:11:32.000 |
That's what you could see on the screen, the images and the keyboard and mouse inputs that 00:11:36.800 |
the humans performed. They gave all of that training data to some pre-trained models. And 00:11:41.520 |
at this point, the paper gets quite vague. It doesn't mention parameters or the exact composition 00:11:46.480 |
of these pre-trained models. But from this, we get the SIMA agent, which then plays these games, 00:11:52.320 |
or more precisely, tries 10 second tasks within these games. This gives you an idea of the range 00:11:58.320 |
of tasks, everything from taming and hunting to destroying and headbutting. But I don't want to 00:12:04.160 |
bury the lede. The main takeaway is this. Training on more games saw positive transfer when SIMA 00:12:10.400 |
played on a new game. And notice how SIMA in purple across all of these games outperforms 00:12:16.320 |
an environment specialized agent. That's one trained for just one game. And there is another 00:12:22.000 |
gem buried in this graph. I'm colorblind, but I'm pretty sure that's teal or lighter blue. That's 00:12:27.200 |
zero shot. What that represents is when the model was trained across all the other games, bar the 00:12:32.960 |
actual game it was about to be tested in. And so notice how in some games like Goat Simulator 3, 00:12:38.880 |
that outperformed a model that was specialized for just that one game. The transfer effect was 00:12:45.280 |
so powerful, it outdid the specialized training. Indeed, SIMA's performance is approaching the 00:12:50.960 |
ballpark of human performance. Now, I know we've seen that already with StarCraft II and OpenAI 00:12:56.560 |
beating Dota, but this would be a model generalizing to almost any video game. Yes, 00:13:01.360 |
even Red Dead Redemption 2, which was covered in an entirely separate paper out of Beijing. 00:13:06.320 |
That paper, they say, was the first to enable language models to follow the main storyline 00:13:11.280 |
and finish real missions in complex AAA games. This time we're talking about things like protecting a 00:13:16.640 |
character, buying supplies, equipping shotguns. Again, what was holding them back was the 00:13:21.440 |
underlying model, GPT-4V. As I've covered elsewhere on the channel, it lacks in spatial perception. 00:13:27.040 |
It's not super accurate with moving the cursor, for example. But visual understanding and 00:13:31.680 |
performance is getting better fast. Take the challenging benchmark MMMU. It's about answering 00:13:37.920 |
difficult questions that have a visual component. The benchmark only came out recently, giving top 00:13:42.480 |
performance to GPT-4V at 56.8%, but that's already been superseded. Take Claude 3 Opus, 00:13:48.960 |
which gets 59.4%. Yes, there is still a gap with human expert performance, but that gap 00:13:54.640 |
is narrowing, like we've seen across this video. Just like Devin was solving real-world software 00:13:59.520 |
engineering challenges, SIMA and other models are solving real-world games. Walking the walk, 00:14:05.440 |
not just talking the talk. And again, we can expect better and better results the more games 00:14:10.640 |
SIMA is trained on. As the paper says, in every case, SIMA significantly outperforms the environment 00:14:16.240 |
specialized agent, thus demonstrating positive transfer across environments. And this is exactly 00:14:21.680 |
what we see in robotics as well. The key take-home from that Google DeepMind paper was that our 00:14:27.120 |
results suggest that co-training with data from other platforms imbues RT-2-X in robotics with 00:14:34.080 |
additional skills that were not present in the original dataset, enabling it to perform novel 00:14:38.560 |
tasks. These were tasks and skills developed by other robots that were then transferred to RT-2, 00:14:44.880 |
just like SIMA getting better at one video game by training on others. But did you notice there 00:14:50.240 |
that smooth segue I did to robotics? It's the final container that I want to quickly talk about. 00:14:56.880 |
Why do I call this humanoid robot a container? Because it contains GPT-4 vision. Yes, of course, 00:15:03.280 |
its real-time speed and dexterity are very impressive, but that intelligence of recognizing 00:15:09.200 |
what's on the table and moving it appropriately comes from the underlying model GPT-4 vision. 00:15:14.320 |
So, of course, I have to make the same point that the underlying model could easily be upgraded to 00:15:19.280 |
GPT-5 when it comes out. This humanoid would have a much deeper understanding of its environment, 00:15:24.960 |
and of you as you're talking to it. Figure 01 takes in 10 images per second and this is not teleoperation. 00:15:31.360 |
This is an end-to-end neural network. In other words, there's no human behind the scenes controlling 00:15:36.400 |
this robot. Figure don't release pricing, but the estimate is between $30,000 and $150,000 per robot. 00:15:44.320 |
Still too pricey for most companies and individuals, but the CEO has a striking vision. 00:15:50.240 |
He basically wants to completely automate manual labor. This is the roadmap to a positive future 00:15:56.960 |
powered by AI. He wants to build the largest company on the planet and eliminate the need 00:16:02.560 |
for unsafe and undesirable jobs. The obvious question is if it can do those jobs, can't it 00:16:08.400 |
also do the safe and desirable jobs? I know I'm back to the jobs point again, but all of these 00:16:13.680 |
questions became a bit more relevant, let's say, in the last 48 hours. The Figure CEO goes on to 00:16:19.200 |
predict that everywhere from factories to farmland, the cost of labor will decrease until it becomes 00:16:24.960 |
equivalent to the price of renting a robot, facilitating a long-term holistic reduction in 00:16:30.160 |
costs. Over time, humans could leave the loop altogether as robots become capable of building 00:16:35.840 |
other robots, driving prices down even more. Manual labor, he says, could become optional. 00:16:41.920 |
And if that's not a big enough vision for the next two decades, he goes on that the plan is also 00:16:46.800 |
to use these robots to build new worlds on other planets. Again, though, we get the reassurance 00:16:52.560 |
that our focus is on providing resources for jobs that humans don't want to perform. He also excludes 00:16:58.400 |
military applications. I just feel like his company and the world has a bit less control 00:17:03.680 |
over how the technology is going to be used than he might think it does. Indeed, Jeff Clune, of 00:17:08.560 |
OpenAI, Google DeepMind, SIMA, and, earlier in this video, fame, reposted this from Edward Harris. 00:17:16.400 |
It was a report commissioned by the US government that he worked on, and the TLDR was that things 00:17:22.160 |
are worse than we thought and nobody's in control. I definitely feel we're noticeably closer to AGI 00:17:28.080 |
this week than we were last week. As Jeff Clune put out yesterday, so many pieces of the AGI puzzle 00:17:34.000 |
are coming together. And I would also agree that as of today, no one's really in control. And we're 00:17:39.920 |
not alone with Jensen Huang, the CEO of NVIDIA, saying that AI will pass every human test in 00:17:46.320 |
around five years time. That, by the way, is a timeline shared by Sam Altman. This is a quote 00:17:52.000 |
from a book that's coming out soon. He was asked about what AGI means for marketers. He said, "Oh, 00:17:57.280 |
for that, it will mean that 95% of what marketers use agencies, strategists, and creative professionals 00:18:02.880 |
for today will easily, nearly instantly, and at almost no cost be handled by the AI. And the AI 00:18:09.280 |
will likely be able to test its creative outputs against real or synthetic customer focus groups 00:18:14.880 |
for predicting results and optimizing. Again, all free, instant, and nearly perfect. Images, videos, 00:18:20.160 |
campaign ideas, no problem." But specifically on timelines, he said this. When asked about when AGI 00:18:25.280 |
will be a reality, he said, "Five years, give or take, maybe slightly longer, but no one knows 00:18:30.400 |
exactly when or what it will mean for society." And it's not like that timeline is even unrealistic 00:18:36.320 |
in terms of compute. Using these estimates from SemiAnalysis, I calculated that just between 00:18:40.960 |
quarter one of 2024 and the fourth quarter of 2025, there will be a 14x increase in compute. 00:18:47.040 |
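The arithmetic behind that is worth sketching, because the answer depends on the window you assume: a roughly 14x hardware build-out, compounded with algorithmic efficiency doubling about every nine months, gives about 70x over the 21 months from Q1 2024 to Q4 2025, and climbs toward 100x if you stretch the window to a full two years. Both inputs are rough estimates, not measurements.

```python
# Back-of-the-envelope effective compute, using the video's two inputs:
# a ~14x hardware build-out, and algorithmic efficiency doubling roughly
# every nine months. Rough estimates only.
HARDWARE_GROWTH = 14.0
DOUBLING_MONTHS = 9.0

def effective_compute(months: float) -> float:
    """Hardware growth times compounded algorithmic efficiency gains."""
    efficiency_gain = 2.0 ** (months / DOUBLING_MONTHS)
    return HARDWARE_GROWTH * efficiency_gain

print(round(effective_compute(21)))  # Q1 2024 -> Q4 2025: ~71x
print(round(effective_compute(24)))  # a full two years: ~89x, "almost a hundred"
```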
Then if you factor in algorithmic efficiency doubling about every nine months, the effective 00:18:51.840 |
compute at the end of next year will be almost a hundred times that of right now. So yes, the world 00:18:58.240 |
is changing and changing fast and the public really need to start paying attention. But no, 00:19:04.320 |
Devin is not AGI, no matter how much you put it in all caps. Thank you so much for watching to 00:19:09.840 |
the end. And of course, I'd love to see you over on AI Insiders on Patreon. I'd love to see you 00:19:14.880 |
there, but regardless, thank you so much for watching and as always have a wonderful day.