
AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs



00:00:00.000 | Three developments in the last 48 hours show how we are moving into an era in which AI models
00:00:06.240 | can walk the walk, not just talk the talk. Whether the developments quite meet the hype
00:00:11.920 | attached to them is another question. I've read and analysed in full the three relevant papers
00:00:17.360 | and associated posts to find out more. We'll first explore Devin, the AI system your boss
00:00:22.960 | told you not to worry about. Then Google DeepMind's SIMA, which spends most of its time
00:00:27.760 | playing video games. And then Figure 01, the humanoid robot which likes to talk while doing
00:00:33.120 | the dishes. But the TL;DW is this. These three systems are each a long way from human performance
00:00:40.880 | in their domains, but think of them more as containers or shells for the vision language
00:00:46.400 | models powering them. So when the GPT-4 that's behind most of them is swapped out for GPT-5 or
00:00:53.280 | Gemini 2, all these systems are going to see big and hard to predict upgrades overnight. And that's
00:00:59.840 | a point that seems especially relevant on this, the one year anniversary of the release of GPT-4.
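(A quick illustration of that "container" point: in a tool-using agent, the backbone model is usually just a configuration value, while the scaffolding around it stays fixed. The sketch below is my own, with made-up names; it is not code from Cognition, DeepMind or Figure.)

```python
# Minimal sketch of an agent as a "shell" around a swappable backbone model.
# All names here are illustrative; no vendor's actual API is being shown.
from dataclasses import dataclass, field

BACKBONE_MODEL = "gpt-4"  # swap this string for a newer model and the whole agent upgrades


@dataclass
class AgentShell:
    model: str = BACKBONE_MODEL
    tools: list = field(default_factory=lambda: ["code_editor", "shell", "browser"])
    history: list = field(default_factory=list)

    def call_backbone(self, prompt: str) -> str:
        # Stand-in for a real chat-completion call to self.model.
        return f"[{self.model}] plan for: {prompt}"

    def step(self, task: str) -> str:
        # The planning loop, tools and prompts don't change when the model does.
        plan = self.call_backbone(task)
        self.history.append(plan)
        return plan
```

Swapping the backbone is a one-line change, which is why all three systems stand to improve the moment a stronger model is dropped in.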
00:01:07.040 | But let's start of course with Devin, billed as the first AI software engineer. Now Devin isn't
00:01:13.440 | a model, it's a system that's likely based on GPT-4. It's equipped with a code editor,
00:01:19.680 | shell and browser. So of course it can not just understand your prompt, but look up and read
00:01:26.000 | documentation. A bit like AutoGPT, it's designed to come up with plans first and then execute them,
00:01:32.560 | but it does so much better than AutoGPT did. But before we get to the benchmark that everyone's
00:01:38.000 | talking about, let me show you a 30 second demonstration of Devin in action. All I had
00:01:42.960 | to do was send this blog post and a message to Devin. From there, Devin actually does all the
00:01:47.760 | work for me, starting with reading this blog post and figuring out how to run the code.
00:01:52.320 | In a couple minutes, Devin's actually made a lot of progress. And if we jump to the middle here,
00:02:00.320 | you can see that Devin's been able to find and fix some edge cases and bugs that the blog post
00:02:06.080 | did not cover for me. And if we jump to the end, we can see that Devin sends me the final result,
00:02:12.480 | which I love. I also got two bonus images here and here. So let me know if you guys see anything
00:02:21.280 | hidden in these. It can also fine tune a model autonomously. And if you're not familiar,
00:02:26.160 | think of that as refining a model rather than training it from scratch. That makes me wonder
00:02:31.520 | about a future where if a model can't succeed at a task, it fine tunes another model or itself
00:02:37.840 | until it can. Anyway, this is the benchmark that everyone's talking about, SWE Bench,
00:02:43.280 | Software Engineering Bench. Devin got almost 14% and in this chart crushes Claude 2 and GPT-4,
00:02:50.080 | which got 1.7%. They say Devin was unassisted, whereas all other models were assisted,
00:02:56.320 | meaning the model was told exactly which files need to be edited. Before we get too much further
00:03:00.880 | though, what the hell is this benchmark? Well, unlike many benchmarks, they drew from real world
00:03:06.080 | professional problems, 2,294 software engineering problems that people had and their corresponding
00:03:13.200 | solutions. Resolving these issues requires understanding and coordinating changes across
00:03:18.400 | multiple functions, classes, and files simultaneously. The code involved might
00:03:23.040 | require the model to process extremely long contexts and perform, they say, complex reasoning.
00:03:29.520 | These aren't just fill in the blank or multiple choice questions. The model has to understand the
00:03:34.400 | issue, read through the relevant parts of the code base, remove lines, and add lines. Fixing a bug
00:03:40.480 | might involve navigating a large repo, understanding the interplay between functions in different files,
00:03:45.920 | or spotting a small error in convoluted code. On average, a model might need to edit almost
00:03:51.200 | two files, three functions, and about 33 lines of code. But one point to make clear is that Devin
00:03:56.880 | was only tested on a subset of this benchmark and the tasks in the benchmark were only a tiny subset
00:04:03.360 | of GitHub issues. And even all of those issues represent just a subset of the skills of software
00:04:08.960 | engineering. So when you see all caps videos saying this is AGI, you've got to put it in some context.
00:04:14.800 | Here's just one example of what I mean. They selected only pull requests, which are like
00:04:19.280 | proposed solutions, that are merged or accepted, that solve the issue, and that introduce new tests.
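(To make that concrete, here is roughly what one SWE-bench-style task instance contains and how a submission gets scored. The field names are my own approximation of the benchmark's schema, not its exact format.)

```python
# Illustrative SWE-bench-style task instance; field names approximate the real schema.
example_instance = {
    "repo": "example-org/example-lib",                  # drawn from a real, popular Python repository
    "base_commit": "abc123",                            # repository state before the fix
    "problem_statement": "TypeError when parsing an empty config file ...",  # the GitHub issue text
    "gold_patch": "diff --git a/src/config.py ...",     # the merged PR's code change (hidden from the model)
    "test_patch": "diff --git a/tests/test_config.py ...",  # the new tests that PR introduced
}


def is_resolved(model_patch: str, instance: dict, run_tests) -> bool:
    """A task counts as resolved only if, after applying the model's patch to the repo
    at base_commit, the tests introduced by the original pull request pass."""
    return run_tests(
        repo=instance["repo"],
        base_commit=instance["base_commit"],
        patch=model_patch,
        tests=instance["test_patch"],
    )
```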
00:04:26.000 | Would that not slightly bias the dataset toward problems that are easier to detect,
00:04:30.640 | report, and fix? In other words, complex issues might not be adequately represented if they're
00:04:35.680 | less likely to have straightforward solutions. And narrowing down the proposed solutions to
00:04:40.720 | only those that introduce new tests could bias towards bugs or features that are easier to write
00:04:45.840 | tests for. That is to say that highly complex issues, where writing a clear test is difficult,
00:04:51.440 | may be underrepresented. Now, having said all of that, I might shock you by saying I think that
00:04:56.720 | there will be rapid improvement in the performance on this benchmark. When Devin is equipped with GPT-5,
00:05:02.480 | I could see it easily exceeding 50%. Here are just a few reasons why. First, some of these problems
00:05:08.320 | contained images, and therefore the more multimodal these language models get, the better they'll get.
00:05:13.920 | Second, and more importantly, a large context window is particularly crucial for this task.
00:05:18.960 | When the benchmark came out, they said models are simply ineffective at localizing problematic
00:05:24.080 | code in a sea of tokens. They get distracted by additional context. I don't think that will be
00:05:29.120 | true for much longer as we've already seen with Gemini 1.5. Third reason, models, they say,
00:05:34.320 | are often trained using standard code files and likely rarely see patch files. I would bet that
00:05:40.240 | GPT-5 would have seen everything. Fourth, language models will be augmented, they predict, with
00:05:45.360 | program analysis and software engineering tools. And it's almost like they could see six months
00:05:50.000 | in the future because they said, "To this end, we are particularly excited about agent-based
00:05:54.480 | approaches like Devin for identifying relevant context from a code base." I could go on, but
00:05:59.280 | hopefully that background on the benchmark allows you to put the rest of what I'm going to say in
00:06:03.840 | a bit more context. And yes, of course, I saw how Devin was able to complete a real job on Upwork.
00:06:09.760 | Honestly, I could see these kinds of tasks going the way of copywriting tasks on Upwork. Here's
00:06:14.720 | some more context though. We don't know the actual cost of running Devin for so long. It actually
00:06:19.040 | takes quite a while for it to execute on its task. We're talking 15, 20, 30 minutes, even 60 minutes
00:06:25.040 | sometimes. As Bindu Reddy points out, it can get even more expensive than a human, although costs
00:06:30.240 | are, of course, falling. Devin, she says, will not be replacing any software engineer in the
00:06:34.640 | near term. And noted deep learning author François Chollet predicted this. There will be more software
00:06:39.520 | engineers, the kind that write code, in five years than there are today. And newly unemployed Andrej
00:06:44.800 | Karpathy says that software engineering is on track to change substantially with humans more
00:06:50.080 | supervising the automation, pitching in high level commands, ideas, or progression strategies
00:06:55.520 | in English. I would say with the way things are going, they could pitch it in any language and
00:06:59.840 | the model will understand. Frankly, with vision models the way they are, you could practically
00:07:04.240 | mime your code idea and it would understand what to do. And while Devin likely relies on GPT-4,
00:07:10.080 | other competitors are training their own frontier scale models. Indeed, the startup Magic, which
00:07:16.000 | aims to build a co-worker, not just a co-pilot for developers, is going a step further. They're not
00:07:21.760 | even using transformers. They say transformers aren't the final architecture. We have something
00:07:26.000 | with a multi-million token context window. Super curious, of course, how that performs on SWE bench.
00:07:32.080 | But the thing I want to emphasize again comes from Bloomberg. Cognition AI admit that Devin
00:07:37.360 | is very dependent on the underlying models and uses GPT-4 together with reinforcement learning
00:07:43.040 | techniques. Obviously, that's pretty vague, but imagine when GPT-5 comes out. With scale,
00:07:47.280 | you get so many things, not just better coding ability. If you remember, GPT-3 couldn't actually
00:07:52.000 | reflect effectively, whereas GPT-4 could. If GPT-5 is twice or 10 times better at reflecting
00:07:58.800 | and debugging, that is going to dramatically change the performance of the Devin system
00:08:03.200 | overnight. Just delete the GPT-4 API and put in the GPT-5 API. And wait, Jeff Clune,
00:08:09.440 | who I was going to talk about later in this video, has just retweeted one of my own videos. I
00:08:15.120 | literally just saw this two seconds ago when it came up as a notification on my Twitter account.
00:08:20.320 | This was not at all supposed to be part of this video, but I am very much honored by that. And
00:08:25.120 | actually, I'm going to be talking about Jeff Clune later in this video. Chances are he's going to see
00:08:29.040 | this video, so this is getting very Inception-like. He was key to SIMA, which I'm going to talk about
00:08:33.920 | next. The simulation hypothesis just got 10% more likely. I'm going to recover from that
00:08:39.600 | distraction and get back to this video, because there's one more thing to mention about Devin.
00:08:43.920 | The reaction to that system has been unlike almost anything I've seen. People are genuinely in some
00:08:50.000 | distress about the implications for jobs. And while I've given the context of what the benchmark does
00:08:55.120 | mean and doesn't mean, I can't deny that the job landscape is incredibly unpredictable at the
00:09:00.880 | moment. Indeed, I can't see it ever not being unpredictable. I actually still have a lot of
00:09:05.520 | optimism about there still being a human economy in the future, but maybe that's a topic for another
00:09:10.800 | video. I just want to acknowledge that people are scared and these companies should start addressing
00:09:16.160 | those fears. And I know many of you are getting ready to comment that we want all jobs to go,
00:09:20.800 | but you might be, I guess, disappointed by the fact that Cognition AI are asking for people to
00:09:27.360 | apply to join them. So obviously they don't anticipate Devin automating everything just yet.
00:09:32.080 | But it's time now to talk about Google DeepMind's SIMA, which is all about scaling up agents that
00:09:37.680 | you can instruct with natural language. Essentially a scalable, instructable,
00:09:43.280 | multi-world agent. The goal of SIMA being to develop an instructable agent that can
00:09:49.200 | accomplish anything a human can do in any simulated 3D environment. Their agent uses a mouse and
00:09:56.240 | keyboard and takes pixels as input. But if you think about it, that's almost everything you do
00:10:01.600 | on a computer. Yes, this paper is about playing games, but couldn't you apply this technique to
00:10:06.000 | say video editing or say anything you can do on your phone. Now, I know I haven't even told you
00:10:10.480 | what the SIMA system is, but I'm giving you an idea of the kind of repercussions, implications.
00:10:15.440 | If these systems work with games, there's so much else they might soon work with.
00:10:19.760 | This was a paper I didn't get a chance to talk about that came out about six weeks ago.
00:10:23.680 | It showed that even current generation models could handle tasks on a phone,
00:10:27.680 | like navigating on Google Maps, downloading apps on Google Play or somewhat topically with TikTok,
00:10:33.920 | swiping a video about a pet cat in TikTok and clicking a like for that video.
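(The loop behind that kind of phone agent is conceptually simple: screenshot in, one UI action out, repeat until the model says it's done. Here's a minimal sketch with my own hypothetical function names, not the paper's code.)

```python
# Illustrative screen-agent loop; the callables passed in are hypothetical stand-ins.
def run_phone_agent(task: str, take_screenshot, ask_vlm, perform_action, max_steps: int = 20) -> bool:
    for _ in range(max_steps):
        screenshot = take_screenshot()                 # raw pixels, the same input a human sees
        action = ask_vlm(task=task, image=screenshot)  # e.g. {"type": "tap", "x": 540, "y": 1200}
        if action.get("type") == "done":
            return True                                # the model judges the task complete
        perform_action(action)                         # tap, swipe, type text, press back, etc.
    return False                                       # step budget exhausted
```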
00:10:38.640 | No, the success rates weren't perfect, but if you look at the averages and this is for GPT-4
00:10:43.280 | Vision, they are pretty high, 91%, 82%, 82%. These numbers in the middle, by the way, on the left,
00:10:49.040 | reflect the number of steps that GPT-4 Vision took and on the right, the number of steps that
00:10:53.360 | a human took. And that's just GPT-4 Vision, not a model optimized for agency, which we know
00:10:58.960 | that OpenAI is working on. So before we even get to video games, you can imagine an internet where
00:11:04.480 | there are models that are downloading, liking, commenting, doing pull requests, and we wouldn't
00:11:10.480 | even know that it's AI. It would be, as far as I can tell, undetectable. Anyway, I'm getting
00:11:14.880 | distracted. Back to the SIMA paper. What is SIMA? In a nutshell, they got a bunch of games, including
00:11:20.640 | commercial video games like Valheim, 12 million copies sold at least, and their own made up games
00:11:26.560 | that Google created. They then paid a bunch of humans to play those games and gathered the data.
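(Each training example in that kind of setup pairs what the player saw and was asked to do with the keys and mouse movements they produced, and the agent is trained to imitate those actions. A rough behavioral-cloning sketch, my own illustration rather than SIMA's published code:)

```python
# Illustrative behavioral-cloning setup for an instructable game agent.
# The record format, loss, and `policy.log_prob` interface are my own sketch, not SIMA's implementation.
from dataclasses import dataclass


@dataclass
class GameplayRecord:
    frames: list        # recent screen images (pixels), as the human player saw them
    instruction: str    # the natural-language task, e.g. "chop down a tree"
    action: dict        # the human's output, e.g. {"keys": ["w"], "mouse_delta": (0.1, -0.2)}


def behavioral_cloning_loss(policy, batch: list) -> float:
    """Average negative log-likelihood of the human's action, given pixels and instruction."""
    total = 0.0
    for record in batch:
        total -= policy.log_prob(record.action,
                                 frames=record.frames,
                                 instruction=record.instruction)
    return total / len(batch)
```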
00:11:32.000 | That's what you could see on the screen, the images and the keyboard and mouse inputs that
00:11:36.800 | the humans performed. They gave all of that training data to some pre-trained models. And
00:11:41.520 | at this point, the paper gets quite vague. It doesn't mention parameters or the exact composition
00:11:46.480 | of these pre-trained models. But from this, we get the SIMA agent, which then plays these games,
00:11:52.320 | or more precisely, tries 10 second tasks within these games. This gives you an idea of the range
00:11:58.320 | of tasks, everything from taming and hunting to destroying and headbutting. But I don't want to
00:12:04.160 | bury the lead. The main takeaway is this. Training on more games saw positive transfer when SIMA
00:12:10.400 | played on a new game. And notice how SIMA in purple across all of these games outperforms
00:12:16.320 | an environment specialized agent. That's one trained for just one game. And there is another
00:12:22.000 | gem buried in this graph. I'm colorblind, but I'm pretty sure that's teal or lighter blue. That's
00:12:27.200 | zero shot. What that represents is when the model was trained across all the other games, bar the
00:12:32.960 | actual game it was about to be tested in. And so notice how in some games like Goat Simulator 3,
00:12:38.880 | that outperformed a model that was specialized for just that one game. The transfer effect was
00:12:45.280 | so powerful, it outdid the specialized training. Indeed, SIMA's performance is approaching the
00:12:50.960 | ballpark of human performance. Now, I know we've seen that already with Starcraft 2 and OpenAI
00:12:56.560 | beating Dota, but this would be a model generalizing to almost any video game. Yes,
00:13:01.360 | even Red Dead Redemption 2, which was covered in an entirely separate paper out of Beijing.
00:13:06.320 | That paper, they say, was the first to enable language models to follow the main storyline
00:13:11.280 | and finish real missions in complex AAA games. This time we're talking about things like protecting a
00:13:16.640 | character, buying supplies, equipping shotguns. Again, what was holding them back was the
00:13:21.440 | underlying model, GPT-4V. As I've covered elsewhere on the channel, it lacks in spatial perception.
00:13:27.040 | It's not super accurate with moving the cursor, for example. But visual understanding and
00:13:31.680 | performance is getting better fast. Take the challenging benchmark MMMU. It's about answering
00:13:37.920 | difficult questions that have a visual component. The benchmark only came out recently, giving top
00:13:42.480 | performance to GPT-4V at 56.8%, but that's already been superseded. Take Claude 3 Opus,
00:13:48.960 | which gets 59.4%. Yes, there is still a gap with human expert performance, but that gap
00:13:54.640 | is narrowing, like we've seen across this video. Just like Devin was solving real-world software
00:13:59.520 | engineering challenges, SIMA and other models are solving real-world games. Walking the walk,
00:14:05.440 | not just talking the talk. And again, we can expect better and better results the more games
00:14:10.640 | SIMA is trained on. As the paper says, in every case, SIMA significantly outperforms the environment
00:14:16.240 | specialized agent, thus demonstrating positive transfer across environments. And this is exactly
00:14:21.680 | what we see in robotics as well. The key take-home from that Google DeepMind paper was that our
00:14:27.120 | results suggest that co-training with data from other platforms imbues RT-2-X with
00:14:34.080 | additional skills that were not present in the original dataset, enabling it to perform novel
00:14:38.560 | tasks. These were tasks and skills developed by other robots that were then transferred to RT-2-X,
00:14:44.880 | just like SIMA getting better at one video game by training on others. But did you notice there
00:14:50.240 | that smooth segue I did to robotics? It's the final container that I want to quickly talk about.
00:14:56.880 | Why do I call this humanoid robot a container? Because it contains GPT-4 Vision. Yes, of course,
00:15:03.280 | its real-time speed and dexterity are very impressive, but that intelligence of recognizing
00:15:09.200 | what's on the table and moving it appropriately comes from the underlying model, GPT-4 Vision.
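(For what it's worth, here's a hedged sketch of how that kind of container might be wired together: camera frames and transcribed speech go to the vision-language model, which picks a reply and a high-level behavior, and a separate learned low-level policy turns that into joint motions. Every name below is my own guess, not Figure's actual stack.)

```python
# Hypothetical wiring for a VLM-driven humanoid; names and structure are illustrative only.
def robot_step(camera_frame, user_speech_text, vlm, skill_policies):
    """One high-level decision: the VLM chooses what to say and do; a low-level policy does it."""
    decision = vlm(
        image=camera_frame,
        text=user_speech_text,
        options=list(skill_policies),              # e.g. ["pick_up_apple", "place_dish_in_rack"]
    )
    reply_text = decision["speech"]                # what the robot says back to the person
    skill = skill_policies[decision["skill"]]      # learned low-level behavior chosen by the VLM
    joint_commands = skill(camera_frame)           # in practice this runs at a much higher control rate
    return reply_text, joint_commands
```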
00:15:14.320 | So, of course, I have to make the same point that the underlying model could easily be upgraded to
00:15:19.280 | GPT-5 when it comes out. This humanoid would have a much deeper understanding of its environment
00:15:24.960 | and you as you're talking to it. Figure 01 takes in 10 images per second and this is not teleoperation.
00:15:31.360 | This is an end-to-end neural network. In other words, there's no human behind the scenes controlling
00:15:36.400 | this robot. Figure don't release pricing, but the estimate is between $30,000 and $150,000 per robot.
00:15:44.320 | Still too pricey for most companies and individuals, but the CEO has a striking vision.
00:15:50.240 | He basically wants to completely automate manual labor. This is the roadmap to a positive future
00:15:56.960 | powered by AI. He wants to build the largest company on the planet and eliminate the need
00:16:02.560 | for unsafe and undesirable jobs. The obvious question is if it can do those jobs, can't it
00:16:08.400 | also do the safe and desirable jobs? I know I'm back to the jobs point again, but all of these
00:16:13.680 | questions became a bit more relevant, let's say, in the last 48 hours. The Figure CEO goes on to
00:16:19.200 | predict that everywhere from factories to farmland, the cost of labor will decrease until it becomes
00:16:24.960 | equivalent to the price of renting a robot, facilitating a long-term holistic reduction in
00:16:30.160 | costs. Over time, humans could leave the loop altogether as robots become capable of building
00:16:35.840 | other robots, driving prices down even more. Manual labor, he says, could become optional.
00:16:41.920 | And if that's not a big enough vision for the next two decades, he goes on that the plan is also
00:16:46.800 | to use these robots to build new worlds on other planets. Again, though, we get the reassurance
00:16:52.560 | that our focus is on providing resources for jobs that humans don't want to perform. He also excludes
00:16:58.400 | military applications. I just feel like his company and the world has a bit less control
00:17:03.680 | over how the technology is going to be used than he might think it does. Indeed, Jeff Clune of
00:17:08.560 | OpenAI, Google DeepMind, SIMA, and, earlier on in this video, fame, reposted this from Edward Harris.
00:17:16.400 | It was a report commissioned by the US government that he worked on, and the TLDR was that things
00:17:22.160 | are worse than we thought and nobody's in control. I definitely feel we're noticeably closer to AGI
00:17:28.080 | this week than we were last week. As Jeff Clune put out yesterday, so many pieces of the AGI puzzle
00:17:34.000 | are coming together. And I would also agree that as of today, no one's really in control. And we're
00:17:39.920 | not alone with Jensen Huang, the CEO of NVIDIA, saying that AI will pass every human test in
00:17:46.320 | around five years time. That, by the way, is a timeline shared by Sam Altman. This is a quote
00:17:52.000 | from a book that's coming out soon. He was asked about what AGI means for marketers. He said, "Oh,
00:17:57.280 | for that, it will mean that 95% of what marketers use agencies, strategists, and creative professionals
00:18:02.880 | for today will easily, nearly instantly, and at almost no cost be handled by the AI. And the AI
00:18:09.280 | will likely be able to test its creative outputs against real or synthetic customer focus groups
00:18:14.880 | for predicting results and optimizing. Again, all free, instant, and nearly perfect. Images, videos,
00:18:20.160 | campaign ideas, no problem." But specifically on timelines, he said this. When asked about when AGI
00:18:25.280 | will be a reality, he said, "Five years, give or take, maybe slightly longer, but no one knows
00:18:30.400 | exactly when or what it will mean for society." And it's not like that timeline is even unrealistic
00:18:36.320 | in terms of compute. Using these estimates from SemiAnalysis, I calculated that just between
00:18:40.960 | quarter one of 2024 and the fourth quarter of 2025, there will be a 14x increase in compute.
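(To spell out the arithmetic behind that estimate, and the algorithmic-efficiency step that comes next, here's my own back-of-envelope version; the exact window and doubling time are rounded assumptions.)

```python
# Back-of-envelope effective-compute estimate; every input is a rough, rounded assumption.
hardware_multiplier = 14              # estimated raw-compute growth, Q1 2024 -> Q4 2025 (see above)
months = 24                           # roughly the start of Q1 2024 to the end of Q4 2025
doubling_period_months = 9            # assumed algorithmic-efficiency doubling time

algorithmic_multiplier = 2 ** (months / doubling_period_months)       # ~6.3x
effective_compute = hardware_multiplier * algorithmic_multiplier      # ~89x, i.e. close to 100x

print(round(algorithmic_multiplier, 1), round(effective_compute))
```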
00:18:47.040 | Then if you factor in algorithmic efficiency doubling about every nine months, the effective
00:18:51.840 | compute at the end of next year will be almost a hundred times that of right now. So yes, the world
00:18:58.240 | is changing and changing fast and the public really need to start paying attention. But no,
00:19:04.320 | Devin is not AGI, no matter how much you put it in all caps. Thank you so much for watching to
00:19:09.840 | the end. And of course, I'd love to see you over on AI Insiders on Patreon. I'd love to see you
00:19:14.880 | there, but regardless, thank you so much for watching and as always have a wonderful day.