On days like today, I really do feel sorry for the public, for you guys essentially trying to keep up with AI news and the kind of questions you might have. Did OpenAI just automate my job with Operator? Who knows, you're probably thinking, it's $200 and it's really hard to get past those clickbait headlines.
Did the US government just invest half a trillion dollars into a Stargate? What the hell is that? I heard China just caught up to the West in AI with something called DeepSeek, and why are people talking about Humanity's Last Exam? So to try to help a little bit, I'm going to cover the 9 developments of the last 100 hours, honestly, if not exhaustively.
Yes, of course, I read the DeepSeek paper in full and have spent hours testing the OpenAI Operator, as well as the Perplexity Assistant and all the rest. In case you're wondering, yes, I did edit this Perplexity response. First up in this listicle video is, of course, OpenAI's Operator and it's kind of decent, it's okay.
You have to use a VPN if you're not in the US and honestly, I wouldn't do that for the functionality. I would do it if you want to kind of get a sense of where agents are going. Just straight up front, though, I can tell you that it's nowhere close to automating any kind of job for two major reasons.
One is that it often gets stuck in these kinds of loops where it attempts the same basic failed plan again and again. It essentially isn't smart enough to get itself out of them. The second big reason is OpenAI's own impositions on what the model can't do, understandable impositions.
I tried about 20 random tasks and the truth is you can never fully relax. You always have to keep going back to the system and saying, yes, proceed. And no, you can't manually hard-override that. You can't put it in the prompt not to ask permission. Then, of course, many sites have CAPTCHAs that you have to manually take over and input.
I'm pretty certain if you iterate again and again on a prompt, you can develop a workflow that you could save on the top right and share and maybe save a little bit of time for certain tasks. At the moment, though, if I'm being honest, it is a bit of a stretch to say that it's useful.
But if we step back, you can see where all of this is going. This Operator has a ton of safeguards that might slow it down, but people will just migrate to ones that don't have those safeguards. Ones where downloading files is easier and CAPTCHAs are done for you. That'll be great for usability, but not so great for the dead Internet theory.
Then there's just flat out mistakes. I read the system card in full for OpenAI's operator and it was quite revealing. The operator is known to make "irreversible mistakes" like sending an email to the wrong recipient or having an incorrectly dated reminder for the user to take their medication. And yes, they did reduce those mistakes, but not eliminate them.
Also, when I talked about those confirmations that the model asked before proceeding to the next step, that happens most times, but not every time. Sometimes the operator just goes ahead and does it, which is a good thing or a bad thing, depending on your perspective. You'll be quite glad, I guess, to know that when it's asked to do things like make banking transactions, it refuses at a rate of around 94%.
Then you might be wondering, what about if the Operator navigates to a malicious site that's trying to trick it? Well, there was at least one case where it did exactly that and didn't notice. OpenAI are aware of this, though, and have an extra layer of safety on top called a prompt injection monitor, checking to see if sites are trying to trick the Operator.
And it did catch this concerning example, but there's one problem. It, too, fails around 1% of the time. They, of course, commit to rapidly updating it in response to newly discovered attacks. But there is a slim chance things could go wrong at every layer. In my last video, if you remember, I gave you early leaked results on its performance in various computer use benchmarks and web browsing benchmarks.
But remember, it uses chain of thought to think through what it should do at each stage. It monitors the screen, taking screenshots, and then decides. Now, whenever you hear chain of thought, think rapid improvement in the near future. If we get a widely accessible or open source agent this year, say from China, that gets 80%, 90% on computer use benchmarks like this one, then the internet is going to change forever.
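OpenAI haven't published Operator's actual code, but going by the system card, the core loop is roughly: take a screenshot, reason over it, propose one action, check for anything suspicious, and ask the user before anything sensitive. Here's a minimal sketch of that kind of loop; every class and function name here is a hypothetical stand-in, not OpenAI's real API.

```python
# Hypothetical sketch of a screenshot -> reason -> act loop, loosely based on
# how the system card describes Operator. None of these names are OpenAI's
# real API; the browser and model objects are stand-ins you'd have to supply.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                              # e.g. "click", "type", "finish"
    detail: str = ""
    requires_confirmation: bool = False    # e.g. sending an email, paying for something

def looks_like_prompt_injection(page_text: str) -> bool:
    # Stand-in for the separate "prompt injection monitor" layer: a real one
    # would be a trained classifier, not a keyword list.
    suspicious = ["ignore previous instructions", "you are now"]
    return any(phrase in page_text.lower() for phrase in suspicious)

def run_agent(task: str, browser, model, max_steps: int = 30) -> dict:
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()            # observe the current page

        if looks_like_prompt_injection(browser.page_text()):
            return {"status": "halted", "reason": "possible prompt injection"}

        # The model reasons (chain of thought) over the task, its history and
        # the screenshot, then proposes exactly one next action.
        action: Action = model.decide(task, screenshot, history)

        if action.kind == "finish":
            return {"status": "done", "result": action.detail}

        # Sensitive or irreversible steps get bounced back to the human,
        # which is why you can never fully relax while it runs.
        if action.requires_confirmation:
            if input(f"Proceed with {action.kind}: {action.detail}? (y/n) ") != "y":
                return {"status": "stopped_by_user"}

        browser.execute(action)                      # click / type / navigate
        history.append(action)

    # This is also where the stuck-in-a-loop failure mode bites: without a
    # smarter planner, the same failed action just gets retried until the cap.
    return {"status": "gave_up", "reason": "step limit reached"}
```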
Fun fact, by the way, the system prompt lies to the model and encourages the model to lie. It tells the model it has 20 years of experience in using a computer. And it says, if you recognize someone while using the computer, say browsing an image, you should not identify them.
Just say you don't know, even if you do know that person. Personally, I can't see any problem with encouraging models to lie. But anyway, we have to get to the next story. This one is quick: the announcement yesterday of the Perplexity Assistant for Android. I immediately downloaded and tried it out.
Obviously, it's much smarter than something like Siri, and I've been using it to play very particular songs or specific YouTube videos. But there's a slight problem. At the moment, it's not quite smart enough. It doesn't understand commands like play me the latest video from the YouTube channel AI Explained.
Now, for a story that many people think is bigger than agents, as big as it gets, half a trillion dollars into Project Stargate. Except it's kind of not really half a trillion dollars. It's definitely a hundred billion dollars, which was kind of reported on a while back. I even did a video on it.
And the rest is all promises. Mind you, a hundred billion dollars is still a hell of a lot of money. And that will get you a lot of big, beautiful buildings in someone's words or massive data centers. You don't build that, in other words, unless you think AI is going to radically transform society.
By promised size of investment, we're talking about something on the scale of the Manhattan Project as a fraction of GDP. The analogy, of course, of building a nuclear bomb is appropriate, at least in terms of its ambiguity. Because even though Project Stargate, according to the U.S. president, is going to be, quote, great for jobs, and according to Sam Altman, incredible for the country, let's not pretend that for many of the companies investing in this project, one of the first things they would do with an AGI is cut down on labor costs. Sam Altman has himself directly predicted as much many, many times over the years, including fairly recently.
He said things like the cost of labor will go to zero and he expects massive inequality. Now, obviously, the boost to shareholder value will be amazing and there will be other upsides, according to Larry Ellison, one of the key investors in Project Stargate. And what's that, you ask? Well, AI surveillance.
The police will be on their best behavior because we're constantly recording, watching and recording everything that's going on. Citizens will be on their best behavior because we're constantly recording and reporting everything that's going on. And it's unimpeachable. I'm not the only one, by the way, who is a little bit concerned about the downsides of that kind of surveillance.
This was Anthropic CEO Dario Amodei's reaction to the news of Stargate. At the end of this Bloomberg article, he said, I'm very worried about 1984 scenarios or worse. If I were a predicting man, I would say that that would first come to a place like China and only later to the West.
But that's cold comfort. Basically, to spell it out, imagine every text message and email being monitored by a massive AGI LLM for signs of subversion. Of course, there are doubters of Stargate, including, curiously, Microsoft, who weren't there at the announcement. Apparently their executives have been studying whether building such large data centers for OpenAI would even pay off in the long run.
Speaking of Anthropic, the next story is a quick one because it's just a rumor, but it's from a pretty reliable source. Dylan Patel of SemiAnalysis says that Anthropic have a model that is better than O3. If you're not sure what O3 is, I've done a video on it, but it's a model that broke various benchmarks in mathematics and coding and is the smartest model currently known, although it's not publicly released yet.
Google's already got a reasoning model, and Anthropic allegedly has one internally that's really good, better than O3 even, but we'll see when they eventually release it. Now, though, for the story that many of you have been waiting for, which is DeepSeek R1, the model out of China that shocked many in the West.
DeepSeek, for those who don't know, is kind of a side project of a Chinese quant trading firm. And yet they've produced a model that's more or less as good as the best that AGI Labs in the West have come up with. It's not quite as good, in my opinion, but it is massively cheaper to use.
And no one, I don't think, was expecting them to catch up as quickly as they have done. And to give you more of a sense of context, the budget for the entire DeepSeek R1 model and the entire DeepSeek team is likely less than the annual salary of certain CEOs of AGI Labs in the West whose models underperform DeepSeek R1.
At least according to the benchmark figures, which you've got to admit look pretty tall and impressive. And by the way, I don't think these numbers are faked. If this model had been released 100 days ago, it would definitely have counted as the best model in the world. And I don't rule out the possibility that DeepSeek comes out with a model that's better than any other model around this year, especially in domains like mathematics and on certain science benchmarks.
Not likely, but possible. If you're wondering, by the way, it got 30.9% on my own benchmark, a test of basic reasoning capacity. That again would have been the best in the world just a few months ago. Now, I am going to get to the detail of how it's made in a moment, but first some wider comments on what it means.
First, people keep calling it open source, but it's not fully open source. They didn't release the data set behind the model. They did say that DeepSeek R1 was using the base model DeepSeek V3, which was trained on around 15 trillion tokens, but they don't say what those tokens were.
In other words, we don't really know about the training data, so it's not fully open source. Back to DeepSeek R1 though, and you might be wondering, didn't the US impose sanctions on China so they couldn't use advanced chips like the B100, for example, from Nvidia? Yes, they did, but that might have had the unintended side effect of forcing Chinese AI companies to be more innovative with what they've got.
In other words, there is a chance that those chip sanctions have actually pushed China to being competitive with the West in AI. The next comment is on the sheer acceleration this will unleash, because it is mostly open source. Anyone, including rival companies like Meta, can copy what DeepSeek have done.
Indeed, according to one possible leak, R1 massively outperforms Llama 4, which isn't yet released from Meta, and so they're just dropping everything and copying what DeepSeek have done. Of course, this is unconfirmed, but the principle is still the same. It's almost like DeepSeek R1 is now the minimum performance bar, because anyone can just copy it.
That's of course bad news and good news for safety, depending on how you look at it. On the one hand, governance and control of AI look set to be borderline hopeless. One very respected figure, formerly of Google DeepMind and OpenAI, said when asked what is the plan after DeepSeek for AGI safety, he said there is no plan.
But on the other hand, some have welcomed the fact that safety researchers can now inspect the chains of thought behind DeepSeek R1 in a way they couldn't have done with O1 or O3. That's of course great for safety testers like Apollo Research, three of whom I interviewed just a couple of days ago for AI Insiders on my Patreon.
And if you're wondering why studying R1 might be important, it's because the model emits chains of thought before answering, like the O-series of models from OpenAI. We can only see summaries of those thoughts for the O-series, but with R1 we see everything. So we can better study when the models might be scheming, which is what we covered in this interview.
All of which gets us to how DeepSeek R1 was trained in the first place. And summarizing this 22-page paper full of research is going to be difficult, but I'm going to try and do it in one or two paragraphs. Of course, this will be oversimplifying, but here we go.
So start with the base model, DeepSeek V3, which they had already made. Then let's kick things off with some lovely, long chain of thought examples to give the model a cold start. Now you can skip that stage and go straight to reinforcement learning, but they found the training to be a bit unstable, unpredictable.
Anyway, having fine-tuned the base model on that cold start data, it's time to move to the next stage, reinforcement learning. We're going to test the model repeatedly in verifiable domains like mathematics and code, rewarding it whenever it gets a correct outcome. Not correct individual steps, and we'll get to that later, but the correct outcome.
Also, we need to throw in some fine-tuning on correct outputs that follow the right format in the appropriate language. The format being always thinking first in tags, and then answering afterwards. Then rinse and repeat this RL and fine-tuning, this time with some "non-reasoning data". Let's bring in some wider domains like factuality and "self-cognition".
All of these correct outputs and fine-tuning data that we're gathering, by the way, can of course be used for distilling smaller, smarter models. Anyway, Bob's your uncle, do all of that, and you get DeepSeek R1. Of course, I'm skipping lots; if it were that easy then every company would have done it, but that's the basic idea.
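To make that recipe a bit more concrete, here's a minimal sketch of the kind of reward signal the paper describes: grade only the final answer in verifiable domains like mathematics, plus a separate reward for sticking to the think-then-answer format. The tag names, weights and helper functions here are my own illustrative assumptions, not DeepSeek's released code, and the actual RL update (GRPO in the paper) is far more involved.

```python
# Illustrative sketch of an outcome-plus-format reward for a maths prompt.
# Tag names and weights are my assumptions, not DeepSeek's released code.
import re

def format_reward(completion: str) -> float:
    """Reward the think-first, answer-after structure."""
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Grade only the final answer, never the individual reasoning steps.
    For code tasks, this check would be replaced by running unit tests."""
    found = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip() == reference_answer.strip() else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    # The RL algorithm (GRPO in the paper) samples many completions per prompt
    # and nudges the model towards the ones that score higher on this signal.
    return outcome_reward(completion, reference_answer) + 0.5 * format_reward(completion)

sample = "<think>Two plus two is four.</think><answer>4</answer>"
print(total_reward(sample, "4"))   # 1.5: correct outcome plus correct format
```

And the correct, well-formatted outputs you collect along the way are exactly the sort of data you'd use for distilling those smaller models I just mentioned.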
And did you notice how synthetic that process is? Get the model to generate chains of thought, and then reinforce the model on those outputs that led to a correct answer. They did not mandate reflective reasoning, or promote particular problem-solving strategies. They wanted to accurately observe the model's natural progression during the RL process.
It's the bitter lesson in action, don't hard-code human rules, let the models discover them for themselves. One of the things that the model teaches itself, by the way, is to output longer and longer responses to get better results. Notice the average length of response going up and up and up the more it's trained.
Kind of makes sense: to solve harder problems, you need longer outputs. The models themselves learned that they needed to self-correct; that's not something inputted by researchers. So that's why the model constantly does things like say, "Wait, wait, wait" in the middle of responses, and then changes its mind. Now what humans have learned is how to "jailbreak" the model, or get it to do whatever you want it to do.
And if that piques your interest, I've got an arena for you. That's the Gray Swan Arena, which you can enter yourself, and they are the sponsors of today's video. It's all about testing whether you can jailbreak these models, including the very latest ones. By the way, you don't have to be an AI researcher, you could just be a creative writer or hacker, and there are monetary rewards.
Sometimes you're even testing models that aren't out yet, and there is one competition that is live as of today. Pretty much every unreleased model can be jailbroken, and there are also leaderboards for those who do it best. As ever, links in the description. The next story is of great interest to me personally, because it pertains to the type of verifier they use.
This part of the paper updated my belief about how even O3 is trained. To get the insane results in mathematics that O3 did, I thought every single reasoning step had to be verified. Otherwise, just one miscalculation in an entire chain of thought could undo all the good work. That's called process reward modeling, and that could still be how O1 and O3 are trained, but probably not.
Instead, it looks more likely now that it's simple outcome reward modeling. That's the approach that underperformed in the original "Let's Verify Step by Step" paper. I should say, many famous researchers, including François Chollet, still believe that the O series performs a kind of search or verification at every step. But the DeepSeek team said that step-by-step verification adds additional computational overhead.
It's also susceptible, apparently, to reward hacking, where the base model just gets good at convincing the verifier that it's passed. In short, it seems simpler just to grade the final answer, not every single reasoning step.
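To spell out that distinction, here's a toy illustration, entirely my own rather than code from either lab: outcome reward modeling checks one thing at the end, while process reward modeling needs a learned verifier to score every intermediate step.

```python
# Toy illustration (mine, not either lab's code) of outcome reward modeling
# versus process reward modeling.

def outcome_reward(final_answer: str, reference: str) -> float:
    # ORM: one cheap check at the end; the intermediate steps are never graded,
    # so a flawed chain of thought that lands on the right answer scores full marks.
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(steps: list[str], step_verifier) -> float:
    # PRM: every step is scored by a learned verifier, which costs extra compute
    # and, per the DeepSeek team, invites reward hacking of that verifier.
    scores = [step_verifier(step) for step in steps]
    return sum(scores) / len(scores) if scores else 0.0

# Example: a chain of thought with an arithmetic slip mid-way but a correct final answer.
chain = ["12 * 11 = 132", "132 + 10 = 141", "so the total is 142"]
print(outcome_reward("142", reference="142"))   # 1.0: ORM never notices the bad step
```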
And here's another hint that it's a purer form of RL than I initially suspected, this time from Sébastien Bubeck of OpenAI. It's really, everything is kind of emergent. Nothing is hard-coded. It's anything that you see, you know, out there with the reasoning, nothing has been done to say to the model, "hey, you should maybe, you know, verify your solution. You should backtrack, you should X, Y, Z." No tactic was given to the model.
Everything is emergent. Everything is learned through reinforcement learning. This is insane, insanity. At this point in the video, I want to point out a kind of whitewashing done by OpenAI that I don't think anyone else has noticed. The O series has been celebrated by OpenAI for its robustness, for example, just two days ago in this paper.
Great news for safety, apparently, that the model can think for longer before replying. But I'm, of course, old enough to remember when it was supposed to be process reward modeling that was good for safety. When OpenAI boasted that rewarding the thought process itself rather than the outcome is an encouraging sign for alignment.
This was echoed by Sam Altman because it was thought that we could review each step in the process rather than just look at the overall outcome. If we just rewarded the outcome, which it seems like we are now doing, then the models would get up to all sorts of shenanigans on the way toward getting the outcome.
Instead, if process supervision worked best, where we could scrutinize and optimize each individual step, we'd have better scrutiny of the overall process. My question is, if optimizing each individual step in process supervision is a positive sign for alignment, what does it say now that we're rewarding outcomes? Shouldn't there be a new blog post saying that outcome-based supervision has an important alignment downside?
No, it seems like we only get the blog post if it seems good. Give up on your dreams of producing a chain of thought that is endorsed by humans. This is the kind of chain of thought summary that I get for an English-language request: a chain of thought in Spanish, which makes it a bit harder for me to endorse.
I've also seen many chains of thought, of course, in Mandarin. This kind of language mixing, by the way, was foreseen by people like Andrej Karpathy, who said, "You can tell that reinforcement learning is done properly when the models cease to speak English in their chain of thought." Why would English, or indeed ultimately any human language, be the optimum way to do step-by-step reasoning?
What happens if a model proposes a solution to climate change and we inspect their chain of thought and it's just random characters? It's a bit harder to trust what's going on. Indeed, Demis Hassabis, CEO of Google DeepMind, in an interview published yesterday, warned that he worried that models will become "deceptive" and "underperform" on tests of their malicious capability.
Pretend, in other words, not to be able to produce a bioweapon. Also, I had noticed that Demis Hassabis has changed his timelines in recent months. He used to say that he expected AGI, or superintelligence, within a decade; if you guys have been following my channel, you'll know that he gave deadlines like 2034. Well, check this out.
And I think one thing that's clearly missing, and I always, always had as a benchmark for AGI, was the ability for these systems to invent their own hypotheses or conjectures about science, not just prove existing ones. So, of course, that's extremely useful already, to prove an existing maths conjecture or something like that, or play a game of Go to a world champion level.
But could a system invent Go? Could it come up with a new Riemann hypothesis? Or could it come up with relativity back in the days that Einstein did it, with the information that he had? And I think today's systems are still pretty far away from having that kind of creative, inventive capability.
Okay, so a couple of years away till we hit AGI. I think, you know, I would say probably like three to five years away. So if someone were to declare that they've reached AGI in 2025, probably marketing. I think so. Almost every AI CEO, in other words, seems to be converging on this one to five year timeline.
Why not this year? Well, let me try to give you a strange anecdotal example. Models like DeepSeek R1 have weird, quirky reasoning flaws. For the purposes of testing this coding side project that I'm doing, I asked DeepSeek R1 to come up with this multiple choice quiz. It had to meet certain parameters and it failed to meet them, but that wasn't the real issue.
You notice a slight flaw with the multiple choice answers it produced for these 25 questions. Let's just say that they are somewhat biased towards answers B and C. Here's my bigger question though. Will these remaining reasoning blind spots, you could call them, be filled as a by-product of continued scaling of, say, RL?
Or will they need to be patched one by one? If the former is the case, we could have AGI in those very short timelines that the AI CEOs publicly predict now. If the latter scenario is the case, that they have to be patched one by one, there could be AGI denialists in 2030 and beyond.
Where better to end the video, then, than on "Humanity's Last Exam"? I don't regard this as a test for AGI, but it is an interesting new benchmark. I would say the title is a little bit misleading because the creators of the benchmark are working on another challenging benchmark, which apparently takes groups of humans days to complete, so is even harder.
Some people have focused on the fact that DeepSeek R1 performs best, getting 9.4% on this hardest of the hard benchmarks. The truth is, it's the way they created the benchmark. They kept testing models like O1 until they found questions that O1 struggled on. Because DeepSeek R1 wasn't out yet, they couldn't do that kind of iteration on it, so it's not fully accurate to say that it performs best because it is the smartest model.
As far as I can see, it heavily tests obscure knowledge on things like minute details of hummingbird anatomy. Now, I will say that a model getting, say, 90% on this benchmark would be amazing and incredible, and I would use that model, but I don't think it would be quite as impactful as a model getting, say, 90% on an agency benchmark, as I touched on at the beginning of this video.
An agent properly being able to do remote tasks would transform the world economy. The New York Times reported that the original name for this particular benchmark was Humanity's Last Stand, so I am glad they changed the title. Let's hope it's not, because I could see this particular benchmark being crushed by the end of next year, if not even this year.
Man, that was a lot to cover in one video, so thank you so much for making it to the end. As I say, I feel less sorry for myself and more sorry for the public who have to wade through countless random headlines to get to what's actually happening. I've tried to do my best in this video, but let me know what you think.
As ever, thank you so much for watching and have a wonderful day.