Nothing Much Happens in AI, Then Everything Does All At Once

Chapters
0:00 Introduction
0:54 OpenAI Operator
4:53 Perplexity Assistant
5:15 Stargate
7:51 Better than o3?
8:25 DeepSeek R1 Analysis
12:12 Training Secrets
15:19 No More Process Rewarding?
19:01 Hassabis Timeline Accelerates
21:22 Humanity’s Last Exam
00:00:00.000 |
On days like today, I really do feel sorry for the public, for you guys essentially trying to keep up with AI news and the kind of questions you might have. 00:00:09.120 |
Did OpenAI just automate my job with Operator? 00:00:13.120 |
Who knows, you're probably thinking, it's $200 and it's really hard to get past those clickbait headlines. 00:00:19.280 |
Did the US government just invest half a trillion dollars into a Stargate? What the hell is that? 00:00:25.240 |
I heard China just caught up to the West in AI with something called DeepSeek and why are people talking about Humanity's last exam? 00:00:32.600 |
So to try to help a little bit, I'm going to cover the 9 developments of the last 100 hours, honestly, if not exhaustively. 00:00:41.200 |
Yes, of course, I read the DeepSeek paper in full and have spent hours testing the OpenAI Operator, as well as the Perplexity Assistant and all the rest. 00:00:50.600 |
In case you're wondering, yes, I did edit this Perplexity response. 00:00:54.080 |
First up in this listicle video is, of course, OpenAI's Operator and it's kind of decent, it's okay. 00:01:01.880 |
You have to use a VPN if you're not in the US and honestly, I wouldn't do that for the functionality. 00:01:07.480 |
I would do it if you want to kind of get a sense of where agents are going. 00:01:11.080 |
Just straight up front, though, I can tell you that it's nowhere close to automating any kind of job for two major reasons. 00:01:17.680 |
One is it often gets stuck in these kind of loops where it attempts the same kind of basic failed plan again and again. 00:01:25.200 |
It's not essentially smart enough to get itself out of these kind of basic loops. 00:01:30.080 |
The second big reason is because of OpenAI's own restrictions on what the model can do, understandable restrictions. 00:01:37.480 |
I tried about 20 random tasks and the truth is you can never fully relax. 00:01:41.880 |
You always have to keep going back to the system and saying, yes, proceed. 00:01:46.200 |
And no, you can't manually hard override that. 00:01:49.200 |
You can't put it in the prompt not to ask permission. 00:01:51.800 |
Then, of course, many sites have captchas that you have to manually take over and input. 00:01:57.200 |
I'm pretty certain if you iterate again and again on a prompt, 00:02:00.480 |
you can develop a workflow that you could save on the top right and share and maybe save a little bit of time for certain tasks. 00:02:08.200 |
At the moment, though, if I'm being honest, it is a bit of a stretch to say that it's useful. 00:02:12.600 |
But if we step back, you can see where all of this is going. 00:02:16.080 |
This operator has a ton of safeguards that might slow it down, 00:02:19.240 |
but people will just migrate to ones that don't have those safeguards. 00:02:22.800 |
Ones where downloading files is easier and captchas are done for you. 00:02:27.720 |
That'll be great for usability, but not so great for the dead Internet theory. 00:02:34.720 |
I read the system card in full for OpenAI's operator and it was quite revealing. 00:02:39.160 |
The operator is known to make "irreversible mistakes" like sending an email to the wrong recipient 00:02:45.040 |
or having an incorrectly dated reminder for the user to take their medication. 00:02:50.240 |
And yes, they did reduce those mistakes, but not eliminate them. 00:02:53.760 |
Also, remember those confirmations that the model asks for before proceeding to the next step? 00:03:01.400 |
Sometimes the operator just goes ahead and does it, which is a good thing or a bad thing, depending on your perspective. 00:03:07.000 |
You'll be quite glad, I guess, to know how cautious it is when it's asked to do things like make banking transactions. 00:03:15.680 |
Then you might be wondering, what about if the operator navigates to a malicious site that's trying to trick the operator? 00:03:22.640 |
Well, there was one time where it did exactly that and didn't notice. 00:03:28.400 |
OpenAI are aware of this, though, and have an extra layer of safety on top called a prompt injection monitor, 00:03:33.800 |
checking to see if sites are trying to trick the operator. 00:03:36.320 |
And it did catch this concerning example, but there's one problem. 00:03:44.360 |
They, of course, commit to rapidly updating it in response to newly discovered attacks. 00:03:49.120 |
But there is a slim chance things could go wrong at every layer. 00:03:53.360 |
In my last video, if you remember, I gave you early leaked results on its performance in various computer use benchmarks and web browsing benchmarks. 00:04:02.040 |
But remember, it uses chain of thought to think through what it should do at each stage. 00:04:06.520 |
It monitors the screen, taking screenshots, and then decides. 00:04:09.680 |
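To picture that loop, here's a minimal, hypothetical sketch in Python of a screenshot, reason, act cycle with the confirmation step included. Every name in it is my own stand-in for illustration, not OpenAI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str    # e.g. "click", "type", or "ask_user"
    detail: str  # coordinates, text to type, or the question to put to the user

def take_screenshot() -> bytes:
    # Stand-in: a real agent would capture the current browser viewport here.
    return b""

def decide_next_action(screenshot: bytes, goal: str, history: list[str]) -> Action:
    # Stand-in: a real agent would send the screenshot, goal and history to a
    # vision model that reasons step by step before proposing the next action.
    return Action(kind="ask_user", detail="About to submit the form.")

def run_agent(goal: str, max_steps: int = 20) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        action = decide_next_action(take_screenshot(), goal, history)
        if action.kind == "ask_user":
            # The "yes, proceed" confirmations described above: the agent pauses
            # and will not continue until the user explicitly approves.
            if input(f"{action.detail} Proceed? (y/n) ").strip().lower() != "y":
                return
        # ...execute the click or keystrokes here...
        history.append(f"{action.kind}: {action.detail}")

# run_agent("Book a table for two on Friday evening")
```

The point of the sketch is just the shape of the loop: observe, reason, act, and keep stopping to ask, which is exactly why you can never fully relax while it runs.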
Now, whenever you hear chain of thought, think rapid improvement in the near future. 00:04:14.320 |
If we get a widely accessible or open source agent this year, say from China, that gets 80%, 90% on computer use benchmarks like this one, 00:04:23.360 |
then the internet is going to change forever. 00:04:25.680 |
Fun fact, by the way, the system prompt lies to the model and encourages the model to lie. 00:04:30.560 |
It tells the model it has 20 years of experience in using a computer. 00:04:35.080 |
And it says, if you recognize someone while using the computer, say browsing an image, you should not identify them. 00:04:41.160 |
Just say you don't know even if you do know that person. 00:04:44.760 |
Personally, I can't see any problem with encouraging models to lie. 00:04:48.000 |
But anyway, we have to get to the next story. 00:04:50.320 |
The story is a quick one, which is the announcement yesterday of the Perplexity Assistant for Android. 00:04:57.880 |
Obviously, it's much smarter than something like Siri, and I've been using it to play very particular songs or specific YouTube videos. 00:05:08.840 |
It doesn't understand commands like "play me the latest video from the YouTube channel AI Explained." 00:05:14.040 |
Now, for a story that many people think is bigger than agents, as big as it gets, half a trillion dollars into Project Stargate. 00:05:22.800 |
Except it's kind of not really half a trillion dollars. 00:05:26.200 |
It's definitely a hundred billion dollars, which was kind of reported on a while back. 00:05:33.920 |
Mind you, a hundred billion dollars is still a hell of a lot of money. 00:05:37.240 |
And that will get you a lot of big, beautiful buildings in someone's words or massive data centers. 00:05:43.000 |
You don't build that, in other words, unless you think AI is going to radically transform society. 00:05:48.200 |
By promised size of investment, we're talking about something on the scale of the Manhattan Project as a fraction of GDP. 00:05:55.640 |
The analogy, of course, of building a nuclear bomb is appropriate, at least in terms of its ambiguity. 00:06:01.480 |
Because even though Project Stargate, according to the U.S. president, is going to be, quote, great for jobs. 00:06:06.680 |
And according to Sam Altman, incredible for the country. 00:06:09.680 |
Let's not pretend that for many of the companies investing in this project, one of the first things they would do with an AGI is cut down on labor costs. 00:06:18.080 |
Sam Altman has himself directly predicted as much many, many times over the years, including fairly recently. 00:06:24.360 |
He said things like the cost of labor will go to zero and he expects massive inequality. 00:06:30.120 |
Now, obviously, the boost to shareholder value will be amazing and there will be other upsides, according to Larry Ellison, one of the key investors in Project Stargate. 00:06:40.880 |
And what's that, you ask? Well, AI surveillance. 00:06:43.720 |
The police will be on their best behavior because we're constantly recording, watching and recording everything that's going on. 00:06:52.160 |
Citizens will be on their best behavior because we're constantly recording and reporting everything that's going on. 00:07:02.800 |
I'm not the only one, by the way, who is a little bit concerned about the downsides of that kind of surveillance. 00:07:08.920 |
This was Anthropic CEO Dario Amodei's reaction to the news of Stargate. 00:07:13.640 |
At the end of this Bloomberg article, he said, I'm very worried about 1984 scenarios or worse. 00:07:19.680 |
If I were a predicting man, I would say that that would first come to a place like China and only later to the West. 00:07:26.840 |
But that's cold comfort. Basically, to spell it out, imagine every text message and email being monitored by a massive AGI LLM for signs of subversion. 00:07:37.120 |
Of course, there are doubters of Stargate, including, curiously, Microsoft, who weren't there at the announcement. 00:07:43.600 |
Apparently their executives have been studying whether building such large data centers for OpenAI would even pay off in the long run. 00:07:50.960 |
Speaking of Anthropic, the next story is a quick one because it's just a rumor, but it's from a pretty reliable source. 00:07:57.520 |
Dylan Patel of SemiAnalysis says that Anthropic have a model that is better than O3. 00:08:03.920 |
If you're not sure what O3 is, I've done a video on it, but it's a model that broke various benchmarks in mathematics and coding 00:08:10.520 |
and is the smartest model that's known currently, although it's not publicly released yet. 00:08:15.120 |
Google's already got a reasoning model, and Anthropic allegedly has one internally that's really good, better than O3 even, 00:08:20.720 |
but, you know, we'll see when they eventually release it. 00:08:24.080 |
Now, though, for the story that many of you have been waiting for, which is DeepSeek R1, the model out of China that shocked many in the West. 00:08:32.600 |
DeepSeek, for those who don't know, is kind of a side project of a Chinese quant trading firm. 00:08:37.960 |
And yet they've produced a model that's more or less as good as the best that AGI Labs in the West have come up with. 00:08:44.280 |
It's not quite as good, in my opinion, but it is massively cheaper to use. 00:08:48.880 |
And no one, I don't think, was expecting them to catch up as quickly as they have done. 00:08:53.240 |
And to give you more of a sense of context, the budget for the entire DeepSeek R1 model and the entire DeepSeek team 00:09:01.120 |
is likely less than the annual salary of certain CEOs of AGI Labs in the West whose models underperform DeepSeek R1. 00:09:09.320 |
At least according to the benchmark figures, which you've got to admit look pretty tall and impressive. 00:09:16.160 |
And by the way, I don't think these numbers are faked. 00:09:18.400 |
If this model had been released 100 days ago, it would definitely have counted as the best model in the world. 00:09:24.360 |
And I don't rule out the possibility that DeepSeek comes out with a model that's better than any other model around this year, 00:09:31.360 |
especially in domains like mathematics and on certain science benchmarks. 00:09:37.280 |
If you're wondering, by the way, it got 30.9% on my own benchmark, a test of basic reasoning capacity. 00:09:44.280 |
That again would have been the best in the world just a few months ago. 00:09:47.800 |
Now, I am going to get to the detail of how it's made in a moment, but first some wider comments on what it means. 00:09:53.880 |
First, people keep calling it open source, but it's not fully open source. 00:09:58.040 |
They didn't release the data set behind the model. 00:10:00.720 |
They did say that DeepSeek R1 was using the base model DeepSeek V3, which was trained on around 15 trillion tokens, but not much about what was actually in those tokens. 00:10:11.160 |
In other words, we don't really know about the training data, so it's not fully open source. 00:10:15.480 |
Back to DeepSeek R1 though, and you might be wondering, didn't the US impose sanctions on China 00:10:21.240 |
so they couldn't use advanced chips like the B100, for example, from Nvidia? 00:10:25.960 |
Yes, they did, but that might have had the unintended side effect of forcing Chinese AI companies to be more innovative with what they've got. 00:10:33.080 |
In other words, there is a chance that those chip sanctions actually have brought China to being competitive with the West in AI. 00:10:40.840 |
The next comment is on the sheer acceleration this will unleash because it is mostly open sourced. 00:10:47.240 |
Anyone, including rival companies like Meta, can copy what DeepSeek have done. 00:10:51.960 |
Indeed, according to one possible leak, R1 massively outperforms Llama 4, which Meta hasn't yet released, 00:10:59.880 |
and so they're just dropping everything and copying what DeepSeek have done. 00:11:03.640 |
Of course, this is unconfirmed, but the principle is still the same. 00:11:06.440 |
It's almost like DeepSeek R1 is now the minimum performance because anyone can just copy it. 00:11:11.240 |
That's of course bad news and good news for safety, depending on how you look at it. 00:11:15.000 |
On the one hand, governance and control of AI look set to be borderline hopeless. 00:11:20.440 |
One very respected figure, formerly of Google DeepMind and OpenAI, 00:11:24.600 |
when asked what the plan is for AGI safety after DeepSeek, said there is no plan. 00:11:30.760 |
But on the other hand, some have welcomed the fact that safety researchers can now inspect 00:11:36.120 |
the chains of thought behind DeepSeek R1 in a way they couldn't have done with O1 or O3. 00:11:42.200 |
That's of course great for safety testers like Apollo Research, 00:11:45.800 |
three of whom I interviewed just a couple of days ago for AI Insiders on my Patreon. 00:11:50.680 |
And if you're wondering why studying R1 might be important, 00:11:53.480 |
it's because the model emits chains of thought before answering. 00:11:58.920 |
We can only see summaries of those thoughts for the O-series, but with R1 we see everything. 00:12:03.880 |
So we can better study when the models might be scheming, which is what we covered in this interview. 00:12:09.720 |
All of which gets us to how DeepSeek R1 was trained in the first place. 00:12:15.320 |
And summarizing this 22-page paper full of research is going to be difficult, 00:12:20.200 |
but I'm going to try and do it in one or two paragraphs. 00:12:22.920 |
Of course, this will be oversimplifying, but here we go. 00:12:25.720 |
So start with the base model, DeepSeek V3, which they had already made. 00:12:33.320 |
Then fine-tune it on a small set of long chain of thought examples to give the model a cold start. 00:12:36.840 |
Now you can skip that stage and go straight to reinforcement learning, 00:12:40.760 |
but they found the training to be a bit unstable, unpredictable. 00:12:44.280 |
Anyway, having fine-tuned the base model on that cold start data, 00:12:47.640 |
it's time to move to the next stage, reinforcement learning. 00:12:50.360 |
We're going to test the model repeatedly in verifiable domains like mathematics and code, 00:12:56.360 |
rewarding it whenever it gets a correct outcome. 00:12:59.240 |
Not correct individual steps, and we'll get to that later, but the correct outcome. 00:13:03.640 |
Also, we need to throw in some fine-tuning on correct outputs 00:13:07.240 |
that follow the right format in the appropriate language. 00:13:10.120 |
The format being always thinking first in tags, and then answering afterwards. 00:13:14.520 |
Then rinse and repeat this RL and fine-tuning, this time with some "non-reasoning data". 00:13:20.920 |
Let's bring in some wider domains like factuality and "self-cognition". 00:13:26.040 |
All of these correct outputs and fine-tuning data that we're gathering, by the way, 00:13:29.560 |
can of course be used for distilling smaller, smarter models. 00:13:33.240 |
Anyway, Bob's your uncle, do all of that, and you get DeepSeek R1. 00:13:36.600 |
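To make the reward concrete: based on the paper's description, the signal at this stage is rule-based rather than a learned verifier, one part for landing on the right final answer and one part for sticking to the think-then-answer format. Here's a rough Python sketch of what such a reward could look like; the exact answer-extraction convention (a \boxed{} wrapper) is my assumption for illustration.

```python
import re

def format_reward(output: str) -> float:
    # Reward the required format: reasoning inside <think> tags, answer afterwards.
    return 1.0 if re.fullmatch(r"(?s)\s*<think>.+?</think>.+", output) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    # Outcome-only reward: grade the final answer, never the individual steps.
    match = re.search(r"\\boxed\{(.+?)\}", output)
    final_answer = match.group(1).strip() if match else ""
    return 1.0 if final_answer == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    return accuracy_reward(output, ground_truth) + format_reward(output)

# A response that reasons inside the tags and lands on the right answer scores 2.0.
sample = "<think>48 divided by 6 is 8.</think> The answer is \\boxed{8}."
print(total_reward(sample, "8"))  # 2.0
```

The real pipeline has more moving parts than this, things like checking code answers against a compiler, but the key design choice is visible: only the end result and the format get graded.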
Of course, I'm skipping lots; if it were that easy, then every company would have done it. 00:13:42.200 |
And did you notice how synthetic that process is? 00:13:48.040 |
Let the model generate its own outputs, and then reinforce the model on those outputs that led to a correct answer. 00:13:55.160 |
The researchers didn't hand-craft reasoning examples or promote particular problem-solving strategies. 00:13:58.120 |
They wanted to accurately observe the model's natural progression during the RL process. 00:14:03.800 |
It's the bitter lesson in action: don't hard-code human rules, just let learning and compute do the work. 00:14:10.120 |
One of the things that the model teaches itself, by the way, 00:14:13.240 |
is to output longer and longer responses to get better results. 00:14:17.000 |
Notice the average length of response going up and up and up the more it's trained. 00:14:21.960 |
Kind of makes sense, to solve harder problems you need longer outputs. 00:14:25.800 |
The models themselves learned that they needed to self-correct, 00:14:29.400 |
that's not something inputted by researchers. 00:14:32.360 |
So that's why the model constantly does things like say, 00:14:34.920 |
"Wait, wait, wait" in the middle of responses, and then change its mind. 00:14:37.880 |
Now what humans have learned is how to "jailbreak" the model. 00:14:44.040 |
And if that piques your interest, I've got an arena for you. 00:14:47.960 |
That's the Gray Swan Arena, which you can enter yourself. 00:14:53.800 |
It's all about testing whether you can jailbreak these models. 00:14:58.920 |
By the way, you don't have to be an AI researcher, 00:15:00.920 |
you could just be a creative writer or a hacker. 00:15:05.160 |
Sometimes you're even testing models that aren't out yet, 00:15:07.560 |
and there is one competition that is live as of today. 00:15:10.520 |
Pretty much every unreleased model can be jailbroken, 00:15:13.960 |
and there are also leaderboards for those who do it best. 00:15:18.760 |
The next story is of great interest to me personally, 00:15:21.400 |
because it pertains to the type of verifier they use. 00:15:24.520 |
This part of the paper updated my belief about how even O3 is trained. 00:15:29.400 |
To get the insane results in mathematics that O3 did, 00:15:32.520 |
I thought every single reasoning step had to be verified. 00:15:36.040 |
Otherwise, just one miscalculation in an entire chain of thought could throw off the final answer. 00:15:42.920 |
And that could still be how O1 and O3 are trained, we don't know for sure. 00:15:47.720 |
Instead, it looks more likely now that it's simple outcome reward modeling. 00:15:52.360 |
That's the approach that underperformed in OpenAI's original research on process supervision. 00:15:56.680 |
I should say, many famous researchers, including François Chollet, 00:16:00.200 |
still believe that the O series performs a kind of search at every step. 00:16:06.280 |
But the DeepSeek team said that step-by-step verification was hard to make work at scale. 00:16:12.760 |
It's also susceptible, apparently, to reward hacking, 00:16:15.800 |
where the base model just gets good at convincing the verifier that it's passed. 00:16:20.440 |
In short, it seems simpler just to grade the final answer. 00:16:26.280 |
And here's another hint that it's a purer form of RL than I initially suspected, 00:16:30.920 |
this time from Sébastien Bubeck, now of OpenAI. 00:16:38.520 |
It's anything that you see, you know, out there with the reasoning, it's not because we told the model, 00:16:44.360 |
"hey, you should maybe, you know, verify your solution." 00:16:54.280 |
Everything is learned through reinforcement learning. 00:17:01.240 |
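To spell out the difference being debated here: outcome reward modeling grades only the end result, while process reward modeling scores each intermediate step. A toy illustration, with deliberately trivial stand-in graders rather than real reward models:

```python
# A chain of thought with a wrong intermediate step that still lands on the right answer.
steps = ["7 * 6 = 43",          # incorrect step
         "43 - 1 = 42",
         "Final answer: 42"]
ground_truth = "42"

def outcome_reward(steps: list[str], ground_truth: str) -> float:
    # Grade only the final line: sloppy reasoning can still earn full marks.
    return 1.0 if steps[-1].endswith(ground_truth) else 0.0

def process_rewards(steps: list[str]) -> list[float]:
    # Grade every step, here with a trivial arithmetic checker as the "verifier".
    scores = []
    for step in steps:
        if "=" in step:
            lhs, rhs = step.split("=")
            scores.append(1.0 if eval(lhs) == float(rhs) else 0.0)
        else:
            scores.append(1.0)  # non-arithmetic lines pass by default
    return scores

print(outcome_reward(steps, ground_truth))  # 1.0 -- the outcome grader is satisfied
print(process_rewards(steps))               # [0.0, 1.0, 1.0] -- the step grader flags the slip
```

The appeal of the process approach for alignment was exactly that step-level visibility; the DeepSeek result suggests the simpler outcome signal is what actually scales.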
I want to point out a kind of whitewashing done by OpenAI. 00:17:06.280 |
The O series has been celebrated by OpenAI for its robustness, 00:17:10.360 |
for example, just two days ago in this paper. 00:17:14.840 |
The robustness, they say, comes from the fact that the model can think for longer before replying, 00:17:19.880 |
when it was supposed to be process reward modeling that was good for safety. 00:17:24.680 |
Remember when OpenAI boasted that rewarding the thought process itself 00:17:29.080 |
rather than the outcome is an encouraging sign for alignment? 00:17:33.400 |
This was echoed by Sam Altman because it was thought 00:17:36.440 |
that we could review each step in the process 00:17:39.480 |
rather than just look at the overall outcome. 00:17:41.720 |
If we just rewarded the outcome, which it seems like we are now doing, 00:17:46.200 |
then the models would get up to all sorts of shenanigans to reach that outcome. 00:17:53.480 |
Whereas with process supervision, where we could scrutinize and optimize each individual step, 00:17:56.600 |
we'd have better scrutiny of the overall process. 00:17:59.480 |
My question is, if optimizing each individual step 00:18:03.160 |
in process supervision is a positive sign for alignment, 00:18:06.600 |
what does it say now that we're rewarding outcomes? 00:18:09.640 |
Shouldn't there be a new blog post saying that outcome-based supervision is a worrying sign for alignment? 00:18:15.480 |
No, it seems like we only get the blog post if it seems good. 00:18:18.360 |
Give up on your dreams of producing a chain of thought that is endorsed by humans. 00:18:28.520 |
Take, for example, a chain of thought in Spanish, which makes it a bit harder for me to endorse. 00:18:32.280 |
I've also seen many chains of thought, of course, in Mandarin. 00:18:37.160 |
This was, of course, foreseen by people like Andrej Karpathy, who said, 00:18:40.760 |
"You can tell that reinforcement learning is done properly 00:18:42.920 |
when the models cease to speak English in their chain of thought." 00:18:45.800 |
Why would English, or indeed ultimately any human language, 00:18:48.840 |
be the optimum way to do step-by-step reasoning? 00:18:52.200 |
What happens if a model proposes a solution to climate change 00:18:55.720 |
and we inspect its chain of thought and it's just random characters? 00:19:01.160 |
Indeed, Demis Hassabis, CEO of Google DeepMind, 00:19:03.480 |
in an interview published yesterday, warned that he worried that models 00:19:07.560 |
will become "deceptive" and "underperform" on tests of their malicious capability. 00:19:13.080 |
Pretend, in other words, not to be able to produce a bioweapon. 00:19:16.680 |
Also, I had noticed Demis Hassabis changed his timelines in recent months, 00:19:21.400 |
saying that he expected AGI, or superintelligence, within a decade. 00:19:27.240 |
If you've seen his earlier interviews, you'll know that he gave deadlines like 2034. 00:19:31.320 |
And I think one thing that's clearly missing, 00:19:33.000 |
and I always, always had as a benchmark for AGI, 00:19:35.480 |
was the ability for these systems to invent their own hypotheses 00:19:39.960 |
or conjectures about science, not just prove existing ones. 00:19:43.080 |
So, of course, that's extremely useful already, 00:19:44.760 |
to prove an existing maths conjecture or something like that, 00:19:47.800 |
or play a game of Go to a world champion level. 00:19:52.680 |
Could it come up with a new Riemann hypothesis? 00:19:55.240 |
Or could it come up with relativity back in the days that Einstein did it, with the information that he had? 00:20:01.240 |
And I think today's systems are still pretty far away 00:20:04.360 |
from having that kind of creative, inventive capability. 00:20:08.200 |
Okay, so a couple of years away till we hit AGI? 00:20:10.520 |
I think, you know, I would say probably like three to five years away. 00:20:14.440 |
So if someone were to declare that they've reached AGI in 2025, that would be well ahead of his timeline. 00:20:22.440 |
Still, almost everyone seems to be converging on this one to five year timeline. 00:20:27.240 |
As for what's still missing, well, let me try to give you a strange anecdotal example. 00:20:30.920 |
Models like DeepSeek R1 have weird, quirky reasoning flaws. 00:20:35.400 |
For the purposes of testing this coding side project that I'm doing, 00:20:38.360 |
I asked DeepSeek R1 to come up with this multiple choice quiz. 00:20:41.560 |
It had to meet certain parameters and it failed to meet them. 00:20:45.160 |
But do you notice a slight flaw with the multiple choice answers? 00:20:50.120 |
Let's just say that they are somewhat biased towards answers B and C. 00:20:56.120 |
Will these remaining reasoning blind spots, you could call them, 00:20:59.400 |
be filled as a by-product of continued scaling of, say, RL? 00:21:08.680 |
If they are, we could have AGI in those very short timelines. 00:21:18.360 |
If they're not, there could be AGI denialists in 2030 and beyond. 00:21:31.240 |
Finally, Humanity's Last Exam, though I would say the title is a little bit misleading, 00:21:35.880 |
because another challenging benchmark is already in the works, 00:21:38.440 |
which apparently takes groups of humans days to complete. 00:21:41.880 |
Some people have focused on the fact that DeepSeek R1 performs best, 00:21:45.880 |
getting 9.4% on this hardest of the hard benchmarks. 00:21:49.880 |
The truth is, it's the way they created the benchmark. 00:21:54.600 |
They kept iterating on the questions until they found questions that O1 struggled on. 00:21:59.080 |
R1 came out too recently for them to do that kind of iteration on it, 00:22:01.000 |
so it's not fully accurate to say that it performs best. 00:22:09.640 |
The questions themselves, by the way, test things like minute details of hummingbird anatomy. 00:22:15.400 |
A model getting, say, 90% on this benchmark will be amazing and incredible, 00:22:20.600 |
but I don't think it would be quite as impactful 00:22:23.000 |
as a model getting, say, 90% on an agency benchmark, 00:22:26.520 |
as I touched on at the beginning of this video. 00:22:28.600 |
An agent properly being able to do remote tasks would be a far bigger deal. 00:22:34.600 |
And I'm kind of glad that the original name for this particular benchmark didn't stick, 00:22:42.360 |
because I could see this particular benchmark being beaten sooner than such a name would suggest. 00:22:51.400 |
So thank you so much for making it to the end, 00:22:57.320 |
especially those of you who have to wade through countless random headlines just to keep up.