o3 - wow
Chapters
0:00 Introduction
1:19 What is o3?
3:18 FrontierMath
5:15 o4, o5
6:03 GPQA
6:24 Coding, Codeforces + SWE-verified, AlphaCode 2
8:13 1st Caveat
9:03 Compositionality?
10:16 SimpleBench?
13:11 ARC-AGI, Chollet
20:25 Safety Implications
The model announced tonight by OpenAI, called O3, could well be the final refutation of the idea that artificial intelligence was hitting a wall. OpenAI, it seems, have not so much surmounted that wall as supplied evidence that the wall did not in fact exist. The real news of tonight isn't, for me, that O3 just crushed benchmarks designed to stand for decades. It's that OpenAI have shown that anything you can benchmark, the O-series of models can eventually beat. Let me invite you to think of any challenge. If that challenge is ultimately susceptible to reasoning, and if the reasoning steps are represented anywhere in the training data, the O-series of models will eventually crush that challenge. Yes, it might have cost O3, or OpenAI, $350,000 in thinking time to beat some of these benchmarks, but costs alone will not hold the tide at bay for long. Yes, I'll give the caveats, I always do, and there are quite a few. But I must admit, and I will admit, that this is a monumental day in AI, and pretty much everyone listening should adjust their timelines.
Before we get to the absolutely crazy benchmark scores, what actually is O3? What did they do? Well, I've given more detail on the O-series of models in previous videos on this channel, but let me give you a 30-second summary. OpenAI get the base model to generate hundreds or potentially thousands of candidate solutions, following long chains of thought, to get to an answer. A verifier model, likely based on the same base model, then reviews those answers and ranks them, looking for classic calculation mistakes or reasoning mistakes. That verifier model, of course, is trained on thousands of correct reasoning steps. But here's the kicker: in scientific domains like mathematics and coding, you can know what the correct answer is. So when the system generates a correct set of reasoning steps, steps that lead to the correct verified answer, then the model as a whole can be fine-tuned on those correct steps. This fundamentally shifts us from predicting the next word to predicting the series of tokens that will lead to an objectively correct answer. That fine-tuning on just the correct answers can be classed as reinforcement learning.
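To make that loop concrete, here's a deliberately toy sketch in Python of the generate-verify-select-fine-tune idea just described. The stand-in model, stand-in verifier, and every function name are my own illustrative assumptions, not OpenAI's actual implementation; only the control flow matters.

```python
import random

# Toy sketch of the generate -> verify -> fine-tune loop described above.
# The "model" and "verifier" are deliberately dumb stand-ins.

def generate_chain(question):
    """Stand-in base model: sample one chain of thought plus a final answer."""
    return {"steps": f"reasoning about {question!r}...",
            "answer": random.randint(0, 9)}

def verifier_score(chain):
    """Stand-in verifier: higher = fewer apparent calculation/reasoning slips."""
    return random.random()

def best_of_n(question, n=100):
    """Sample n candidate chains and keep the one the verifier ranks highest."""
    candidates = [generate_chain(question) for _ in range(n)]
    return max(candidates, key=verifier_score)

def collect_finetuning_data(problems):
    """In checkable domains (maths, code), keep only chains whose final answer
    matches the known-correct one; these become fine-tuning data -- the
    'reinforcement learning' step described above."""
    kept = []
    for question, gold_answer in problems:
        chain = best_of_n(question)
        if chain["answer"] == gold_answer:
            kept.append(chain)
    return kept

print(len(collect_finetuning_data([("What is 2 + 3?", 5)] * 20)))
```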
So what then is O3? Well, more of the same. As one researcher at OpenAI told us tonight, O3 is powered by further scaling up reinforcement learning beyond O1. No special ingredient added to O1, it seems. No secret sauce. No wall. And that's why I said in the intro, if you can benchmark it, the O-series of models can eventually beat it. What I don't want to imply, though, is that this leap forward with O3 was entirely predictable. Yes, I talked about AI being on an exponential in my first video of this year, and I even referenced verifiers and inference-time compute. That's the fancy term for thinking longer and generating more candidate solutions. But I am in pretty good company in not predicting this much of a leap this soon. Let's briefly start with FrontierMath.
How did O3 do? Here's how it was described in tonight's announcement: "This is considered today the toughest mathematical benchmark out there. It's a dataset that consists of novel, unpublished, and very, very hard problems. It would take professional mathematicians hours or even days to solve one of these problems. And today, all offerings out there have less than 2% accuracy on this benchmark. And we're seeing that with O3, in aggressive test-time settings, we're able to get over 25%."
They didn't say this in the announcement tonight, but the darker, smaller part of the bar is the model getting it right with only one attempt. The lighter part of the bar is when the model gave lots of different solutions, and the answer that came up most often, the consensus answer, was the correct answer.
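That consensus trick is usually called self-consistency, or majority voting: sample many independent solutions and submit the answer that appears most often. Here's a minimal sketch, assuming a made-up sampler in place of the real model:

```python
import random
from collections import Counter

def sample_answer(question):
    """Stand-in for one full attempt at the problem: right (42) only 40%
    of the time, otherwise a scattered wrong guess."""
    return 42 if random.random() < 0.4 else random.randint(0, 99)

def consensus_answer(question, n_samples=64):
    """Sample many solutions and return the most common final answer --
    the lighter, 'consensus' part of the bar."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# A single attempt is right 40% of the time; the consensus over 64
# attempts is right almost always, because wrong answers rarely agree.
print(consensus_answer("some hard maths problem"))  # usually 42
```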
We'll get to time and cost in a moment, but those details aside, the achievement of 25% is monumental. Here's what Terence Tao, arguably the smartest guy in the world by the way, said at the beginning of November: these questions are extremely challenging, and "I think that in the near term, basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert, like a grad student in a related field, paired with some combination of a modern AI and lots of other algebra packages." Given that O3 doesn't rely on algebra packages, he's basically saying that O3 must be a real domain expert in mathematics. Summing up, Terence Tao said that this benchmark would resist AIs for several years at least.
Sam Altman seemed to imply that they were releasing the full O3 perhaps in February, or at least in the first quarter of next year. And that implies, to me at least, that they didn't just bust every single GPU on the planet to get this score, knowing they could never realistically serve it to the public. Or, to phrase things another way, we are not at the limits of the compute we even have available today. The next generation, O4, could be with us by quarter two of next year, O5 by quarter three. Here's what another top OpenAI researcher said: "O3 is very performant. More importantly, progress from O1 to O3 was only three months, which shows how fast progress will be in the new paradigm of reinforcement learning on chain of thought to scale inference compute." Way faster than the pre-training paradigm of a new model every one to two years. We may never get GPT-5, but get AGI anyway. Of course, safety testing may well end up delaying the release to the public of these new generations of models. And so there might end up being an increasingly wide gap between what the frontier labs have available to use themselves and what the public has. What about Google-proof graduate-level science questions?
As one OpenAI researcher put it, "Take a moment of silence for that benchmark. It was born in November of 2023 and died just a year later." Why RIP GPQA? Well, O3 gets 87.7 percent. Benchmarks are being crushed almost as quickly as they can be created.
Then there's competitive coding, where O3 establishes itself as the 175th highest-scoring global competitor on Codeforces: better at this coding competition than 99.95 percent of humans. Now you might say that's competition coding, that's not real software engineering. But then we had SWE-bench Verified. That benchmark tests real issues faced by real software engineers. The "verified" part refers to the fact that the benchmark was combed for only genuine questions with real, clear answers. Claude 3.5 Sonnet gets 49 percent; O3 gets 71.7 percent. As foreseen, you could argue, by the CEO of Anthropic, the creators of Claude: "The latest model we released, Sonnet 3.5, the new or updated version, it gets something like 50 percent on SWE-bench. And SWE-bench is an example of a bunch of professional, real-world software engineering tasks. At the beginning of the year, I think the state of the art was three or four percent. So in 10 months, we've gone from three percent to 50 percent on this task. And I think in another year, we'll probably be at 90 percent. I mean, I don't know, but it might even be less than that."
Before you ask, by the way: yes, these were unseen programming competitions; this isn't data contamination. Again, if you can benchmark it, the O-series of models will eventually, or imminently, beat it. Interestingly, if you were following the channel closely, you might have guessed that this was coming in Codeforces. As of this time last year, Google produced AlphaCode 2, which in certain parts of the Codeforces competition outperformed 99.5 percent of competition participants. And they went on, prophetically: "We find that performance increases roughly log-linearly with more samples."
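The usual way to quantify that samples-to-solve-rate relationship is the unbiased pass@k estimator from OpenAI's HumanEval paper (Chen et al., 2021): the probability that at least one of k samples solves the task. A quick sketch; the benchmark numbers below are made up purely for illustration:

```python
def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, out of n total of which c are correct, passes."""
    if n - c < k:
        return 1.0
    miss = 1.0
    for i in range(k):
        miss *= (n - c - i) / (n - i)  # P(all k drawn samples are incorrect)
    return 1.0 - miss

# Made-up numbers: 30 correct solutions among 10,000 samples per problem.
# The solve rate climbs steeply as the sampling budget grows.
for k in (1, 10, 100, 1000):
    print(f"pass@{k}: {pass_at_k(n=10_000, c=30, k=k):.3f}")
```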
Yes, of course, I'm going to get to ARC-AGI, but I just want to throw in my first quick caveat. What happens if you can't benchmark it, or at least it's harder to benchmark, or the field isn't as susceptible to reasoning steps? How about personal writing, for example? Well, as OpenAI admitted back in September, the O-series of models, starting with O1-preview, is not preferred on some natural language tasks, suggesting that it's not well suited for all use cases. Again then, think of a task. Is there an objectively correct answer to that task? The O-series will likely soon beat it, and as O3 proved tonight, that's regardless of how difficult the task is. Is the correctness of the answer, or the quality of the output, more a matter of taste, however? Well, that might take longer to beat.
What about core reasoning, though? Out-of-distribution generalization? That's what I started this channel to cover back at the beginning of last year. Forgetting about cost or latency for a moment, what we all want to know is: how intrinsically intelligent are these models? That will dictate everything else, and I will raise that question through three examples to end the video. The first is compositionality, which came up in a famous paper published in Nature last year. Essentially, you test models by making up a language full of concepts like "between", or "double", or colours, and see if they can compose those concepts into a correct answer. The concepts are abstract enough that they would of course never have been seen in the training data. The original GPT-4 flopped hard at this challenge in the Nature paper, and O1 Pro mode gets close, but still can't do it. After thinking for 9 minutes, it successfully translates "who" as "double", but doesn't quite understand "moreau". It thinks it's something about symmetry, but doesn't grasp that it means "between". Will O3 master compositionality? I can't answer that question, because I can't yet test it.
Next is, of course, my own benchmark, called SimpleBench. This video was originally meant to be a summary of the 12 days: I was going to show off Veo 2 and talk about Gemini 2.0 Flash Thinking Experimental from Google. The thinking, this time in visible chains of thought, is reminiscent of the O-series of models. On the 3 runs we've done so far, it scores around 25%, which is great for such a small model as Flash, but isn't quite as good as even their own model, Gemini Experimental 1206. For this particular day of shipmas, though, we are putting Google to one side, because OpenAI have produced O3.
So here's what I'm looking out for in O3, to see whether it would crush SimpleBench: essentially, it needs to master spatial reasoning. Now, you can pause and read the question yourself, but I helpfully supplied O1 Pro mode with this visual as well. Without even reading the question, what would you say would happen to this glove if it fell off of the bike? And let's say I also supplied you with the speed of the river. Well, you might well say to me: thanks for all of those details, but honestly, the glove is just going to fall onto the road. O1 doesn't even consider that possibility, and never does, because spatial data isn't really in its training data, nor is sophisticated social reasoning data. Wait, let me caveat that: of course we don't know what is in the training data. I just suspect it's not in the training data of O1, at least. Likely not in O3's either, but we don't know. Is the base model for O3 Orion, what would have been GPT-4.5 or GPT-5? OpenAI never mentioned a shift in what the base model was, but they haven't denied it either. Someone could make the argument that O3 is so good at something like physics that it can intuit for itself what would happen in spatial reasoning scenarios. Maybe, but we'd have to test it.
What I do have to remind myself, though, with SimpleBench and spatial reasoning more generally, is that it doesn't strike me as a fundamental limitation for the model going forward. As I said right at the start of the intro to this video, OpenAI have, with O3, fundamentally demonstrated the extent of a generalizable approach to solving things. In other words, with enough spatial reasoning data, and good spatial reasoning benchmarks, and some more of that scaled-up reinforcement learning, I think models would get great at this too. And frankly, even if benchmarks like SimpleBench can last a little bit longer, because of a paucity of spatial reasoning data, or text-based spatial reasoning data not being enough, you have simulators like Genesis that can model physics and give models like O3 almost infinite training data of lifelike simulations. You could almost imagine O3 or O4 being unsure of an answer, spinning up a simulation, spotting what would happen, and answering accordingly.
And now, at last, what about ARC-AGI? I made an entire video not that long ago about how this particular challenge, created by François Chollet, was a necessary but not sufficient condition for AGI. The reason why O3 beating this benchmark is so significant is because each example is supposed to be a novel test: a challenge, in other words, that's deliberately designed not to be in any training data, past or present. Beating it therefore has to involve some degree of genuine reasoning.
In case you're wondering, by the way, I think reasoning is actually a spectrum. I define it as deriving efficient functions and composite functions. LLMs, therefore, have always done a form of reasoning; it's just that the functions they derive are not particularly efficient. More like convoluted interpolations. Humans tend to spot things quicker and have more meta rules of thumb. And with those more meta rules of thumb, we can generalise better and solve challenges that we haven't seen before more efficiently. Hence why many humans can see what has occurred to get from input 1 to output 1, and from input 2 to output 2. GPT-4 couldn't, and even O1 couldn't really. And for these specific examples, even O3 can't.
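To make that "deriving functions" framing concrete, here's a toy, much-simplified sketch of ARC-style induction: given a few input-output grids, search a tiny, hand-picked space of candidate transformations for one consistent with every example. Real ARC tasks demand far richer, composed functions; everything here is a made-up miniature.

```python
import numpy as np

# Toy ARC-style induction: infer the transformation from examples by
# searching a tiny space of candidate grid functions.

CANDIDATES = {
    "rotate_90": lambda g: np.rot90(g),
    "mirror": lambda g: np.fliplr(g),
    "invert_colours": lambda g: 1 - g,
}

def induce_rule(examples):
    """Return the name of the first candidate consistent with all examples,
    or None -- a 'novel' task, as far as this tiny solver is concerned."""
    for name, fn in CANDIDATES.items():
        if all(np.array_equal(fn(np.array(inp)), np.array(out))
               for inp, out in examples):
            return name
    return None

examples = [
    ([[1, 0], [0, 0]], [[0, 1], [1, 1]]),
    ([[1, 1], [0, 1]], [[0, 0], [1, 0]]),
]
print(induce_rule(examples))  # -> "invert_colours"
```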
Yes, it might surprise you: there are still questions that aren't crazy hard that O3 can't get right. Nevertheless, O3, when given maximal compute, which I've calculated at being 350 grand's worth, gets 88%. And here's what the author of that benchmark said: "This isn't just brute force. Yes, it's very expensive, but these capabilities are new territory and they demand serious scientific attention. We believe it represents a significant breakthrough in getting AI to adapt to novel tasks." Reinforced again and again on those chains of thought, or reasoning steps, that led it to correct answers, O3 has gotten pretty good at deriving efficient functions. In other words, it reasons pretty well. Now, Chollet has often mentioned in the past that many of his smart friends scored around 98% on ARC-AGI. But a fairly recent paper from September showed that when an exhaustive study was done of average human performance, it was 64.2% on the public evaluation set. Chollet himself predicted two and a half years ago that there wouldn't be a "pure" transformer-based model that gets greater than 50% on previously unseen ARC tasks within a time limit of five years.
Again, I want to give you a couple of quick caveats before we get to Chollet's assessment of whether O3 is AGI. One OpenAI researcher admitted that it took 16 hours for O3 to reach that 87.5%, climbing at a rate of around 3.5% an hour. And another caveat, this time from Chollet's public statement on O3: OpenAI apparently requested that he didn't publish the high compute costs involved in getting that high score. But he kind of did anyway, saying the amount of compute was roughly 172x the low-compute configuration. If the low-compute, high-efficiency retail cost was $2,000, then by my calculation (172 × $2,000 ≈ $344,000), that's around $350,000 to get the 87.5%. If your day job is solving ARC-AGI challenges and you're paid less than $350,000 a year, you're safe, just for now. And of course, if you're crazy worried by cost, there's always O3-mini, which gets close to the performance of O3 for a fraction of the cost. But more seriously, he said later in the statement that "cost performance will likely improve quite dramatically over the next few months and years, so you should plan for these capabilities to become competitive with human work within a fairly short timeline." The challenge was always to get models to reason; the costs and latency came second. Those can drop later with more GPUs, Moore's Law, and algorithmic efficiency. It's the crushing of these challenges that was the hard part. Cost is not a barrier that's going to last long.
Now, Chollet does go on to say that O3 still fails on some very easy tasks. And you might argue that that ARC challenge I showed just earlier was such an example: the blocks move, essentially, in the direction of the lines that protrude out of them. And he mentions that he's crafting a so-called ARC-AGI-2 benchmark that he thinks will still pose a significant challenge to O3, potentially reducing its score to under 30% "even at high compute, while a smart human would still be able to score over 95% with no training." Sounds like he's almost already tested it. Notice that's a smart human, rather than an average human, though. And also, it's kind of like: O3 is under 30%, but what about O4, O5? What if even O6 is released before the end of 2025? That's maybe why Mike Knoop, the funder of the ARC $1 million prize, says, "We want AGI benchmarks that can endure many years. I do not expect V2 will." And so, cryptically, he says, "We're also starting turning attention to V3, which will be very different."
That sets up, then, the crucial definition of what counts as AGI. Is it still not AGI as long as there's any benchmark that the average human can outperform a model at? Chollet's position, at least as of tonight, is that he doesn't believe that O3 is AGI. The reason? Because it's still feasible to create unsaturated, not-yet-crushed, interesting benchmarks that are easy for humans yet impossible for AI, without involving specialist knowledge. In sum, we will have AGI when creating such evals becomes outright impossible. The question is: is that a fair marker? Does it have to be impossible to create such a benchmark, one that humans can beat easily yet is impossible for AI? Or should the definition of AGI be when it's harder to create a benchmark that's easier for humans than it is for AI? In a way, that seems like a fairer definition, so that we aren't still saying "not AGI" just because a single benchmark is holding out while all the rest have fallen. That, of course, leaves the question of whether it is already harder to create a benchmark that O3 can't solve and yet is easy for humans. Do we consider different modalities? Can it spot the lack of realism in certain AI-generated videos? What kinds of benchmarks are allowed, or not allowed? What about benchmarks where we factor in how quickly challenges are solved? I, alas, can't provide a satisfying answer for those of you who want a simple yes/no, AGI or not.
What I can do, though, is shine a light on the significance of this achievement. Again, it's not about particular benchmarks. It's about an approach that can be used again and again, on whatever benchmark you create, and at whatever scale you can pay for. It's almost like they've shown that they can defeat the very concept of a benchmark. Yes, of course I read the paper released tonight by OpenAI on deliberative alignment. Essentially, they use these same reasoning techniques to get the models to be great at refusing harmful requests while also not over-refusing innocent ones. Noam Brown, who is one of the research leads for O1, said that the FrontierMath result actually had safety implications. He said that even if LLMs are dumb in some ways (and of course I can't yet test O3 on SimpleBench, nor even O1; they haven't yet given me API access), "saturating evals like FrontierMath suggests AI is surpassing top human intelligence in certain domains." The first implication of that, he said, is that we may see a broad acceleration in scientific research. But then he went on: "This also means that AI safety topics, like scalable oversight, may soon stop being hypothetical. Research in these domains needs to be a priority for the field." Scalable oversight, in a ridiculous nutshell, is answering the question of how a dumber model, or a dumber human, can still have oversight over a smarter model.
This, then, is one of the co-creators of O3 saying we really need to start focusing on safety. It's perhaps then more credible when OpenAI researchers like John Hallman say this: "When Sam and us researchers say AGI is coming, we aren't doing it to sell you Kool-Aid, a $2,000 subscription, or to trick you into investing in our next round. It's actually coming." Whatever you've made of O3 tonight, let me know in the comments; I personally can't wait to test it. This has been a big night in AI, and thank you so much for joining me on it. As always, we'd love to see you over on Patreon, where I'll be continuing the discussion and, actually, fairly soon releasing a mini-documentary on the fateful year 2015, when OpenAI started. But regardless, wherever you are, have a wonderful day.