
o3 - wow


Chapters

0:00 Introduction
1:19 What is o3?
3:18 FrontierMath
5:15 o4, o5
6:03 GPQA
6:24 Coding, Codeforces + SWE-verified, AlphaCode 2
8:13 1st Caveat
9:03 Compositionality?
10:16 SimpleBench?
13:11 ARC-AGI, Chollet
20:25 Safety Implications

Transcript

The model announced tonight by OpenAI, called O3, could well be the final refutation of the idea that artificial intelligence was hitting a wall. OpenAI, it seems, have not so much surmounted that wall as supplied evidence that the wall did not in fact exist. The real news of tonight isn't, for me, that O3 just crushed benchmarks designed to stand for decades.

It's that OpenAI have shown that anything you can benchmark, the O-series of models can eventually beat. Let me invite you to think of any challenge. If that challenge is ultimately susceptible to reasoning, and if the reasoning steps are represented anywhere in the training data, the O-series of models will eventually crush that challenge.

Yes, it might have cost O3, or OpenAI, $350,000 in thinking time to beat some of these benchmarks, but costs alone will not hold the tide at bay for long. Yes, I'll give the caveats, I always do, and there are quite a few. But I must admit, and I will admit, that this is a monumental day in AI, and pretty much everyone listening should adjust their timelines.

Before we get to the absolutely crazy benchmark scores, what actually is O3? What did they do? Well, I've given more detail on the O-series of models in previous videos on this channel but let me give you a 30 second summary. OpenAI get the base model to generate hundreds or potentially thousands of candidate solutions, following long chains of thought, to get to an answer.

A verifier model, likely based on the same base model, then reviews those answers and ranks them, looking for classic calculation mistakes or reasoning mistakes. That verifier model, of course, is trained on thousands of correct reasoning steps. But here's the kicker, in scientific domains like mathematics and coding, you can know what the correct answer is.

So when the system generates a correct set of reasoning steps, steps that lead to the correct verified answer, then the model as a whole can be fine-tuned on those correct steps. This fundamentally shifts us from predicting the next word to predicting the series of tokens that will lead to an objectively correct answer.
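To make that loop concrete, here is a minimal, hypothetical sketch in Python of the idea as described above: sample many chains of thought, rank them with a verifier, keep only those that reach the known-correct answer, and collect them for fine-tuning. Every function name here (sample_chain, verifier_score, collect_training_data) is a placeholder for illustration, not OpenAI's actual method or API.

import random

def sample_chain(problem):
    # Placeholder: a real system would sample a long chain of thought
    # from the base model and end with a final answer.
    answer = random.choice(["42", "41", "43"])
    return {"steps": f"reasoning about {problem}...", "answer": answer}

def verifier_score(chain):
    # Placeholder: a verifier model, itself trained on correct reasoning
    # steps, would score the chain for calculation and reasoning mistakes.
    return random.random()

def collect_training_data(problems, gold_answers, n_samples=64):
    kept = []
    for problem, gold in zip(problems, gold_answers):
        chains = [sample_chain(problem) for _ in range(n_samples)]
        chains.sort(key=verifier_score, reverse=True)  # rank by verifier
        for chain in chains:
            if chain["answer"] == gold:                # objectively checkable
                kept.append(chain)                     # keep the correct chains
    return kept                                        # fine-tune on these

data = collect_training_data(["What is 6 * 7?"], ["42"])
print(f"kept {len(data)} verified-correct chains for fine-tuning")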

That fine-tuning on just the correct answers can be classed as reinforcement learning. So what then is O3? Well, more of the same. As one researcher at OpenAI told us tonight, O3 is powered by further scaling up reinforcement learning beyond O1. No special ingredient added to O1, it seems. No secret sauce.

No wall. And that's why I said in the intro, if you can benchmark it, the O series of models can eventually beat it. What I don't want to imply, though, is that this leap forward with O3 was entirely predictable. Yes, I talked about AI being on an exponential in my first video of this year, and I even referenced verifiers and inference time compute.

That's the fancy term for thinking longer and generating more candidate solutions. But I am in pretty good company in not predicting this much of a leap this soon. Let's briefly start with FrontierMath. How did O3 do? This is considered today the toughest mathematical benchmark out there: a dataset that consists of novel, unpublished, and also very hard problems.

These are extremely hard, very, very hard problems. Even in terms of analysis, it would take professional mathematicians hours or even days to solve one of these problems. Today, all offerings out there have less than 2% accuracy on this benchmark, and with O3, in aggressive test-time settings, we're able to get over 25%.

They didn't say this in the announcement tonight, but the darker, smaller part of the bar is the model getting it right with only one attempt. The lighter part of the bar is when the model gave lots of different solutions and the one that came up most often, the consensus answer, was the correct answer.
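For those who want the mechanics of that consensus score spelled out, here is a tiny sketch with made-up answers showing how a majority vote over sampled solutions works:

from collections import Counter

def consensus_answer(samples):
    # Majority vote: the final answer that appears most often wins.
    return Counter(samples).most_common(1)[0][0]

# Hypothetical final answers from several independent samples of the model.
samples = ["17", "23", "17", "17", "9", "23", "17"]
print(consensus_answer(samples))  # prints "17", the consensus answer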

We'll get to time and cost in a moment, but those details aside, the achievement of 25% is monumental. Here's what Terence Tao, arguably the smartest guy in the world by the way, said at the beginning of November: these questions are extremely challenging, and he thinks that in the near term, basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert, like a grad student in a related field, paired with some combination of a modern AI and lots of other algebra packages.

Given that O3 doesn't rely on algebra packages, he's basically saying that O3 must be a real domain expert in mathematics. Summing up, Terence Tao said that this benchmark would resist AIs for several years at least. Sam Altman seemed to imply that they were releasing the full O3 perhaps in February or at least the first quarter of next year.

And that implies, to me at least, that they didn't just bust every single GPU on the planet to get a score they could never realistically serve to the public. Or to phrase things another way, we are not at the limits of the compute we even have available today.

The next generation, O4, could be with us by quarter two of next year, O5 by quarter three. Here's what another top OpenAI researcher said: "O3 is very performant. More importantly, progress from O1 to O3 was only three months, which shows how fast progress will be in the new paradigm of reinforcement learning on chain of thought to scale inference compute. Way faster than the pre-training paradigm of a new model every one to two years."

We may never get GPT-5, but get AGI anyway. Of course, safety testing may well end up delaying the release to the public of these new generations of models. And so there might end up being an increasingly wide gap between what the frontier labs have available to use themselves and what the public has.

What about Google-proof, graduate-level science questions? As one OpenAI researcher put it, "Take a moment of silence for that benchmark. It was born in November of 2023 and died just a year later." Why RIP GPQA? Well, O3 gets 87.7 percent. Benchmarks are being crushed almost as quickly as they can be created.

Then there's competitive coding where O3 establishes itself as the 175th highest scoring global competitor. Better at this coding competition than 99.95 percent of humans. Now you might say that's competition coding. That's not real software engineering. But then we had SWE Bench verified. That benchmark tests real issues faced by real software engineers.

The verified part refers to the fact that the benchmark was combed for only genuine questions with real, clear answers. Claude 3.5 Sonnet gets 49 percent; O3 gets 71.7 percent. As foreseen, you could argue, by the CEO of Anthropic, the creators of Claude: the latest model we released, Sonnet 3.5, the new or updated version, gets something like 50 percent on SWE Bench.

And SWE Bench is an example of a bunch of professional, real-world software engineering tasks. At the beginning of the year, I think the state of the art was three or four percent. So in 10 months, we've gone from three percent to 50 percent on this task. And I think in another year, we'll probably be at 90 percent.

I mean, I don't know, but it might even be less than that. Before you ask, by the way, yes, these were unseen programming competitions; this isn't data contamination. Again, if you can benchmark it, the O series of models will eventually or imminently beat it. Interestingly, if you were following the channel closely, you might have guessed that this was coming in Codeforces.

As of this time last year, Google produced AlphaCode 2, which in certain parts of the Codeforces competition outperformed 99.5 percent of competition participants. And they went on, prophetically, "We find that performance increases roughly log-linearly with more samples." Yes, of course, I'm going to get to Arc AGI, but I just want to throw in my first quick caveat.

What happens if you can't benchmark it, or at least it's harder to benchmark, or the field isn't as susceptible to reasoning steps? How about personal writing, for example? Well, as OpenAI admitted back in September, the O series of models, starting with O1 preview, is not preferred on some natural language tasks, suggesting that it's not well suited for all use cases.

Again then, think of a task. Is there an objectively correct answer to that task? The O series will likely soon beat it. As O3 proved tonight, that's regardless of how difficult that task is. Is the correctness of the answer or the quality of the output more a matter of taste, however?

Well, that might take longer to beat. What about core reasoning, though? Out of distribution generalization? What I started this channel to cover back at the beginning of last year. Forgetting about cost or latency for a moment, what we all want to know is how intrinsically intelligent are these models?

That will dictate everything else, and I will raise that question through three examples to end the video. The first is compositionality, which came in a famous paper in Nature published last year. Essentially, you test models by making up a language full of concepts like between, or double, or colours, and see if they can compose those concepts into a correct answer.

The concepts are abstract enough that they would of course never have been seen in the training data. The original GPT-4 flopped hard at this challenge in the paper in Nature, and O1 Pro mode gets close, but still can't do it. After thinking for 9 minutes, it successfully translates "who" as "double", but doesn't quite understand "moreau".

It thinks it's something about symmetry, but doesn't grasp that it means between. Will O3 master compositionality? I can't answer that question because I can't yet test it. Next is of course my own benchmark, called SimpleBench. This video was originally meant to be a summary of the 12 days: I was going to show off Veo 2 and talk about Gemini 2.0 Flash Thinking Experimental from Google.

The thinking, this time in visible chains of thought, is reminiscent then of the O series of models. On the 3 runs we've done so far, it scores around 25%, which is great for such a small model as Flash, but isn't quite as good as even their own model, Gemini Experimental 1206.

For this particular day of shipmas, though, we are putting Google to one side, because OpenAI have produced O3. So here's what I'm looking out for in O3 to see whether it would crush SimpleBench. Essentially, it needs to master spatial reasoning. Now you can pause and read the question yourself, but I helpfully supplied O1 Pro mode with this visual as well.

And without even reading the question, what would you say would happen to this glove if it fell off of the bike? And let's say I also supplied you with the speed of the river. Well you might well say to me, thanks for all of those details, but honestly the glove is just going to fall onto the road.

O1 doesn't even consider that possibility, and never does, because spatial data isn't really in its training data, nor is sophisticated social reasoning data. Wait, let me caveat that: of course we don't know what is in the training data; I just suspect it's not in the training data of O1, at least.

Likely not in O3 either, but we don't know. Is the base model for O3 Orion, or what would have been GPT-4.5 or GPT-5? OpenAI never mentioned a shift in what the base model was, but they haven't denied it either. Someone could make the argument that O3 is so good at something like physics that it can intuit for itself what would happen in spatial reasoning scenarios.

Maybe, but we'd have to test it. What I do have to remind myself, though, with SimpleBench and spatial reasoning more generally, is that it doesn't strike me as a fundamental limitation for the model going forward. As I said right at the start of the intro to this video, OpenAI have, with O3, fundamentally demonstrated the extent of a generalizable approach to solving things.

In other words, with enough spatial reasoning data, good spatial reasoning benchmarks, and some more of that scaled-up reinforcement learning, I think models would get great at this too. And frankly, even if benchmarks like SimpleBench can last a little bit longer because of a paucity of spatial reasoning data, or because text-based spatial reasoning data isn't enough, you have simulators like Genesis that can model physics and give models like O3 almost infinite training data of lifelike simulations.

You could almost imagine O3 or O4 being unsure of an answer, spinning up a simulation, spotting what would happen, and then outputting the answer. And now, at last, what about Arc AGI? I made an entire video not that long ago about how this particular challenge, created by François Chollet, was a necessary but not sufficient condition for AGI.

The reason why O3 beating this benchmark is so significant is because each example is supposed to be a novel test. A challenge, in other words, that's deliberately designed not to be in any training data, past or present. Beating it therefore has to involve at least a certain level of reasoning.

In case you're wondering, by the way, I think reasoning is actually a spectrum. I define it as deriving efficient functions and composite functions. LLMs, therefore, have always done a form of reasoning; it's just that the functions they derive are not particularly efficient. More like convoluted interpolations. Humans tend to spot things quicker and have more meta rules of thumb.

And with these more meta rules of thumb, we can generalise better and solve challenges that we haven't seen before more efficiently. Hence why many humans can see what has occurred to get from input 1 to output 1, input 2 to output 2. GPT-4 couldn't and even O1 couldn't really.

And for these specific examples, even O3 can't. Yes, it might surprise you, but there are still questions that aren't crazy hard that O3 can't get right. Nevertheless, O3, when given maximal compute, which I've calculated at being 350 grand's worth, gets 88%. And here's what the author of that benchmark said.

"This isn't just brute force. Yes, it's very expensive, but these capabilities are new territory and they demand serious scientific attention. We believe, he said, it represents a significant breakthrough in getting AI to adapt to novel tasks. Reinforced again and again with those chains of thought or reasoning steps that led it to correct answers, O3 has gotten pretty good at deriving efficient functions." In other words, it reasons pretty well.

Now, Chollet has often mentioned in the past that many of his smart friends scored around 98% on Arc AGI. But a fairly recent paper from September showed that when an exhaustive study was done, average human performance was 64.2% on the public evaluation set. Chollet himself predicted two and a half years ago that there wouldn't be a "pure" transformer-based model that gets greater than 50% on previously unseen Arc tasks within a time limit of five years.

Again, I want to give you a couple of quick caveats before we get to his assessment of whether O3 is AGI. One OpenAI researcher admitted that it took 16 hours for O3 to get to 87.5%, with an increase rate of 3.5% an hour. And another caveat, this time from Chollet's public statement on O3.

OpenAI apparently requested that they not publish the high compute costs involved in getting that high score. But they kind of did anyway, saying the amount of compute was roughly 172x the low-compute configuration. If the low-compute, high-efficiency retail cost was $2,000, then by my calculation that's around $350,000 to get the 87.5%.
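For what it's worth, the back-of-the-envelope maths behind that figure is just the multiplication implied by those two numbers: 172 × $2,000 ≈ $344,000, which I'm rounding to roughly $350,000.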

If your day job is solving Arc AGI challenges and you're paid less than $350,000 a year, you're safe, just for now. And of course, if you're crazy worried by cost, there's always O3 mini, which gets close to the performance of O3 for a fraction of the cost. But more seriously, he said later in the statement that cost-performance will likely improve quite dramatically over the next few months and years.

So you should plan for these capabilities to become competitive with human work within a fairly short timeline. The challenge was always to get models to reason. The costs and latency came second. Those can drop later with more GPUs, Moore's law and algorithmic efficiency. It's the crushing of these challenges that was the hard part.

Cost is not a barrier that's going to last long. Now, Chollet does go on to say that O3 still fails on some very easy tasks, and you might argue that the Arc challenge I showed just earlier was such an example. The blocks move essentially in the direction of the lines that protrude out of them.

And he mentions that he's crafting a so-called Arc AGI 2 benchmark that he thinks will still pose a significant challenge to O3, potentially reducing its score to under 30% "even at high compute, while a smart human would still be able to score over 95% with no training." Sounds like he's almost already tested it. Notice that's a smart human rather than an average human, though.

And also, it's kind of like, okay, O3 is under 30%, but what about O4, O5? What if even O6 is released before the end of 2025? That's maybe why Mike Knoop, the funder of the Arc $1 million prize, says, "We want AGI benchmarks that can endure many years. I do not expect V2 will." And so, cryptically, he says, "We're also starting turning attention to V3, which will be very different." That sets up the crucial question, then, of what counts as AGI.

Is it still not AGI as long as there's any benchmark that the average human can outperform a model at? Chollet's position, at least as of tonight, is that he doesn't believe O3 is AGI. The reason? Because it's still feasible to create unsaturated (not crushed), interesting benchmarks that are easy for humans yet impossible for AI, without involving specialist knowledge.

In sum, we will have AGI when creating such evals becomes outright impossible. The question is, is that a fair marker? Does it have to be impossible to create such a benchmark? One that humans can beat easily, yet is impossible for AI? Or should the definition of AGI be when it's harder to create a benchmark that's easier for humans than it is for AI?

In a way, that seems like a fairer definition, such that there isn't just a single benchmark out there that's holding out and the rest have fallen, and we're still saying not AGI. That of course leaves the question of is it harder to create a benchmark that O3 can't solve and yet is easy for humans?

Do we consider different modalities? Can it spot the lack of realism in certain AI generated videos? What kind of benchmarks are allowed or are not allowed? What about benchmarks where we factor in how quickly challenges are solved? I alas can't provide a satisfying answer for those of you who want a simple yes/no AGI or not.

What I can do though is shine a light on the significance of this achievement. Again, it's not about particular benchmarks. It's about an approach that can be used again and again on whatever benchmark you create and to whatever scale you can pay for. It's almost like they've shown that they can defeat the very concept of a benchmark.

Yes, of course I read the paper released tonight by OpenAI on deliberative alignment. Essentially, they use these same reasoning techniques to get the models to be great at refusing harmful requests while also not over-refusing innocent ones. Noam Brown, one of the research leads for O1, said that the FrontierMath result actually had safety implications.

He said even if LLMs are dumb in some ways (and of course I can't yet test O3 on SimpleBench, nor even O1; they haven't yet given me API access), "saturating evals like FrontierMath suggests AI is surpassing top human intelligence in certain domains." The first implication of that, he said, is that we may see a broad acceleration in scientific research.

But then he went on: "This also means that AI safety topics, like scalable oversight, may soon stop being hypothetical. Research in these domains needs to be a priority for the field." Scalable oversight, in a ridiculous nutshell, is the question of how a dumber model, or a dumber human, can still maintain oversight over a smarter model.

This then is one of the co-creators of O3 saying we really need to start focusing on safety. It's perhaps then more credible when OpenAI researchers like John Holman say this, "When Sam and us researchers say AGI is coming, we aren't doing it to sell you Kool-Aid, a $2,000 subscription, or to trick you to invest in our next round.

It's actually coming." Whatever you've made of O3 tonight, let me know in the comments, I personally can't wait to test it. This has been a big night in AI, and thank you so much for joining me on it. As always, we'd love to see you over on Patreon, where I'll be continuing the discussion and actually fairly soon releasing a mini-documentary on the fateful year 2015 when OpenAI started.

But regardless, wherever you are, have a wonderful day.