
o3 - wow


Chapters

0:00 Introduction
1:19 What is o3?
3:18 FrontierMath
5:15 o4, o5
6:03 GPQA
6:24 Coding, Codeforces + SWE-verified, AlphaCode 2
8:13 1st Caveat
9:03 Compositionality?
10:16 SimpleBench?
13:11 ARC-AGI, Chollet
20:25 Safety Implications


00:00:00.000 | The model announced tonight by OpenAI, called O3, could well be the final refutation that
00:00:08.200 | artificial intelligence was hitting a wall. OpenAI, it seems, have not so much surmounted
00:00:15.100 | that wall as supplied evidence that the wall did not in fact exist. The real news
00:00:20.220 | of tonight isn't, for me, that O3 just crushed benchmarks designed to stand for decades.
00:00:28.180 | It's that OpenAI have shown that anything you can benchmark, the O-series of models
00:00:33.680 | can eventually beat. Let me invite you to think of any challenge. If that challenge
00:00:39.340 | is ultimately susceptible to reasoning, and if the reasoning steps are represented anywhere
00:00:44.580 | in the training data, the O-series of models will eventually crush that challenge. Yes,
00:00:50.400 | it might have cost O3, or OpenAI, $350,000 in thinking time to beat some of these benchmarks,
00:00:58.380 | but costs alone will not hold the tide at bay for long. Yes, I'll give the caveats,
00:01:03.620 | I always do, and there are quite a few. But I must admit, and I will admit, that this
00:01:08.540 | is a monumental day in AI, and pretty much everyone listening should adjust their timelines.
00:01:15.740 | Before we get to the absolutely crazy benchmark scores, what actually is O3? What did they
00:01:21.440 | do? Well, I've given more detail on the O-series of models in previous videos on this
00:01:26.980 | channel but let me give you a 30 second summary. OpenAI get the base model to generate hundreds
00:01:32.700 | or potentially thousands of candidate solutions, following long chains of thought, to get to
00:01:37.940 | an answer. A verifier model, likely based on the same base model, then reviews those
00:01:43.000 | answers and ranks them, looking for classic calculation mistakes or reasoning mistakes.
00:01:48.300 | That verifier model, of course, is trained on thousands of correct reasoning steps. But
00:01:52.940 | here's the kicker, in scientific domains like mathematics and coding, you can know
00:01:57.420 | what the correct answer is. So when the system generates a correct set of reasoning steps,
00:02:03.260 | steps that lead to the correct verified answer, then the model as a whole can be fine-tuned
00:02:08.860 | on those correct steps. This fundamentally shifts us from predicting the next word to
00:02:13.520 | predicting the series of tokens that will lead to an objectively correct answer. That
00:02:18.740 | fine-tuning on just the correct answers can be classed as reinforcement learning.
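To make that 30-second summary concrete, here is a minimal sketch of that generate-verify-fine-tune loop. To be clear, this is my own illustrative scaffolding, not OpenAI's actual pipeline: `generate_candidates`, `verifier_score`, and `check_answer` are hypothetical stand-ins for the base model's sampler, the verifier model, and a domain answer checker.

```python
import random

def generate_candidates(problem, n=8):
    """Sample n long chains of thought from the base model (stubbed here)."""
    return [{"steps": f"reasoning path {i} for {problem!r}",
             "answer": random.choice(["42", "41", "42", "43"])}
            for i in range(n)]

def verifier_score(candidate):
    """Verifier model rates a chain of thought, hunting for calculation
    or reasoning mistakes (stubbed here with a random score)."""
    return random.random()

def check_answer(problem, answer):
    """In domains like maths and coding the final answer can be checked
    objectively; that ground truth is what makes the RL signal possible."""
    return answer == "42"

def collect_finetuning_data(problems, n=8):
    """Keep only chains of thought that both rank highly with the verifier
    AND end in a verified-correct answer; the model is then fine-tuned on
    these traces, shifting it from predicting the next word to predicting
    token series that lead to objectively correct answers."""
    traces = []
    for problem in problems:
        candidates = generate_candidates(problem, n)
        ranked = sorted(candidates, key=verifier_score, reverse=True)
        for cand in ranked:
            if check_answer(problem, cand["answer"]):
                traces.append((problem, cand["steps"]))
                break  # one verified-correct trace per problem in this sketch
    return traces

print(len(collect_finetuning_data(["integrate x^2 from 0 to 1"])), "trace(s) collected")
```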
00:02:23.500 | So what then is O3? Well, more of the same. As one researcher at OpenAI told us tonight,
00:02:30.220 | O3 is powered by further scaling up reinforcement learning beyond O1. No special ingredient
00:02:42.900 | added to O1, it seems. No secret sauce. No wall. And that's why I said in the intro,
00:02:42.900 | if you can benchmark it, the O series of models can eventually beat it. What I don't want
00:02:47.820 | to imply, though, is that this leap forward with O3 was entirely predictable. Yes, I talked
00:02:53.380 | about AI being on an exponential in my first video of this year, and I even referenced
00:02:59.940 | verifiers and inference time compute. That's the fancy term for thinking longer and generating
00:03:05.100 | more candidate solutions. But I am in pretty good company in not predicting this much of
00:03:10.700 | a leap this soon. Let's briefly start with frontier math and
00:03:14.660 | how did O3 do? This is considered today the toughest mathematical
00:03:18.380 | benchmark out there. This is a data set that consists of novel, unpublished, and also very
00:03:24.100 | hard. These are extremely hard.
00:03:25.500 | Yeah, very, very hard problems. Even in terms of analysis, you know, it would take professional
00:03:28.980 | mathematicians hours or even days to solve one of these problems. And today all offerings
00:03:36.060 | out there have less than 2% accuracy on this benchmark. And we're seeing with O3, in aggressive
00:03:41.820 | test time settings, we're able to get over 25%. Yeah.
00:03:45.780 | They didn't say this in the announcement tonight, but the darker part of the bar, the smaller
00:03:50.060 | part, is the model getting it right with only one attempt. The lighter part of the bar is
00:03:55.340 | when the model gave lots of different solutions, but the one that came up the most often, the
00:04:00.820 | consensus answer was the correct answer. We'll get to time and cost in a moment, but those
00:04:05.420 | details aside, the achievement of 25% is monumental. Here's what Terence Tao said at the beginning
00:04:12.580 | of November. These questions are extremely challenging. He's arguably the smartest
00:04:17.220 | guy in the world, by the way. I think that in the near term, basically the only way to
00:04:21.780 | solve them, short of having a real domain expert in the area, is by a combination of
00:04:26.400 | a semi-expert, like a grad student in a related field, paired with some combination of a modern
00:04:31.780 | AI and lots of other algebra packages. Given that O3 doesn't rely on algebra packages,
00:04:38.340 | he's basically saying that O3 must be a real domain expert in mathematics. Summing
00:04:43.540 | up, Terence Tao said that this benchmark would resist AIs for several years at least. Sam
00:04:49.900 | Altman seemed to imply that they were releasing the full O3 perhaps in February or at least
00:04:55.220 | the first quarter of next year. And that implies to me at least that they didn't just bust
00:04:59.740 | every single GPU on the planet to get a score that they could never realistically serve
00:05:05.100 | to the public. Or to phrase things another way, we are not at the limits of the compute
00:05:09.620 | we even have available today. The next generation, O4, could be with us by quarter two of next
00:05:15.700 | year. O5 by quarter three. Here's what another top OpenAI researcher said: "O3 is very performant.
00:05:22.860 | More importantly, progress from O1 to O3 was only three months, which shows how fast progress
00:05:29.060 | will be in the new paradigm of reinforcement learning on chain of thought to scale inference
00:05:34.180 | compute." Way faster than the pre-training paradigm of a new model every one to two years.
00:05:39.500 | We may never get GPT-5, but get AGI anyway. Of course, safety testing may well end up
00:05:45.660 | delaying the release to the public of these new generations of models. And so there might
00:05:49.980 | end up being an increasingly wide gap between what the frontier labs have available to use
00:05:55.500 | themselves and what the public has. What about Google-proof graduate-level science questions?
00:06:01.380 | And as one OpenAI researcher put it, "Take a moment of silence for that benchmark. It
00:06:05.700 | was born in November of 2023 and died just a year later." Why RIP GPQA? Well, O3 gets
00:06:13.380 | 87.7 percent. Benchmarks are being crushed almost as quickly as they can be created.
00:06:21.020 | Then there's competitive coding where O3 establishes itself as the 175th highest scoring
00:06:27.820 | global competitor. Better at this coding competition than 99.95 percent of humans. Now you might
00:06:34.260 | say that's competition coding. That's not real software engineering. But then we had
00:06:38.380 | SWE Bench verified. That benchmark tests real issues faced by real software engineers. The
00:06:44.060 | verified part refers to the fact that the benchmark was combed for only genuine questions
00:06:48.660 | with real, clear answers. Claude 3.5 Sonnet gets 49 percent; O3, 71.7 percent. As foreseen,
00:06:57.660 | you could argue by the CEO of Anthropic, the creators of Claude.
00:07:03.300 | The latest model we released, Sonnet 3.5, the new or updated version, it gets something
00:07:09.580 | like 50 percent on SWE Bench. And SWE Bench is an example of a bunch of professional,
00:07:14.940 | real-world software engineering tasks. At the beginning of the year, I think the state
00:07:20.220 | of the art was three or four percent. So in 10 months, we've gone from three percent to
00:07:25.540 | 50 percent on this task. And I think in another year, we'll probably be at 90 percent. I mean,
00:07:29.900 | I don't know, but it might even be less than that.
00:07:33.380 | Before you ask, by the way, yes, these were unseen programming competitions. This isn't
00:07:38.100 | data contamination. Again, if you can benchmark it, the O series of models will eventually
00:07:44.540 | or imminently beat it. Interestingly, if you were following the channel closely, you might
00:07:49.540 | have guessed that this was coming in Codeforces as of this time last year. Google produced
00:07:54.900 | AlphaCode 2, which in certain parts of the Codeforces competition outperformed 99.5 percent
00:08:01.260 | of competition participants. And they went on prophetically, "We find that performance
00:08:05.980 | increases roughly log-linearly with more samples."
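That log-linear sampling curve is also why labs report both the single-attempt (darker bar) and many-sample (lighter bar) numbers discussed earlier. Here is a hedged sketch of the two aggregation schemes: consensus voting over many sampled answers, and the unbiased pass@k estimator from OpenAI's Codex paper (Chen et al., 2021). The sample answers are made up purely for illustration.

```python
from collections import Counter
from math import comb

def consensus_answer(answers):
    """Majority vote over many sampled answers (the lighter bar):
    the answer that comes up most often is the one submitted."""
    return Counter(answers).most_common(1)[0][0]

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples is correct, given n total samples
    of which c were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

samples = ["42", "41", "42", "42", "43", "42", "40", "42"]  # illustrative only
print(consensus_answer(samples))           # -> '42'
correct = sum(a == "42" for a in samples)  # 5 of the 8 samples are right
for k in (1, 2, 4, 8):
    print(f"pass@{k} = {pass_at_k(len(samples), correct, k):.3f}")
```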
00:08:09.420 | Yes, of course, I'm going to get to Arc AGI, but I just want to throw in my first
00:08:13.180 | quick caveat. What happens if you can't benchmark it, or at least it's harder to
00:08:17.740 | benchmark or the field isn't as susceptible to reasoning steps? How about personal writing,
00:08:23.740 | for example? Well, as OpenAI admitted back in September, the O series of models starting
00:08:28.780 | with O1 preview is not preferred on some natural language tasks, suggesting that it's not
00:08:34.180 | well suited for all use cases. Again then, think of a task. Is there an objectively correct
00:08:40.500 | answer to that task? The O series will likely soon beat it. As O3 proved tonight, that's
00:08:47.340 | regardless of how difficult that task is. Is the correctness of the answer or the quality
00:08:52.780 | of the output more a matter of taste, however? Well, that might take longer to beat.
00:08:57.780 | What about core reasoning, though? Out of distribution generalization? What I started
00:09:02.720 | this channel to cover back at the beginning of last year. Forgetting about cost or latency
00:09:07.740 | for a moment, what we all want to know is how intrinsically intelligent are these models?
00:09:12.260 | That will dictate everything else, and I will raise that question through three examples
00:09:17.500 | to end the video. The first is compositionality, which came in a famous paper in Nature published
00:09:24.140 | last year. Essentially, you test models by making up a language full of concepts like
00:09:30.100 | between, or double, or colours, and see if they can compose those concepts into a correct
00:09:36.620 | answer. The concepts are abstract enough that they would of course never have been seen
00:09:41.520 | in the training data. The original GPT-4 flopped hard at this challenge in the paper in Nature,
00:09:47.460 | and O1 Pro mode gets close, but still can't do it. After thinking for 9 minutes, it successfully
00:09:55.460 | translates "who" as "double", but doesn't quite understand "moreau". It thinks it's
00:10:02.180 | something about symmetry, but doesn't grasp that it means between. Will O3 master compositionality?
00:10:08.940 | I can't answer that question because I can't yet test it.
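For a sense of what such a test looks like, here is a toy compositional mini-language in the spirit of that Nature paper (Lake and Baroni's setup). The vocabulary mirrors the example above ("who" meaning double, "moreau" meaning between), but the grammar and colour words are my own illustrative inventions, not the paper's actual stimuli.

```python
# Toy compositional language: colour primitives plus function words.
# All words and rules here are invented for illustration.
PRIMITIVES = {"dax": "RED", "wif": "BLUE", "lug": "GREEN"}

def interpret(tokens):
    """Evaluate a phrase left to right:
       <primitive>           -> one colour
       <expr> who            -> the expression's output, doubled
       <expr> moreau <prim>  -> that primitive placed between two
                                copies of the expression's output
    """
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in PRIMITIVES:
            out = out + [PRIMITIVES[tok]]
        elif tok == "who":
            out = out + out              # 'double' everything so far
        elif tok == "moreau":
            middle = PRIMITIVES[tokens[i + 1]]
            out = out + [middle] + out   # put it 'between'
            i += 1                       # consume the extra primitive
        i += 1
    return out

# A model is shown a few input/output pairs, then asked about a novel phrase:
print(interpret(["dax", "who"]))            # ['RED', 'RED']
print(interpret(["wif", "moreau", "lug"]))  # ['BLUE', 'GREEN', 'BLUE']
```

The point of the test is that the composition rules are novel: a model can only succeed by inducing them from a handful of examples, not by recalling them from training data.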
00:10:12.260 | Next is of course my own benchmark called SimpleBench. This video was originally meant
00:10:17.020 | to be a summary of the 12 days: I was going to show off Veo 2 and talk about Gemini 2.0
00:10:22.700 | Flash Thinking Experimental from Google. The thinking, this time in visible chains of thought,
00:10:28.300 | is reminiscent then of the O series of models. On the 3 runs we've done so far, it scores
00:10:33.460 | around 25%, which is great for such a small model as Flash, but isn't quite as good
00:10:39.320 | as even their own model, Gemini Experimental 1206. For this particular day of shipmas,
00:10:45.580 | we are though putting Google to one side because OpenAI have produced O3.
00:10:50.300 | So here's what I'm looking out for in O3 to see whether it would crush SimpleBench.
00:10:56.140 | Essentially it needs to master spatial reasoning. Now you can pause and read the question yourself,
00:11:01.900 | but I helpfully supplied O1 Pro mode with this visual as well. And without even reading
00:11:07.100 | the question, what would you say would happen to this glove if it fell off of the bike?
00:11:12.980 | And let's say I also supplied you with the speed of the river. Well you might well say
00:11:17.140 | to me, thanks for all of those details, but honestly the glove is just going to fall onto
00:11:21.540 | the road. O1 doesn't even consider that possibility, and never does, because spatial
00:11:27.820 | data isn't really in its training data, nor is sophisticated social reasoning data.
00:11:33.380 | Wait, let me caveat that, of course we don't know what is in the training data, I just
00:11:38.140 | suspect it's not in the training data of O1 at least. Likely not in O3, but we don't
00:11:43.460 | know. Is the base model for O3 Orion, or what would have been GPT-4.5 or GPT-5? OpenAI never
00:11:49.900 | mentioned a shift in what the base model was, but they haven't denied it either. Someone
00:11:54.860 | could make the argument that O3 is so good at something like physics that it can intuit
00:12:00.300 | for itself what would happen in spatial reasoning scenarios. Maybe, but we'd have to test it.
00:12:06.180 | What I do have to remind myself though, with SimpleBench and spatial reasoning more generally,
00:12:11.540 | is it doesn't strike me perhaps as a fundamental limitation for the model going forward. As
00:12:16.700 | I said right at the start of this video, OpenAI have, with O3,
00:12:21.420 | fundamentally demonstrated the extent of a generalizable approach to solving things. In other words,
00:12:27.100 | with enough spatial reasoning data, and good spatial reasoning benchmarks, and some more
00:12:32.220 | of that scaled up reinforcement learning, I think models would get great at this too.
00:12:36.700 | And frankly, even if benchmarks like SimpleBench can last a little bit longer because
00:12:41.340 | of a paucity of spatial reasoning data, or text based spatial reasoning data not being
00:12:46.300 | enough, you have simulators like Genesis that can model physics and give models like O3
00:12:54.140 | almost infinite training data of lifelike simulations. You could almost imagine O3 or
00:12:59.500 | O4 being unsure of an answer, spinning up a simulation, spotting what would happen and
00:13:05.180 | then outputting the answer.
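A crude illustration of that simulate-then-answer idea, using the earlier glove question. The physics here is deliberately trivial and every number is invented; a real simulator like Genesis would model far richer dynamics, but the decision flow (unsure, simulate, observe, answer) is the point.

```python
def simulate_glove_drop(x_offset_m=0.5, deck_half_width_m=2.0):
    """Toy simulation step: a dropped glove has no sideways velocity,
    so it falls straight down and lands on whatever is directly beneath
    it. x_offset_m is the glove's horizontal distance from the bridge
    centreline; the river only matters if the drop point is past the edge."""
    return "road" if abs(x_offset_m) <= deck_half_width_m else "river"

# A model unsure of its answer could run the simulation before replying:
print(simulate_glove_drop())               # -> 'road' (the river speed is a distractor)
print(simulate_glove_drop(x_offset_m=3.0)) # -> 'river', only if dropped past the edge
```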
00:13:06.740 | And now at last, what about Arc AGI? I made an entire video not that long ago about how
00:13:12.940 | this particular challenge created by Francois Chalet was a necessary but not sufficient
00:13:18.660 | condition for AGI. The reason why O3 beating this benchmark is so significant is because
00:13:25.180 | each example is supposed to be a novel test. A challenge, in other words, that's deliberately
00:13:31.100 | designed not to be in any training data, past or present. Beating it therefore has to involve
00:13:37.940 | at least a certain level of reasoning.
00:13:40.420 | In case you're wondering by the way, I think reasoning is actually a spectrum. I define
00:13:45.140 | it as deriving efficient functions and composite functions. LLMs therefore always have done
00:13:52.040 | a form of reasoning, it's just that their functions that they derive are not particularly
00:13:56.980 | efficient. More like convoluted interpolations. Humans tend to spot things quicker, have more
00:14:02.620 | meta rules of thumb. And with these more meta rules of thumb, we can generalise better and
00:14:08.920 | solve challenges that we haven't seen before more efficiently. Hence why many humans can
00:14:13.620 | see what has occurred to get from input 1 to output 1, input 2 to output 2. GPT-4 couldn't
00:14:21.740 | and even O1 couldn't really. And for these specific examples, even O3 can't. Yes, it
00:14:28.820 | might surprise you, there are still questions that aren't crazy hard that O3 can't get
00:14:34.620 | right. Nevertheless, O3, when given maximal compute, what I've calculated it at being
00:14:40.860 | 350 grand's worth, gets 88%. And here's what the author of that benchmark said. "This
00:14:48.580 | isn't just brute force. Yes, it's very expensive, but these capabilities are new
00:14:54.120 | territory and they demand serious scientific attention. We believe," he said, "it represents
00:15:00.020 | a significant breakthrough in getting AI to adapt to novel tasks." Reinforced again and
00:15:06.180 | again with those chains of thought or reasoning steps that led it to correct answers, O3 has
00:15:11.260 | gotten pretty good at deriving efficient functions. In other words, it reasons pretty well.
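"Deriving a function" can be made concrete with a classic, very basic ARC-style baseline: search a small DSL of grid transformations for a program consistent with the training pairs, then apply it to the test input. This sketch is mine, with a deliberately tiny DSL; real ARC solvers, let alone whatever O3 does internally, are far more sophisticated.

```python
import numpy as np

# A deliberately tiny DSL of grid transformations.
DSL = {
    "identity":  lambda g: g,
    "flip_lr":   lambda g: np.fliplr(g),
    "flip_ud":   lambda g: np.flipud(g),
    "rot90":     lambda g: np.rot90(g),
    "transpose": lambda g: g.T,
}

def induce_program(train_pairs):
    """Return the first DSL function consistent with every
    (input, output) training pair: 'deriving a function' by
    brute-force search rather than insight."""
    for name, fn in DSL.items():
        if all(np.array_equal(fn(np.array(x)), np.array(y))
               for x, y in train_pairs):
            return name
    return None

train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]]),   # each output is the input
         ([[5, 0], [0, 5]], [[0, 5], [5, 0]])]   # mirrored left-to-right
rule = induce_program(train)
print(rule)                                   # -> 'flip_lr'
print(DSL[rule](np.array([[7, 8], [9, 0]])))  # apply the induced rule to a new input
```

The gap Chollet cares about is efficiency: brute-force search over a fixed DSL doesn't generalise, whereas humans, and increasingly the O-series, derive the right function from just a couple of examples.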
00:15:17.220 | Now Chollet has often mentioned in the past that many of his smart friends scored around
00:15:22.620 | 98% in Arc AGI. But a fairly recent paper from September showed that when an exhaustive
00:15:30.360 | study was done on average human performance, it was 64.2% on the public evaluation set.
00:15:38.140 | Chollet himself predicted two and a half years ago that there wouldn't be a "pure" transformer
00:15:44.020 | based model that gets greater than 50% on previously unseen Arc tasks within a time
00:15:49.780 | limit of five years. Again, I want to give you a couple of quick
00:15:53.020 | caveats before we get to his assessment of whether O3 is AGI. One OpenAI researcher admitted
00:16:00.240 | that it took 16 hours to get O3 to 87.5%, an increase rate of around 3.5% of tasks
00:16:08.580 | solved an hour. And another caveat, this time from Chollet's public statement on O3. OpenAI apparently
00:16:15.340 | requested that they didn't publish the high compute costs involved in getting that high
00:16:20.300 | score. But they kind of did anyway, saying the amount of compute was roughly 172x the
00:16:27.180 | low compute configuration. If the low compute, high efficiency retail cost was $2,000, by
00:16:34.100 | my calculation (172 x $2,000 ≈ $344,000), that's around $350,000 to get the 87.5%. If your day job is solving
00:16:41.740 | Arc AGI challenges and you're paid less than $350,000 a year, you're safe just for
00:16:47.700 | now. And of course, if you're crazy worried by cost, there's always O3 mini, which gets
00:16:52.720 | close to the performance of O3 for a fraction of the cost. But more seriously, he said later
00:16:57.580 | in the statement, cost performance will likely improve quite dramatically over the
00:17:02.420 | next few months and years. So you should plan for these capabilities to become competitive
00:17:07.900 | with human work within a fairly short timeline. The challenge was always to get models to
00:17:13.940 | reason. The costs and latency came second. Those can drop later with more GPUs, Moore's
00:17:20.420 | law and algorithmic efficiency. It's the crushing of these challenges that was the
00:17:25.620 | hard part. Cost is not a barrier that's going to last long. Now, Chollet does go on
00:17:30.100 | to say that O3 still fails on some very easy tasks. And you might argue that that Arc challenge
00:17:36.180 | I showed just earlier was such an example. The blocks move essentially in the direction
00:17:41.540 | of the lines that protrude out of them. And he mentions that he's crafting a so-called
00:17:46.580 | Arc AGI 2 benchmark that he thinks will still pose a significant challenge to O3, potentially
00:17:53.080 | reducing its score to under 30%. Sounds like he's almost already tested it. He goes on,
00:17:58.340 | "Even at high compute, while a smart human would still be able to score over 95% with
00:18:03.740 | no training." Notice that's smart human rather than average human though. And also
00:18:08.180 | it's kind of like O3 is under 30%, but what about O4, O5? What if even O6 is released
00:18:14.980 | before the end of 2025? That's maybe why Mike Knoop, the funder of the Arc $1 million
00:18:21.860 | prize, says, "We want AGI benchmarks that can endure many years. I do not expect V2
00:18:28.020 | will." And so, cryptically, he says, "We're also starting to turn attention to V3, which
00:18:33.340 | will be very different." That sets up the crucial definition then of
00:18:37.420 | what counts as AGI. Is it still not AGI as long as there's any benchmark that the average
00:18:43.780 | human can outperform a model at? Chollet's position, at least as of tonight, is that
00:18:48.660 | he doesn't believe that O3 is AGI. The reason? Because it's still feasible to create unsaturated,
00:18:55.700 | not crushed, interesting benchmarks that are easy for humans yet impossible for AI, without
00:19:01.700 | involving specialist knowledge. In sum, we will have AGI when creating such evals becomes
00:19:07.980 | outright impossible. The question is, is that a fair marker? Does it have to be impossible
00:19:13.900 | to create such a benchmark? One that humans can beat easily, yet is impossible for AI?
00:19:19.780 | Or should the definition of AGI be when it's harder to create a benchmark that's easier
00:19:26.260 | for humans than it is for AI? In a way, that seems like a fairer definition, such that
00:19:31.800 | there isn't just a single benchmark out there that's holding out and the rest have
00:19:36.220 | fallen, and we're still saying not AGI. That of course leaves the question of is it
00:19:40.740 | harder to create a benchmark that O3 can't solve and yet is easy for humans? Do we consider
00:19:45.620 | different modalities? Can it spot the lack of realism in certain AI generated videos?
00:19:51.020 | What kind of benchmarks are allowed or are not allowed? What about benchmarks where we
00:19:55.220 | factor in how quickly challenges are solved? I alas can't provide a satisfying answer
00:20:01.100 | for those of you who want a simple yes/no AGI or not. What I can do though is shine
00:20:07.200 | a light on the significance of this achievement. Again, it's not about particular benchmarks.
00:20:12.660 | It's about an approach that can be used again and again on whatever benchmark you
00:20:17.380 | create and to whatever scale you can pay for. It's almost like they've shown that
00:20:21.940 | they can defeat the very concept of a benchmark. Yes, of course I read the paper released tonight
00:20:27.500 | by OpenAI on deliberative alignment. Essentially, they use these same reasoning techniques to
00:20:32.500 | get the models to be great at refusing harmful requests while also not over-refusing innocent
00:20:38.540 | ones. Noam Brown, who is one of the research leads for O1, said that the FrontierMath result
00:20:45.100 | actually had safety implications. He said even if LLMs are dumb in some ways, and of
00:20:50.300 | course I can't yet test O3 on SimpleBench, nor even O1, they haven't yet given me API
00:20:55.800 | access. He went on: "Saturating evals, like FrontierMath, suggests AI is surpassing top
00:21:02.220 | human intelligence in certain domains." The first implication of that, he said, is
00:21:06.920 | that we may see a broad acceleration in scientific research. But then he went on: "This also
00:21:11.920 | means that AI safety topics, like scalable oversight, may soon stop being hypothetical.
00:21:18.280 | Research in these domains needs to be a priority for the field." Scalable oversight, in
00:21:23.000 | a ridiculous nutshell, is answering the question of how essentially a dumber model, or dumber
00:21:29.120 | human, can still have oversight over a smarter model.
00:21:32.440 | This then is one of the co-creators of O3 saying we really need to start focusing on
00:21:38.640 | safety. It's perhaps then more credible when OpenAI researchers like John Holman say
00:21:43.600 | this, "When Sam and us researchers say AGI is coming, we aren't doing it to sell you
00:21:48.720 | Kool-Aid, a $2,000 subscription, or to trick you to invest in our next round. It's actually
00:21:54.880 | coming." Whatever you've made of O3 tonight, let me
00:21:57.400 | know in the comments, I personally can't wait to test it.
00:22:01.020 | This has been a big night in AI, and thank you so much for joining me on it.
00:22:05.640 | As always, we'd love to see you over on Patreon, where I'll be continuing the discussion
00:22:10.360 | and actually fairly soon releasing a mini-documentary on the fateful year 2015 when OpenAI started.
00:22:17.080 | But regardless, wherever you are, have a wonderful day.