
o3 breaks (some) records, but AI becomes pay-to-win


Chapters

0:00 Introduction
0:33 FictionLiveBench
1:37 PHYBench
2:14 SimpleBench
2:54 Virology Capabilities Test
3:13 Mathematics Performance
4:29 Vision Benchmarks
5:43 V* and how o3 works
6:44 Revenue and costs for you
8:54 Expensive RL and trade-offs
9:40 How to spend the OOMs
13:27 Gray Swan Arena

Whisper Transcript

00:00:00.000 | This video is about rapid progress in AI, progress that might soon be a little less
00:00:05.660 | US-centric, with news of a veteran OpenAI researcher being denied a green card.
00:00:12.320 | But it's been just a single-digit number of days since the release of O3, the latest model from
00:00:18.260 | OpenAI, and it has broken some records, and in turn raised yet more questions.
00:00:23.800 | So in no particular order, and drawing on a half dozen papers, here are four updates on
00:00:30.000 | the state of play at the bleeding edge of AI.
00:00:32.520 | Now just before we get to how much money these models will make for companies like OpenAI
00:00:36.700 | and Google, and how much money they will cost you, which model is actually the best at the
00:00:42.160 | moment?
00:00:42.400 | Well, that's actually really hard to say, because it depends heavily on your use case and the
00:00:48.100 | benchmark that you look at.
00:00:49.420 | At the moment, the two clear contenders for me would be O3 and Gemini 2.5 Pro, and I covered
00:00:56.020 | how they were neck and neck in some of the most famous benchmarks in the video I released
00:01:00.420 | on the night of O3's release.
00:01:02.220 | But since then, we've arguably got some more interesting benchmark results.
00:01:05.840 | Take the piecing together of puzzles within long works of fiction, up to say around 100,000
00:01:12.340 | words.
00:01:12.740 | I honestly expected Gemini 2.5 Pro to keep its lead, in that it could piece together those
00:01:18.020 | puzzles, even at various lengths through to the longest texts.
00:01:21.620 | After all, long context is Gemini's speciality.
00:01:24.740 | But no, O3 takes the lead at almost every length of text.
00:01:29.760 | If you know that there's a clue in chapter 3 that pertains to chapter 16, then O3 is the
00:01:35.900 | model for you.
00:01:36.480 | Who cares about that, some of you will say.
00:01:38.300 | What about physics and spatial reasoning?
00:01:40.260 | Well, here is a brand new benchmark from less than 72 hours ago, and we can compare those
00:01:46.200 | top two contenders.
00:01:47.040 | We have Gemini 2.5 Pro in the lead, followed by O3 High.
00:01:51.880 | And bear in mind that Gemini 2.5 Pro is four times cheaper than O3.
00:01:57.320 | Notice though for reference that human expert accuracy on this benchmark still far exceeds
00:02:02.700 | the best model.
00:02:03.540 | Imagine you had to learn about all sorts of realistic physical interactions, predominantly
00:02:07.960 | through reading text, not experiencing the world.
00:02:11.040 | You would probably have the same problems.
00:02:13.900 | And honestly, this explains much of the discrepancy between the top two models and the human baseline
00:02:19.960 | on my own benchmark, SimpleBench.
00:02:21.740 | Those two models are starting to see through all the tricks on my benchmark, but they're still
00:02:25.440 | failing quite badly on spatial reasoning.
00:02:27.680 | This isn't a question from Simple Bench or the physics benchmark, but it illustrates
00:02:31.200 | the point that if, for example, you put your right palm on your left shoulder and then loop
00:02:36.020 | your left arm through the gap between your right arm and your chest, well, you're probably
00:02:41.280 | following, but models have no idea what's going on.
00:02:44.080 | It's not in their training data and they can't really visualize what's happening.
00:02:47.140 | I will come back to this example later though, because soon with tools, I could see them
00:02:51.640 | getting this question right.
00:02:52.680 | Speaking of getting questions right, we learned that O3 beats out Gemini 2.5 Pro on a test
00:02:58.300 | of troubleshooting complex virology lab protocols.
00:03:02.060 | O3, you will be glad to know, gets a 94th percentile score.
00:03:07.160 | This is, of course, a text-based exam and isn't the same as actually conducting those protocols
00:03:11.680 | in the lab.
00:03:12.340 | You might notice I'm balancing things out, because now it's time for a benchmark in which Gemini 2.5 Pro
00:03:17.300 | exceeds the performance of O3: competition mathematics.
00:03:21.000 | Now, you may have heard on the grapevine that O3 and O4 Mini actually got state-of-the-art
00:03:26.480 | scores on AIME 2025.
00:03:28.400 | That is a high school maths competition.
00:03:31.000 | Without tools, both models got around 90%, but with tools, they got over 99%.
00:03:36.580 | What you may not know is that AIME is just one of the tests used to qualify for the USAMO.
00:03:43.000 | That is a significantly harder proof-based maths test.
00:03:46.420 | Notice all of these are high school tests though, which is very different from professional
00:03:50.500 | mathematics.
00:03:51.500 | Anyway, on the USAMO, you can see here that we have O3 on high settings, getting around
00:03:57.820 | 22% right compared to 24% for Gemini 2.5.
00:04:02.060 | Again, four times cheaper for Gemini.
00:04:04.160 | What's perhaps more interesting is that the USAMO is only a qualifier for the hardest high
00:04:10.440 | school math competition.
00:04:11.400 | That's the International Math Olympiad.
00:04:13.400 | And Google has a system, AlphaProof, that got a silver medal in that competition.
00:04:19.880 | Now, I've done other videos on AlphaProof, and I would predict that in this year's competition
00:04:24.960 | in July, Google might get gold.
00:04:28.000 | Back to some more down-to-earth domains though.
00:04:30.520 | What about simple visual challenges like this one?
00:04:33.080 | Given an image, can the model answer: is the squirrel climbing up the fence,
00:04:37.500 | or is the squirrel climbing down the fence, across these two images?
00:04:40.560 | Or, as another question, are these two dogs significantly different in size?
00:04:44.460 | This benchmark is called NaturalBench.
00:04:47.080 | And you probably guessed, because I'm alternating which model comes out on top, that O3 actually scores better than
00:04:52.380 | Gemini 2.5.
00:04:53.560 | Both, of course, still well behind human performance.
00:04:56.980 | Despite that first impression, it's actually Gemini 2.5 Pro that scores better at geoguessing,
00:05:02.560 | being given a random street view and knowing which country and location within that country
00:05:08.120 | you're looking at.
00:05:09.120 | In fact, the difference is quite stark, with 2.5 Pro way exceeding O3 High.
00:05:14.240 | Now that I think of it, it's probably not too surprising given Google's ownership of Google Maps, Google
00:05:18.400 | Earth, and of course, YouTube.
00:05:19.920 | And Waymo.
00:05:20.920 | Last benchmark, I promise, but how about visual puzzles?
00:05:24.360 | Which kite has the longest string?
00:05:27.080 | Here the answer is C.
00:05:28.840 | And overall, on the visual puzzles benchmark, we have Gemini 2.5 Pro even underperforming
00:05:34.680 | O1, let alone O3.
00:05:36.580 | Both, of course, still well behind the average human, let alone an expert human.
00:05:41.900 | Now allow me, if you will, 30 more seconds before we get to the question of money, because
00:05:46.060 | OpenAI basically gave away the V* method they use to improve so much in vision.
00:05:52.120 | You may have noticed how O3 seems to zoom in to answer a question, but what's the executive
00:05:57.280 | summary of V*?
00:05:58.600 | Essentially, the model gets overwhelmed by a high-resolution image.
00:06:02.600 | So what the method does is it uses a multimodal LM to guess at what part of the image is going
00:06:09.640 | to be most relevant to the question.
00:06:11.320 | That part of the image is then cropped, added to the visual working memory, the context of
00:06:16.360 | the model, along with the original image, and submitted with the question.
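What follows is a minimal sketch of that crop-and-answer loop, purely to illustrate the idea just described. It is not OpenAI's actual o3 implementation or the V* paper's code: propose_region, crop, and answer are hypothetical callables standing in for a real multimodal model API and an image library.

```python
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) of a crop region

def visual_search_answer(
    image: Any,
    question: str,
    propose_region: Callable[[List[Any], str], Optional[Box]],
    crop: Callable[[Any, Box], Any],
    answer: Callable[[List[Any], str], str],
    max_crops: int = 3,
) -> str:
    """Answer a question about a high-resolution image V*-style: repeatedly crop the
    region a multimodal model guesses is most relevant, keep the original image plus
    every crop in a 'visual working memory', then answer from that combined context."""
    working_memory: List[Any] = [image]  # the original image always stays in context
    for _ in range(max_crops):
        # Ask the model which part of the image likely contains the answer.
        region = propose_region(working_memory, question)
        if region is None:  # the model thinks it already sees enough detail
            break
        # The zoomed-in patch joins the context alongside the full image.
        working_memory.append(crop(image, region))
    return answer(working_memory, question)
```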
00:06:20.040 | You can see that in action when I gave O3 this "Where's Wally?" (or, as Americans say, "Where's Waldo?") image.
00:06:25.960 | The language model speculates that Waldo tends to show up in places like a top vantage point or a
00:06:31.320 | walkway. So it decides to crop that area. Now I will say, in keeping with the other benchmarks we saw,
00:06:37.000 | it wasn't actually able to find Waldo, and I was, although it took me about three minutes, I'll be
00:06:42.120 | honest. Okay, those are the state-of-the-art models in AI. But where is this all heading? Well, to $174
00:06:48.280 | billion of revenue for OpenAI in 2030, according to themselves. In a moment, I'll touch on what that
00:06:54.760 | means for you in terms of price, but actually that prediction seems pretty reasonable to me.
00:06:59.480 | Even though in 2024 they made just $4 billion, I could see that growing extremely rapidly. I would
00:07:04.760 | note though that even the biggest of these figures is far less than 1% of the value of white-collar
00:07:10.600 | labor globally, so someone has to be spectacularly wrong. Either, as I suspect, we won't get a country of
00:07:17.400 | geniuses in a data center in 2026-2027, or these figures are spectacular underestimates. Here then
00:07:25.000 | are some of my very summarized thoughts about why I think AI is becoming, maybe has already become,
00:07:31.160 | pay to win. Or, to put it another way, why you or I might have to pay more and more to
00:07:36.280 | stay at the cutting edge of AI. We got news just the other day that Google is planning their own
00:07:41.240 | premium plus and premium pro tiers. Probably on the order of $100, $200 a month,
00:07:46.840 | just like OpenAI and very recently Anthropic as well. Now think about it, if AGI or Superintelligence
00:07:52.760 | was, quote, one simple trick away, one algorithmic tweak or a quick little scale-up of RL, well then
00:07:59.160 | these companies' incentives would be to get that AGI out as soon as possible to everyone, safety
00:08:05.080 | permitting. Capture market share as they all tend to want to do, gain monopolies, and then further down
00:08:10.680 | the road charge for access to that AGI. If on the other hand performance can be bought through sheer
00:08:16.280 | scaling up of compute, then someone is going to have to pay for that compute, namely you. Yes,
00:08:21.960 | we've had some quick gains going from O1 to O3 and even O4 Mini, but as the CEO of Anthropic said,
00:08:28.200 | that post-training or reasoning through reinforcement learning is soon going to be at the cost of billions
00:08:28.200 | and billions of dollars. Nor is post-training magic: it can't actually create reasoning paths not
00:08:33.720 | found in the original base model. That's according to a very new paper out of Tsinghua University. If you're
00:08:44.120 | interested in my deep dive on that paper and the previous one you just saw, I've just put up a 20-minute
00:08:49.800 | video on my Patreon. Thank you as ever to everyone who supports the channel via Patreon. Now as the former
00:08:54.840 | chief research officer at OpenAI said, that doesn't mean that there isn't lots of low-hanging fruit in
00:08:59.880 | reasoning or post-training. But he nevertheless predicts that soon reasoning will quote "catch up" to
00:09:05.800 | pre-training, in the sense of providing log-linear returns. As in, you have to put in 10 times the investment to get
00:09:11.320 | one more increment of progress. Also bear in mind that Sam Altman recently called OpenAI a product
00:09:16.840 | company as much as a model company. It's a little bit like they're kind of taking their eye off the
00:09:21.800 | AGI ball and focusing more on dollar returned per compute spend. These companies only have so many GPUs
00:09:28.840 | and TPUs to go around. Every time researchers are tempted toward a bigger base model or more post-training,
00:09:34.840 | Sam Altman has to judge that against rate limits for new users, new feature launches and latency. I know
00:09:40.920 | this research from Epoch AI was mainly focused on scaling up training runs or pre-training the base
00:09:47.000 | models, but very broadly speaking it predicted by 2030 having, say, a hundred thousand times the effective
00:09:53.800 | compute that was used in 2022 for the training of GPT-4. Even if hypothetically by 2030 we had five orders
00:10:02.200 | of magnitude more compute than we have say today, think of all the competing demands on that compute
00:10:07.560 | OpenAI would have if they're to achieve a hundred and seventy four billion dollars of revenue. Their models
00:10:12.680 | by parameter count might be a thousand times bigger on average by then as compared to now. Most free users
00:10:19.160 | until very recently were using around an eight-billion-parameter model, GPT-4o mini. But even if free users
00:10:25.560 | are now getting used to models the size of GPT-4o, GPT-4.5 is around 20 trillion parameters. Some say
00:10:31.880 | 12 trillion, but either way roughly two orders of magnitude more than GPT-4o. Of course, by then
00:10:36.760 | power users like me won't be using GPT-4.5 but probably GPT-5 or 6, 10 or 100 times bigger. Then
00:10:43.720 | there's the user base and even though OpenAI are serving 600 million monthly active users, by five
00:10:50.040 | years from now there might be six billion smartphone users. Google with Gemini recently quadrupled its
00:10:55.240 | user base in just a few months up to 350 million monthly active users. But that could easily 2x, 3x, 4x.
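To make the rough order-of-magnitude accounting in this stretch concrete, here is a back-of-the-envelope sketch. Every number is an illustrative figure from the narration (including factors mentioned in the next few lines, such as longer thinking, lower latency and heavier usage), not a measurement, and the narration's own total of roughly 12 orders of magnitude folds in further assumptions on top of these.

```python
import math

# Illustrative multiplicative demand factors taken from the narration, not data.
demand_factors = {
    "bigger models (maybe ~1000x the parameters on average)": 1_000,
    "user base (hundreds of millions -> billions of users)": 10,
    "models thinking for longer": 10,
    "lower latency (deep research in seconds rather than minutes)": 10,
    "more usage per user ('another 10x')": 10,
    "text-to-image and text-to-video on top": 10,
}

total = math.prod(demand_factors.values())
print(f"Combined demand multiplier: {total:,} (~{math.log10(total):.0f} orders of magnitude)")
# -> ~8 OOMs from these factors alone, versus the ~5 OOMs of extra effective
#    compute projected for 2030, hence "wouldn't be nearly enough".
```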
00:11:02.040 | All of that takes compute, and this is all before we get to models thinking for longer. Then there's
00:11:06.440 | latency. Deep research is amazing but it takes an average of say five to ten minutes. You can imagine
00:11:11.640 | spending an order of magnitude more compute to bring that down to say five seconds. Also don't forget
00:11:16.680 | there's usage per user. In this 2027 or 2030 scenario of AGI everyone is of course going to be using these
00:11:24.280 | chatbots way more than they are now. That's another 10x and that's all before we get to things like text to image,
00:11:30.440 | text to video with Sora. All of which is a long way of saying that I could imagine 12 orders of
00:11:35.000 | magnitude of effective compute being utilized by companies like OpenAI. That includes
00:11:40.120 | not just more chips, but also more efficient chips and better algorithms. Five orders of magnitude by 2030
00:11:45.640 | wouldn't be nearly enough. If you notice, none of that actually precludes there being a proto-AGI in the
00:11:52.440 | coming few years, albeit a very expensive one. Here's what a senior staff member at OpenAI said
00:11:58.040 | just a few days ago. OpenAI, he said, has defined AGI as "a highly autonomous system that can outperform
00:12:04.840 | humans at most economically valuable work. We definitely aren't there yet, far from it." You might have
00:12:10.440 | deduced the same with some of the benchmarks earlier in this video. But he goes on, "The AGI vibes are very real
00:12:16.920 | to me." Especially the way that O3 dynamically uses tools as part of its chain of thought. Again he says
00:12:22.680 | that does not mean we've achieved AGI now. In fact, it's a hill that he would die on that we have
00:12:28.840 | not. He ends though, and I agree with this, that things will go slow until they go fast, really fast.
00:12:34.840 | Things feel fast today, but I think we're actually still accelerating, and we will actually start to
00:12:39.880 | go even faster. If you're willing to spend the money, François Chollet, a famous AI researcher, said,
00:12:44.760 | going from cents per query up to tens of thousands of dollars per query, you can go from zero fluid
00:12:50.440 | intelligence to near human level fluid intelligence. After all, we're getting things like Anthropic's
00:12:55.240 | Model Context Protocol, where models now have a shared language to call tools of all types. And we know
00:13:01.320 | that tool calling was part of the reinforcement learning training of O3. So how long is it before
00:13:07.000 | O3, which arguably fails on anatomy questions like this, can call on open-source software like OpenSim
00:13:13.960 | and run a simulation? Entering the relevant parameters and running the code like they do with Code Interpreter,
00:13:19.080 | watching the resultant simulation. Soon almost any software could be sucked into the orbit of these
00:13:24.600 | models' training regimes. Now I will grant you that this presents all sorts of security problems that will
00:13:30.440 | have to be solved first. Which is why I'm going to introduce you to the sponsors of this video,
00:13:34.760 | Gray Swan. And you may be able to see, out of the corner of your eye, a $60,000 competition that's
00:13:41.560 | in progress, wherein you, you don't even have to be a professional researcher, can try to use image
00:13:46.840 | inputs to jailbreak leading vision-enabled AI models. I think it's pretty insane that you can be paid to
00:13:52.840 | exploit these vulnerabilities, and yet at the same time be boosting AI safety and security. These are
00:13:58.760 | incredibly legit competitions with public leaderboards monitored by OpenAI, Anthropic,
00:14:03.400 | and Google DeepMind. So wouldn't it be pretty epic if the winners of this competition turned out to
00:14:07.800 | have used my unique link, which you can find in the description. I will completely take full credit for
00:14:13.160 | your win and bask in the resultant glory. Of course, feel free to weigh in in the comments what you think
00:14:19.080 | about the new story that's currently going viral online. No doubt it's crazy times we live in,
00:14:24.760 | but thank you guys so much for watching to the end. I will never not be grateful for your viewership,
00:14:31.240 | so have an absolutely wonderful day.