o3 breaks (some) records, but AI becomes pay-to-win

Chapters
0:00 Introduction
0:33 FictionLiveBench
1:37 PHYBench
2:14 SimpleBench
2:54 Virology Capabilities Test
3:13 Mathematics Performance
4:29 Vision Benchmarks
5:43 V* and how o3 works
6:44 Revenue and costs for you
8:54 Expensive RL and trade-offs
9:40 How to spend the OOMs
13:27 Gray Swan Arena
This video is about rapid progress in AI, progress that might soon be a little less US-centric, with news of a veteran OpenAI researcher being denied a green card. But it's been just a single-digit number of days since the release of o3, the latest model from OpenAI, and it has broken some records and in turn raised yet more questions. So in no particular order, and drawing on a half dozen papers, here are four updates on the state of play at the bleeding edge of AI.
Now, just before we get to how much money these models will make for companies like OpenAI and Google, and how much money they will cost you: which model is actually the best at the moment? Well, that's actually really hard to say, because it depends heavily on your use case. At the moment, the two clear contenders for me would be o3 and Gemini 2.5 Pro, and I covered how they were neck and neck in some of the most famous benchmarks in the video I released previously. But since then, we've arguably got some more interesting benchmark results.
Take the piecing together of puzzles within long works of fiction, up to say around 100,000 tokens. I honestly expected Gemini 2.5 Pro to keep its lead, in that it could piece together those puzzles even at various lengths, through to the longest texts. After all, long context is Gemini's speciality. But no, o3 takes the lead at almost every length of text. If you know that there's a clue in chapter 3 that pertains to chapter 16, then o3 is the model to pick.

Well, here is a brand new benchmark from less than 72 hours ago, and we can compare those same two models. We have Gemini 2.5 Pro in the lead, followed by o3 high. And bear in mind that Gemini 2.5 Pro is four times cheaper than o3.
Notice though, for reference, that human expert accuracy on this benchmark still far exceeds both models. Imagine you had to learn about all sorts of realistic physical interactions predominantly through reading text, not experiencing the world. And honestly, this explains much of the discrepancy between the top two models and the human baseline on SimpleBench. Those two models are starting to see through all the tricks on my benchmark, but they're still well short of the human baseline. This isn't a question from SimpleBench or the physics benchmark, but it illustrates the point: if, for example, you put your right palm on your left shoulder and then loop your left arm through the gap between your right arm and your chest, well, you're probably following, but models have no idea what's going on. It's not in their training data, and they can't really visualize what's happening. I will come back to this example later though, because soon, with tools, I could see them getting it right.
Speaking of getting questions right, we learned that o3 beats out Gemini 2.5 Pro on a test of troubleshooting complex virology lab protocols. o3, you will be glad to know, gets a 94th-percentile score. This is, of course, a text-based exam and isn't the same as actually conducting those protocols. You might notice I'm balancing things out, because now for a benchmark in which Gemini 2.5 Pro exceeds the performance of o3: competition mathematics.
Now, you may have heard on the grapevine that o3 and o4-mini actually got state-of-the-art results on the AIME. Without tools, both models got around 90%, but with tools, they got over 99%. What you may not know is that the AIME is just one of the tests used to qualify for the USAMO, which is a significantly harder, proof-based maths test. Notice all of these are high school tests though, which is very different from professional mathematics. Anyway, on the USAMO, you can see here that we have o3 on high settings, getting around… What's perhaps more interesting is that the USAMO is only a qualifier for the hardest high school competition of them all, the International Mathematical Olympiad. And Google has a system, AlphaProof, that got a silver medal in that competition. Now, I've done other videos on AlphaProof, but I would predict that in this year's competition…
Back to some more down-to-earth domains though. What about simple visual challenges like this one? Given an image, can the model answer: is the squirrel climbing up the fence, or is the squirrel climbing down the fence, with these two images? Or, as another question, are these two dogs significantly different in size? And you probably guessed, because I'm alternating in performance: o3 actually scores better than Gemini 2.5 Pro here. Both, of course, are still well behind human performance. Despite that first impression, it's actually Gemini 2.5 Pro that scores better at geoguessing: being given a random street view and knowing which country, and which location within that country, it was taken in. In fact, the difference is quite stark, with 2.5 Pro way exceeding o3 high. Now I think of it, that's probably not too surprising given Google's ownership of Google Maps and Google Street View.
Last benchmark, I promise, but how about visual puzzles? Overall, on the visual puzzles benchmark, we have Gemini 2.5 Pro even underperforming o3. Both, of course, are still well behind the average human, let alone an expert human.
Now allow me, if you will, 30 more seconds before we get to the question of money, because OpenAI basically gave away the V* method they use to improve so much in vision. You may have noticed how o3 seems to zoom in to answer a question, but what's the executive summary? Essentially, the model gets overwhelmed by a high-resolution image. So what the method does is use a multimodal LLM to guess at what part of the image is going to be relevant to the question. That part of the image is then cropped, added to the visual working memory (the context of the model) along with the original image, and submitted with the question. You can see that in action when I gave o3 this "Where's Wally?" (or, as Americans say, "Where's Waldo?") image. The language model speculates that Waldo tends to show up in places like a top vantage point or a walkway, so it decides to crop that area. Now I will say, in keeping with the other benchmarks we saw, it wasn't actually able to find Waldo, and I was, although it took me about three minutes, I'll be honest.
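If you want a rough feel for that crop-and-re-ask loop, here's a minimal sketch. To be clear, the `ask_vlm` helper, the coordinate format, and the prompt wording are all hypothetical placeholders of mine; OpenAI hasn't published o3's actual pipeline as code.

```python
# A minimal sketch of a V*-style "guided visual search" loop.
# `ask_vlm` is a hypothetical stand-in for any multimodal-LLM API call.
from PIL import Image

def ask_vlm(images, prompt):
    """Placeholder for a multimodal-LLM call (hypothetical)."""
    raise NotImplementedError("wire up your VLM of choice here")

def answer_with_zoom(image_path, question):
    full = Image.open(image_path)
    # Step 1: ask the model which region likely contains the answer.
    box_reply = ask_vlm(
        [full],
        f"Return x0,y0,x1,y1 pixel coords of the region most relevant to: {question}",
    )
    x0, y0, x1, y1 = (int(v) for v in box_reply.split(","))
    # Step 2: crop that region so fine detail survives downscaling.
    crop = full.crop((x0, y0, x1, y1))
    # Step 3: re-ask with both the original image and the crop in context,
    # i.e. the "visual working memory" described above.
    return ask_vlm([full, crop], question)
```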
Okay, those are the state-of-the-art models in AI. But where is this all heading? Well, to $174 billion of revenue for OpenAI in 2030, according to themselves. In a moment, I'll touch on what that means for you in terms of price, but actually that prediction seems pretty reasonable to me. Even though in 2024 they made just $4 billion, I could see that growing extremely rapidly. I would note, though, that even the biggest of those figures is far less than 1% of the value of white-collar labor globally, so someone would have to be spectacularly wrong: either, as I suspect, we won't get a "country of geniuses in a data center" in 2026-2027, or these figures are spectacular underestimates.
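Just to sanity-check that forecast for a second (the two revenue figures are the ones above; the rest is arithmetic):

```python
# Implied compound annual growth rate from $4B (2024) to $174B (2030).
start, end, years = 4e9, 174e9, 6
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.0%} per year")  # ~88% per year, sustained for six years
```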
Here then are some of my very summarized thoughts on why I think AI is becoming, maybe has already become, pay-to-win. Or, another way of putting that, why you or I might have to pay more and more and more to stay at the cutting edge of AI. We got news just the other day that Google is planning their own Premium Plus and Premium Pro tiers, probably on the order of $100 to $200 a month, just like OpenAI and, very recently, Anthropic as well. Now think about it: if AGI or superintelligence were, quote, "one simple trick away", one algorithmic tweak or a quick little scale-up of RL, well then these companies' incentive would be to get that AGI out as soon as possible to everyone, safety permitting: capture market share, as they all tend to want to do, gain monopolies, and then, further down the road, charge for access to that AGI. If, on the other hand, performance can be bought through sheer scaling-up of compute, then someone is going to have to pay for that compute, namely you.
Yes, we've had some quick gains going from o1 to o3 and even o4-mini, but as the CEO of Anthropic said, post-training, or reasoning trained through reinforcement learning, is soon going to come at a cost of billions and billions of dollars. Nor is post-training magic: it can't actually create reasoning paths not found in the original base model. That's according to a very new paper out of Tsinghua University. If you're interested in my deep dive on that paper and the previous one you just saw, I've just put up a 20-minute video on my Patreon. Thank you, as ever, to everyone who supports the channel via Patreon.
Now, as the former chief research officer at OpenAI said, that doesn't mean there isn't lots of low-hanging fruit in reasoning or post-training. But he nevertheless predicts that soon reasoning will, quote, "catch up" to pre-training in the sense of providing log-linear returns: you have to put in ten times the investment to get one increment more of progress.
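To put that "log-linear" shape in code for a second (the constants here are made up purely for illustration; only the shape of the curve is the point):

```python
import math

# Log-linear returns: score = a + b * log10(C / C0), so each 10x of
# compute over the baseline C0 buys the same fixed increment b.
a, b, C0 = 50.0, 10.0, 1e24  # illustrative constants, not real data

def score(C):
    return a + b * math.log10(C / C0)

for C in (1e24, 1e25, 1e26):
    print(f"{C:.0e} FLOP -> score {score(C):.0f}")
# 1e+24 -> 50, 1e+25 -> 60, 1e+26 -> 70: each 10x buys the same +10.
```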
Also bear in mind that Sam Altman recently called OpenAI a product company as much as a model company. It's a little bit like they're taking their eye off the AGI ball and focusing more on dollars returned per unit of compute spent. These companies only have so many GPUs and TPUs to go around: every time researchers are tempted toward a bigger base model or more post-training, Sam Altman has to judge that against rate limits for new users, new feature launches, and latency.

I know this research from Epoch AI was mainly focused on scaling up training runs, or pre-training the base models, but very broadly speaking it predicted having, by 2030, say a hundred thousand times the effective compute that was used in 2022 for the training of GPT-4. Even if, hypothetically, by 2030 we had five orders of magnitude more compute than we have today, think of all the competing demands on that compute OpenAI would face if they're to achieve $174 billion of revenue. Their models, by parameter count, might be a thousand times bigger on average by then as compared to now. Most free users until very recently were using around an eight-billion-parameter model, GPT-4o mini. But even if free users are now getting used to models the size of GPT-4o, GPT-4.5 is around 20 trillion parameters (some say 12 trillion), but either way roughly two orders of magnitude more than GPT-4o. Of course, by then power users like me won't be using GPT-4.5 but probably GPT-5 or 6, ten or a hundred times bigger. Then there's the user base: even though OpenAI are serving 600 million monthly active users, five years from now there might be six billion smartphone users. Google with Gemini recently quadrupled its user base in just a few months, up to 350 million monthly active users, but that could easily 2x, 3x, 4x. That takes compute, and this is all before we get to models thinking for longer. Then there's latency: Deep Research is amazing, but it takes an average of, say, five to ten minutes; you can imagine spending an order of magnitude more compute to bring that down to, say, five seconds. Also don't forget usage per user: in this 2027 or 2030 scenario of AGI, everyone is of course going to be using these chatbots way more than they are now. That's another 10x, and that's all before we get to things like text-to-image and text-to-video with Sora. All of which is a long way of saying that I could imagine 12 orders of magnitude of effective compute being utilized by companies like OpenAI, and that includes not just more chips but more efficient chips and better algorithms. Five orders of magnitude by 2030 wouldn't be nearly enough.
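If you tally up the rough multipliers I just listed, back-of-the-envelope style (every factor here is a ballpark from this discussion, not a measurement, and the last one is a loose catch-all):

```python
import math

# Back-of-the-envelope: multiply the ballpark demand factors above and
# count the resulting orders of magnitude (OOMs) of extra compute.
factors = {
    "models ~1000x bigger by parameter count": 1_000,
    "user base 600M -> ~6B monthly actives":   10,
    "latency: minutes -> seconds":             10,
    "usage per user in an AGI scenario":       10,
    "longer thinking, image and video, etc.":  100,  # loose catch-all
}
total = math.prod(factors.values())
print(f"{total:.0e}x, i.e. ~{math.log10(total):.0f} OOMs of extra demand")
# ~1e+08x, i.e. ~8 OOMs from these factors alone -- already well past
# the ~5 OOMs of effective compute projected by 2030.
```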
If you notice, none of that actually precludes there being a proto-AGI in the coming few years, albeit a very expensive one. Here's what a senior staff member at OpenAI said just a few days ago. OpenAI, he said, has defined AGI as "a highly autonomous system that can outperform humans at most economically valuable work. We definitely aren't there yet, far from it." You might have deduced the same from some of the benchmarks earlier in this video. But he goes on: "The AGI vibes are very real to me," especially the way that o3 dynamically uses tools as part of its chain of thought. Again, he says that does not mean we've achieved AGI now; in fact, it's a hill he would die on that we have in fact not. He ends, though, and I agree with this, that things will go slow until they go fast, really fast. Things feel fast today, but I think we're actually still accelerating, and we will actually start to go even faster. And if you're willing to spend the money, as François Chollet, a famous AI researcher, said: going from cents per query up to tens of thousands of dollars per query, you can go from zero fluid intelligence to near-human-level fluid intelligence.
After all, we're getting things like Anthropic's Model Context Protocol, where models now have a shared language to call tools of all types. And we know that tool calling was part of the reinforcement learning training of o3. So how long is it before o3, which arguably fails on anatomy questions like this, can call on open-source software like OpenSim, enter the relevant parameters, run the code like they do with Code Interpreter, and watch the resultant simulation? Soon, almost any software could be sucked into the orbit of these models' training regimes.
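To give you a flavour of what that shared tool language looks like, here's a minimal sketch of an MCP tool server using the FastMCP helper from the official Python SDK. The gait-simulation tool, its parameters, and the OpenSim hookup are hypothetical placeholders of mine, not something OpenAI or Anthropic has shipped:

```python
# A minimal sketch of exposing a simulation tool over Anthropic's
# Model Context Protocol, via the FastMCP helper in the Python SDK.
# The tool body is a placeholder: a real server would call OpenSim's
# own Python bindings here.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("biomechanics-sim")

@mcp.tool()
def run_gait_simulation(model_file: str, duration_s: float = 1.0) -> str:
    """Run a (placeholder) musculoskeletal simulation and summarize it."""
    # In a real server: load `model_file` with OpenSim, integrate for
    # `duration_s`, and return joint angles, muscle forces, etc.
    return f"simulated {model_file} for {duration_s}s (placeholder result)"

if __name__ == "__main__":
    mcp.run()  # serve the tool so any MCP-capable model can call it
```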
Now, I will grant you that that presents all sorts of security problems which will have to be solved first, which is why I'm going to introduce you to the sponsors of this video, Gray Swan. And you may be able to see, out of the corner of your eye, a $60,000 competition that's in progress, wherein you (you don't even have to be a professional researcher) can try to use image inputs to jailbreak leading vision-enabled AI models. I think it's pretty insane that you can be paid to exploit these vulnerabilities and yet, at the same time, be boosting AI safety and security. These are incredibly legit competitions, with public leaderboards, monitored by OpenAI, Anthropic, and Google DeepMind. So wouldn't it be pretty epic if the winners of this competition turned out to have used my unique link, which you can find in the description? I will completely take full credit for your win and bask in the resultant glory. Of course, feel free to weigh in in the comments with what you think about the news story that's currently going viral online. No doubt it's crazy times we live in, but thank you guys so much for watching to the end. I will never not be grateful for your viewership.