This video is about rapid progress in AI, progress that might soon be a little less US-centric, with news of a veteran OpenAI researcher being denied a green card. But it's been just a single-digit number of days since the release of o3, the latest model from OpenAI, and it has broken some records and in turn raised yet more questions.
So in no particular order, and drawing on half a dozen papers, here are four updates on the state of play at the bleeding edge of AI. Now, just before we get to how much money these models will make for companies like OpenAI and Google, and how much they will cost you: which model is actually the best at the moment?
Well, that's actually really hard to say, because it depends heavily on your use case and the benchmark you look at. At the moment, the two clear contenders for me would be o3 and Gemini 2.5 Pro, and I covered how they were neck and neck on some of the most famous benchmarks in the video I released on the night of o3's launch.
But since then, we've arguably gotten some even more interesting benchmark results. Take the piecing together of puzzles within long works of fiction, up to around 100,000 words. I honestly expected Gemini 2.5 Pro to keep its lead here, piecing together those clues at every length, right through to the longest texts.
After all, long context is Gemini's speciality. But no, o3 takes the lead at almost every length of text. If you need a model that knows a clue in chapter 3 pertains to chapter 16, then o3 is the model for you. Who cares about that, some of you will say. What about physics and spatial reasoning?
Well, here is a brand new benchmark from less than 72 hours ago, and we can compare those top two contenders. We have Gemini 2.5 Pro in the lead, followed by o3 on high settings. And bear in mind that Gemini 2.5 Pro is four times cheaper than o3. Notice, though, for reference, that human expert accuracy on this benchmark still far exceeds that of the best model.
Imagine you had to learn about all sorts of realistic physical interactions predominantly by reading text, not by experiencing the world. You would probably have the same problems. And honestly, this explains much of the discrepancy between the top two models and the human baseline on my own benchmark, SimpleBench. Those two models are starting to see through all the tricks on my benchmark, but they're still failing quite badly at spatial reasoning.
This isn't a question from SimpleBench or the physics benchmark, but it illustrates the point: if, for example, you put your right palm on your left shoulder and then loop your left arm through the gap between your right arm and your chest, well, you're probably following along, but models have no idea what's going on.
It's not in their training data, and they can't really visualize what's happening. I will come back to this example later, though, because soon, with tools, I could see them getting this question right. Speaking of getting questions right, we learned that o3 beats out Gemini 2.5 Pro on a test of troubleshooting complex virology lab protocols.
o3, you will be glad to know, gets a 94th-percentile score. This is, of course, a text-based exam and isn't the same as actually conducting those protocols in the lab. You might notice I'm balancing things out, because now for a benchmark in which Gemini 2.5 Pro exceeds the performance of o3: competition mathematics.
Now, you may have heard on the grapevine that o3 and o4-mini actually got state-of-the-art scores on AIME 2025. That is a high school maths competition. Without tools, both models got around 90%, but with tools, they got over 99%. What you may not know is that AIME is just one of the tests used to qualify for the USAMO.
That is a significantly harder, proof-based maths test. Notice that all of these are high school tests, though, which is very different from professional mathematics. Anyway, on the USAMO, you can see here that we have o3 on high settings getting around 22% right, compared to 24% for Gemini 2.5 Pro. Again, four times cheaper for Gemini.
What's perhaps more interesting is that the USAMO is itself only a qualifier for the hardest high school maths competition: the International Mathematical Olympiad. And Google has a system, AlphaProof, that got a silver medal in that competition. Now, I've done other videos on AlphaProof, but I suspect that in this year's competition in July, Google might get gold.
Back to some more down-to-earth domains, though. What about simple visual challenges like this one? Given these two images, can the model answer whether the squirrel is climbing up the fence or down it? Or, as another question, are these two dogs significantly different in size?
This benchmark is called NaturalBench. And as you probably guessed, because I'm alternating, o3 actually scores better than Gemini 2.5 Pro. Both, of course, are still well behind human performance. Despite that first impression, it's actually Gemini 2.5 Pro that scores better at geoguessing: being given a random Street View image and working out which country, and which location within that country, you're looking at.
In fact, the difference is quite stark, with 2.5 Pro far exceeding o3 on high settings. Now that I think of it, that's probably not too surprising given Google's ownership of Google Maps, Google Earth, and of course YouTube. And Waymo. Last benchmark, I promise, but how about visual puzzles? Which kite has the longest string?
Here the answer is C. And overall, on the visual puzzles benchmark, we have Gemini 2.5 Pro underperforming even o1, let alone o3. Both, of course, are still well behind the average human, let alone an expert human. Now allow me, if you will, 30 more seconds before we get to the question of money, because OpenAI basically gave away the V* method they used to improve so much in vision.
You may have noticed how o3 seems to zoom in to answer a question, but what's the executive summary of V*? Essentially, the model gets overwhelmed by a high-resolution image. So the method uses a multimodal LM to guess which part of the image is going to be most relevant to the question.
That part of the image is then cropped, added to the visual working memory, the context of the model, alongside the original image, and submitted with the question. You can see that in action when I gave o3 this "Where's Wally?" (or, as Americans say, "Where's Waldo?") image. The language model speculates that Waldo tends to show up in places like a top vantage point or a walkway.
So it decides to crop that area. Now, I will say, in keeping with the other benchmarks we saw, it wasn't actually able to find Waldo, and I was, although it took me about three minutes, I'll be honest.
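To make that loop concrete, here is a minimal sketch in Python. To be clear, ask_mllm and the coordinate format are my own hypothetical stand-ins, not OpenAI's actual implementation:

```python
# A minimal sketch of the guided visual search loop described above.
# ask_mllm is a hypothetical stand-in for a multimodal LM API call;
# this is not OpenAI's actual implementation.
from PIL import Image

def ask_mllm(prompt: str, images: list) -> str:
    raise NotImplementedError("stand-in for a multimodal LM API call")

def guided_visual_search(image: Image.Image, question: str) -> str:
    # Step 1: the model guesses which region is most relevant to the question
    hint = ask_mllm(
        "Which region of this image is most relevant to the question "
        f"'{question}'? Reply as left,top,right,bottom pixel coordinates.",
        [image],
    )
    left, top, right, bottom = (int(v) for v in hint.split(","))

    # Step 2: crop that region at full resolution, so fine detail survives
    crop = image.crop((left, top, right, bottom))

    # Step 3: add the crop to "visual working memory" (the context)
    # alongside the original image, and answer with both in view
    return ask_mllm(question, [image, crop])
```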
Okay, those are the state-of-the-art models in AI. But where is this all heading? Well, to $174 billion of revenue for OpenAI in 2030, according to their own projections. In a moment, I'll touch on what that means for you in terms of price, but actually, that prediction seems pretty reasonable to me. Even though in 2024 they made just $4 billion, I could see that growing extremely rapidly.
I would note, though, that even the biggest of these figures is far less than 1% of the value of white-collar labor globally. So someone has to be spectacularly wrong: either, as I suspect, we won't get a "country of geniuses in a data center" in 2026 or 2027, or these figures are spectacular underestimates.
Here, then, are some of my very summarized thoughts on why I think AI is becoming, or maybe has already become, pay-to-win. Or, to put it another way, why you and I might have to pay more and more and more to stay at the cutting edge of AI. We got news just the other day that Google is planning its own Premium Plus and Premium Pro tiers.
Probably on the order of $100 or $200 a month, just like OpenAI and, very recently, Anthropic as well. Now, think about it: if AGI or superintelligence were, quote, one simple trick away, one algorithmic tweak or a quick little scale-up of RL, well then these companies' incentive would be to get that AGI out to everyone as soon as possible, safety permitting.
Capture market share as they all tend to want to do, gain monopolies, and then further down the road charge for access to that AGI. If on the other hand performance can be bought through sheer scaling up of compute, then someone is going to have to pay for that compute, namely you.
Yes, we've had some quick gains going from o1 to o3 and even o4-mini, but as the CEO of Anthropic has said, post-training, or reasoning through reinforcement learning, is soon going to come at the cost of billions and billions of dollars. Nor is post-training magic: it can't actually create reasoning paths not found in the original base model.
That's according to a very new paper out of Tsinghua University. If you're interested in my deep dive on that paper and the previous one you just saw, I've just put up a 20-minute video on my Patreon. Thank you, as ever, to everyone who supports the channel via Patreon. Now, as the former chief research officer at OpenAI said, that doesn't mean there isn't lots of low-hanging fruit in reasoning or post-training.
But he nevertheless predicts that reasoning will soon, quote, "catch up" to pre-training in the sense of providing log-linear returns. As in, you have to put in 10 times the investment to get one more increment of progress.
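To put a rough shape on that (this is my notation, not his):

```latex
% Log-linear returns, illustratively: capability C as a function of compute x
C(x) \approx a + b \log_{10} x
\quad\Longrightarrow\quad
C(10x) - C(x) \approx b
% each further increment b of capability costs ten times the compute
```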
Also bear in mind that Sam Altman recently called OpenAI a product company as much as a model company. It's a little as if they're taking their eye off the AGI ball and focusing more on dollars returned per unit of compute spent. These companies only have so many GPUs and TPUs to go around: every time researchers are tempted toward a bigger base model or more post-training, Sam Altman has to weigh that against rate limits for new users, new feature launches, and latency.
I know this research from Epoch AI was mainly focused on scaling up training runs, pre-training the base models, but very broadly speaking, it predicted that by 2030 we'd have, say, a hundred thousand times the effective compute that was used in 2022 to train GPT-4. Even if, hypothetically, by 2030 we had five orders of magnitude more compute than we have today, think of all the competing demands on that compute OpenAI would face if they're to achieve $174 billion of revenue.
Their models, by parameter count, might be a thousand times bigger on average by then compared to now. Until very recently, most free users were using roughly an eight-billion-parameter model, GPT-4o mini. But even if free users are now getting used to models the size of GPT-4o, GPT-4.5 is around 20 trillion parameters.
Some say 12 trillion, but either way, roughly two orders of magnitude more than GPT-4o. Of course, by then power users like me won't be using GPT-4.5 but probably GPT-5 or 6, 10 or 100 times bigger again. Then there's the user base: even though OpenAI are currently serving 600 million monthly active users, five years from now there might be six billion smartphone users.
Google, with Gemini, recently quadrupled its user base in just a few months, up to 350 million monthly active users. But that could easily 2x, 3x, or 4x. That takes compute, and this is all before we get to models thinking for longer. Then there's latency. Deep Research is amazing, but it takes an average of, say, five to ten minutes.
You can imagine spending an order of magnitude more compute to bring that down to, say, five seconds. Also, don't forget usage per user: in this 2027 or 2030 scenario of AGI, everyone is of course going to be using these chatbots way more than they do now. That's another 10x, and that's all before we get to things like text-to-image and text-to-video with Sora.
All of which is a long way of saying that I could imagine 12 orders of magnitude more effective compute being utilized by companies like OpenAI. That includes not just more chips but more efficient chips and better algorithms. Five orders of magnitude by 2030 wouldn't be nearly enough.
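As a back-of-the-envelope tally of those stacked demands (the multipliers here are my own illustrative guesses, not OpenAI figures):

```python
import math

# Illustrative multipliers on inference-compute demand by ~2030,
# stacking the factors just discussed (rough guesses, not OpenAI data)
demand_multipliers = {
    "bigger default models (~8B params -> multi-trillion)": 1e3,
    "user base (600M -> ~6B monthly actives)":              1e1,
    "usage per user in a near-AGI world":                   1e1,
    "latency (deep research in seconds, not minutes)":      1e1,
    "longer thinking and more tool calls per query":        1e1,
    "new modalities (image and video, e.g. Sora)":          1e1,
}

total = math.prod(demand_multipliers.values())
print(f"Stacked inference demand: ~10^{round(math.log10(total))} x today")
# -> ~10^8 x; layer training-compute scale-up and algorithmic efficiency
#    on top and 12 orders of magnitude of effective compute is conceivable
```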
Notice that none of that actually precludes there being a proto-AGI in the coming few years, albeit a very expensive one. Here's what a senior staff member at OpenAI said just a few days ago. OpenAI, he said, has defined AGI as "a highly autonomous system that can outperform humans at most economically valuable work," and "we definitely aren't there yet, far from it." You might have deduced the same from some of the benchmarks earlier in this video. But he goes on: "The AGI vibes are very real to me," especially in the way that o3 dynamically uses tools as part of its chain of thought. Again, he says, that does not mean we've achieved AGI now.
In fact, it's a hill he would die on that we have not. He ends, though, and I agree with this, by saying that things will go slow until they go fast, really fast. Things feel fast today, but I think we're actually still accelerating, and we will start to go even faster.
If you're willing to spend the money, said François Chollet, a famous AI researcher, going from cents per query up to tens of thousands of dollars per query, you can go from zero fluid intelligence to near-human-level fluid intelligence. After all, we're getting things like Anthropic's Model Context Protocol, where models now have a shared language for calling tools of all types.
And we know that tool calling was part of the reinforcement learning training of o3. So how long before o3, which arguably fails on anatomy questions like the one earlier, can call on open-source software like OpenSim and run a simulation? Enter the relevant parameters, run the code as it does with Code Interpreter, and watch the resultant simulation.
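As a taste of what that plumbing might look like, here's a minimal sketch using the Model Context Protocol's Python SDK; the run_arm_simulation tool and its pretend OpenSim wiring are hypothetical placeholders, not a real integration:

```python
# A minimal sketch of exposing a simulation tool over Anthropic's
# Model Context Protocol; the OpenSim wiring is a hypothetical placeholder.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("biomech-sim")

@mcp.tool()
def run_arm_simulation(description: str) -> str:
    """Simulate the described arm configuration and report the final pose."""
    # A real server would build and run an OpenSim model here; this stub
    # just echoes the request so the sketch stays self-contained.
    return f"Simulated pose for: {description}"

if __name__ == "__main__":
    mcp.run()  # serve the tool over stdio to any MCP-capable model
```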
Soon, almost any software could be sucked into the orbit of these models' training regimes. Now, I will grant you that presents all sorts of security problems that will have to be solved first. Which is why I'm going to introduce you to the sponsors of this video, Gray Swan. And you may be able to see, out of the corner of your eye, a $60,000 competition that's in progress, wherein you, and you don't even have to be a professional researcher, can try to use image inputs to jailbreak leading vision-enabled AI models.
I think it's pretty insane that you can be paid to exploit these vulnerabilities and yet, at the same time, be boosting AI safety and security. These are incredibly legit competitions, with public leaderboards, monitored by OpenAI, Anthropic, and Google DeepMind. So wouldn't it be pretty epic if the winners of this competition turned out to have used my unique link, which you can find in the description?
I will, of course, take full credit for your win and bask in the resultant glory. Feel free to weigh in in the comments with what you think about the news story that's currently going viral online. No doubt these are crazy times we live in, but thank you guys so much for watching to the end.
I will never not be grateful for your viewership, so have an absolutely wonderful day.