I have a flight to catch in a few hours, so this is going to be a much speedier version of my normal videos, but O4 Mini has been released along with O3 from OpenAI, and it's generating insane hype like this. But is the hype justified? Call me cynical, but they do lean towards giving early access to people that they know are going to massively hype up their new models.
Now, don't get me wrong, they are much better than previous models like O1, they're just not above genius level. I can prove that in multiple ways, but I'm just going to give you a selection, and yes, I have read most of the system card and tested the model 20 times already.
I should say both models because it's O4 Mini and O3. For those completely new to AI, by the way, these are now the best models within ChatGPT, which is different, of course, from Google's Gemini 2.5 Pro or Anthropix Claude 3.7. Others, like Tyler Cohen, says that O3 is AGI, and honestly, I hear people talk about Tyler Cohen.
I have no idea who he is, but I don't believe O3 is AGI. For me, AGI is when a model can perform better than the human average at most tasks that a human can do. For knowledge, coding, and mathematics, that is absolutely true, as long as you don't focus on experts, but in general, it's not so much the case.
There are many examples I could have chosen, but check this one out. Here are five lines, and I asked how many overlaps do they have in total. It said there are eight distinct points where the five drawn line segments intersect. Now, I know you guys, and you might pause and say, well, if you extrapolate the lines, maybe they intersect eight times.
Oh, it's not wrong. Nah, this is exactly what O3 meant. Here, for example, is one point of intersection. Anyway, I can't explain my entire theory about AGI here. This is a very brief video, but it is definitely not hallucination free. That is complete BS, and OpenAI know it. It's a great model, a big improvement.
I think O4 Mini is possibly comparable to Gemini 2.5 Pro on a good day, but it's definitely not hallucination free. Both models are trained from the ground up to use tools, and I think that's an epic way to improve models, and they will get even more useful very fast.
In case you think I am dehyping O3 too much, it was the first model to get six out of ten on the first ten public questions of my own benchmark, SimpleBench. I was genuinely impressed with some of its answers, and even though I disagree that it would take a fairly old man three to five minutes to climb up to the top of a skyscraper, it nevertheless got the question right for the first time of any OpenAI model.
But back to the theme you heard earlier, it can sometimes make some pretty basic mistakes, like saying that a glove that falls out of the trunk or boot of a car, which is going over a bridge, seems likely to fall into the river, because the trunk area is open and the river is directly below.
Well, what happened to the bridge? As I hinted at in my previous video, released just, wait, three, four hours ago, it does sometimes get this question right, because I kind of indirectly got early access to O3, but more often than not it gets it wrong. I don't know about you, but if I met someone who was above genius level, I would expect them to at least consider that the glove might fall out of the car onto the bridge itself.
O4 Mini High gets four out of ten on the public set of questions, which is actually really not bad for a small quick model. Both models, by the way, are coming to the plus tier, which is really making me start to wonder about paying for the pro tier. But anyway, here are the prices.
These numbers probably don't mean much to you, but the key point of comparison is with O3 and Gemini 2.5 Pro. Very roughly speaking, Gemini 2.5 Pro is about three to four times cheaper than O3, so bear that in mind when you see the benchmark results in a second. The first benchmark result is on the overlap test, and yes, Gemini gets it right.
Oh, and by the way, speaking of multi-modality, Gemini 2.5 Pro can handle YouTube videos and just raw videos and O3 can't. I initially got really excited when I saw that you could upload videos to O3, but it's just doing an analysis of the metadata. Now, I know O3 is trained to use tools natively, and I was particularly impressed with the way that it analyzed my benchmark website, created an image, a cover image for it, and did a deep dive analysis, which was pretty much accurate.
It gave speculation about why the front runners were better, and honestly, some pretty nuanced advice about the benchmark itself and its limitations. Now, some of you may say, what about the O3 wow video that I made earlier? And first of all, AI moves very quickly, so even four or five months can change a lot in AI.
It's still a wow model, just not as wow in comparison with, say, Gemini 2.5, or even sometimes Clawed 3.7 thinking. There is another key detail, though, that they slipped into the presentation. They said that O3, the one I covered back in December, was, quote, benchmark optimized. The ones we're getting, as the ARC prize confirmed, are smaller in terms of compute or less thinking time than the version that they tested.
Now, I am presuming that what they meant by benchmark optimized was that they let O3 think for a lot longer, have a lot more inference time compute. In other words, we are not quite getting the model that crushed ARC AGI. Some other quick details before we get to the benchmarks, and both models have a 200,000 token context window, think roughly 150,000 words, but they can output up to 80,000 words, which I think is pretty cool.
Their knowledge cutoff, which you can think of as the limits of their training data is June 1st, 2024. That compares to January 2025 for Gemini 2.5 Pro. It seems, and I haven't had time to check this, that they are still basing it on GPT-4-0, hence the lack of an updated training cutoff date.
Time for some benchmarks, and don't hate on me for literally just getting a screen grab from the YouTube video. For competitive mathematics, O3 and O4 mini do extremely well on a data set that couldn't have been in their training data. For reference, on this benchmark, Gemini 2.5 Pro gets around 86%.
Now, with multiple attempts, Grok 3 gets 93%, but we don't quite know how many attempts this was for OpenAI. I would presume just first attempt. Either way, as the narrator said, when you allow these models to have tools, they are extremely good to the point of saturating some of these competitive math benchmarks.
Likewise, for competitive code, if you can benchmark it, they can crush it. These models, indeed, even other model families, are essentially eval maximizers, as I touched on in my video from, what, four hours ago. For PhD level science, you can see the results here, 83.3 and 81.4. For reference, Gemini gets 84%, and Claude 3.7 Sonnet gets 84.8%.
That's with multiple attempts for Claude, but just a single attempt for Gemini 2.5. So, Gemini 2.5 is better on a single attempt than either model. Kind of seems a little bit strange to be declaring AGI tonight, but not when Gemini 2.5 Pro came out. For me, neither are AGI, but I am not an AGI denialist.
I think it is coming in the next few years. The simplest version of my definition is when I would hire O4, for example, over a smart human being. Could they edit an entire video without random glitches or cutting off key images? For that matter, could they do my Amazon shopping without putting me 10 grand in debt?
I do get it. At coming up with quick drafts that seem incredibly intelligent and often are, they are incredible, far smarter than me in that sense. I couldn't get any of these scores in any of these exams. But it is fairly inadvisable to make comparisons to human IQ, because how many super crazy coders or PhDs do you know that could score like this but not count the number of overlaps?
Or think of there being a bridge beneath the glove that's falling? If it's in their training data, amazing. If it's not, on the MMMU, which is a bit like the MMLU in the sense of it spans many different domains, but it focuses on questions involving charts and tables and graphs and things like that, O3 gets 82.9%.
That is genuinely better than Gemini 2.5 Pro's 81.7%, so well done to OpenAI on that. Now, on humanity's last exam, which you can think of as a benchmark for really obscure knowledge, it was almost slightly disappointing for me, even though the previous record was OpenAI itself with deep research.
Now, in fairness, again, O3 beats Gemini 2.5 Pro, which got 18% in this benchmark. But given that deep research was powered by an early version, OpenAI said, of O3, I was expecting a little bit more, especially with the quote, "new and improved," Sam Altman said, O3, with Python and browsing tools.
Or even O4 for that matter, I thought they might have injected more knowledge into it. That is kind of harsh because they're just challenging their own record in this benchmark, so well done to them again. The release notes from OpenAI were fairly interesting, with evaluations by "external experts" having O3 making 20% fewer major errors.
That's great, but what happened to it being "hallucination-free"? If I saw Sam Altman retweet that a new model was "hallucination-free" and I was an average white-collar worker, I would be panicking. What's the real truth? It is absolutely not hallucination-free, and making fewer major errors is great, but still concedes that it does make major errors, as we've seen already.
On one part of Ada's Polyglot coding benchmark, we can see that O3 on high settings indeed sets a record. It's more than 10 points higher than Gemini 2.5 Pro. But you may remember that O1 on high settings, as in thinking for a long time using lots of chains of thought, cost almost $200.
That compares to $6 for Gemini. O3 high, in other words, may eke out Gemini 2.5 Pro, and therefore become widely used, but at an extreme cost. Or the TL;DR of all of that is even in those domains where O3 has taken the lead, it hasn't taken the cost-effective lead.
On Codex CLI, their agent that you can run from your terminal, they're clearly taking aim at Claude code, but obviously, given that it's only been around two and a half hours, I haven't had time to test it. Of course, that may be turbocharged if OpenAI buy Windsurf, the competitor to Cursor itself.
Of course, if you check out my previous video, Kevin Weil, the chief product officer of OpenAI, said very clearly that competitive coding is not always the same as front-end coding, for example, so you'll have to test it yourself. As always, it comes down to how much high-quality data there is in your domain, and indeed, a diversity of data.
As we all know, in machine learning, sometimes you can over-train on bits of data, as you can see in this example. Yes, this was O3, by the way. I bet some of the first comments will be, "What about testing both models on SimpleBench, given that the API is out tonight?" Well, given my flight, that will have to be my colleague, who is hopefully going to do it tonight, and so the results should be on the website tonight, fingers crossed.
I suspect it may just take the lead from Gemini 2.5 Pro, albeit costing a lot more. But as pretty much all of you know by now, we couldn't test O3 on SimpleBench without Weights and Biases, who are the sponsors of today's video. If you want to check out their Weave platform, you can do so via the SimpleBench website, or of course, with the link in the description.
We should be doing some Discord workshops on my Patreon to get you started on Weave. Essentially, it's what we use to benchmark these models. It's kind of like being in one of those Mercedes, where you can see like a thousand options for tweaking things, improving things, and comparing things.
And as I mentioned before, they have an AI Academy with free courses. Thanks as ever to Weights and Biases for keeping SimpleBench going. Now, the system card, which because of the flight for one of the first times in video history on AI Explained, I've only read part of. Nevertheless, here are some highlights with Meta finding examples of reward hacking.
Maxing their score, in other words, not by the model itself solving the challenge, but tweaking the parameters so it seems like it solved the challenge. A bit like hacking a game to change your score to win the game, and O3 did this roughly 1% of the time. Also, do you remember that paper recently from Meta that I've been discussing with their authors, where the length of a task that a model can do is doubling every seven months?
Well, there are many caveats to that paper, but Meta do say that when they analysed O3, they said they found capabilities exceeding those of other public models and surpassing their projections from previous capability scaling trends that we saw. In other words, the time horizon of software tasks that they complete with greater than 50% reliability may be doubling in less than seven months.
Obviously, I'll have to come back to some of this stuff in other videos, but here's another highlight. That O3 and O4 Mini are on the cusp of being able to meaningfully help novices create known biological threats. That would cross OpenAI's own high-risk threshold, which would mean that they couldn't even release the model.
I keep telling people this, that because of responsible scaling policies from Anthropic and OpenAI, they are promising people that they soon won't even be able to release certain models. That will be reassuring or disappointing to you depending on your perspective. Back to dehyping though, because I'm sure tonight you're going to see plenty of O3 is AGI screaming face thumbnails.
Just to drown out that noise, check out OpenAI research engineer interview performance. Look at that incredible exponential you can see from O1 up to O4. Well, kind of not really. Also Paperbench, which is testing whether AIs can replicate AI research papers. I bet this particular chart doesn't find itself in many AGI is here, crazy thumbnails that you see, or videos for that matter, that you see tonight.
Look at O1's performance, 24%, and O3 without browsing, 18%, O4 Mini, 25%. I do get it, it's not a perfect apples to apples, but this is not exactly an exponential. Obligatory caveat, I am expecting progress over the course of this year. I'm just saying not every chart backs up the AGI hype.
I'll leave you with these two more optimistic thoughts on O3. While it may be demanding more and more compute, performance does continue to rise and rise. And that's not even to mention letting the models think for longer, which is another entire axis we can exploit. As Noam Brown of OpenAI said, there is still a lot of room to scale both of these further.
So if you ignore the headlines and drown out the hype, which is to be honest, good advice for all times in life. O3 represents genuine progress. Well done to OpenAI. If you want to check out more ways that AI is improving, check out my video from around four hours ago.
And either way, have a wonderful day.