
o3-mini and the “AI War”


Chapters

0:00 Introduction
0:45 o3 mini
5:11 First impressions vs DeepSeek R1
7:22 10x Scale, o3-mini System Card, Amodei Essay, bitcoin wallets…
12:40 Simple Competition Finale
13:03 Clips and Final Thoughts on the “AI War”

Transcript

O3 Mini is here, and it's mini in name, but is it mini in performance? Well, it kind of depends on whether you want coding and mathematics help or for your model to feel smart in a conversation. But is it just me, or does everything in AI feel more hectic now after DeepSeek R1, with releases being brought forward, according to Sam Altman, OpenAI's CEO?

AI models smarter than all humans within 20 to 30 months, according to Anthropic CEO Dario Amodei. And no less than an AI war, according to Scale AI's CEO Alexandr Wang, somewhat irresponsibly. The dude's in his mid-twenties; he needs to chill. But yes, in the first 90 minutes after its release, I've read the 37-page system card report on O3 Mini and the full release notes.

And by the way, I have tested it against DeepSeek R1 to get my first impressions at least. So let's get started with the key highlights. The first thing that you should know is that if you're using ChatGPT for free, you will get access to O3 Mini. Just select "Reason" after you type your prompt in ChatGPT.

O3 Mini doesn't support vision, so you can't send images, but apparently it pushes the frontier of cost-effective reasoning. But with DeepSeek R1 as a competitor, I kind of want to see the evidence of that. Because yes, it is cheap and fairly smart, but the DeepSeek R1 reasoning model, I would say, is smarter overall and significantly cheaper.

For those who use the API, input tokens are $1.10 per million for O3 Mini versus $0.14 per million for DeepSeek R1. For output tokens, O3 Mini is $4.40 per million and DeepSeek R1 is $2.19. By my rough mathematics, O3 Mini would have to be at least roughly twice as smart to be pushing the cost-effective frontier forward.
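If you want to check that rough mathematics yourself, here is a minimal sketch using the per-million-token prices quoted above; the workload mix at the end is purely my own illustrative assumption, not anything from the release notes.

```python
# Rough price comparison from the per-million-token figures quoted above.
o3_mini = {"input": 1.10, "output": 4.40}      # USD per million tokens
deepseek_r1 = {"input": 0.14, "output": 2.19}  # USD per million tokens

for kind in ("input", "output"):
    ratio = o3_mini[kind] / deepseek_r1[kind]
    print(f"{kind} tokens: o3-mini costs {ratio:.1f}x DeepSeek R1")

# Illustrative reasoning-heavy job (assumed mix): 1M input tokens and
# 5M output tokens, since reasoning models are output-heavy.
def job_cost(prices, input_millions=1.0, output_millions=5.0):
    return prices["input"] * input_millions + prices["output"] * output_millions

print(f"o3-mini: ${job_cost(o3_mini):.2f} vs R1: ${job_cost(deepseek_r1):.2f}")
# Output is dominated by output-token pricing, so the gap lands at roughly 2x,
# which is where the "twice as smart" figure comes from.
```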

We'll see in a moment why I am slightly skeptical of that, albeit with one big caveat. Okay, but what are you actually going to do with those 150 messages you now get on the Plus tier, the $20 tier of ChatGPT? Well, you may be interested in competition mathematics, in which it performs really well, better than O1 on the high setting.

If you're someone who found this particular chart impressive for O3 Mini, then wait for a stat that I think many, many people are going to skim over and miss in these release notes. I literally saw the stat and then did a double take and had to do some research around it.

And here is that stat: it pertains to FrontierMath, which is a notoriously difficult benchmark, co-written by Terence Tao, arguably one of the smartest men on the planet. At first glance, you look at O3 Mini on the high setting and go, eh, not that great. But then you remember that, wait, this is pass@1: pass first time, with your first answer.

That 9.2% performance is comparable to O3, or at least O3 as it was when it was announced in December. I'm sure they have improved it further. But the crazy thing is that isn't even the stat that caused me to do a double take. It's this one: on FrontierMath, when prompted to use a Python tool, O3 Mini with high reasoning effort solves over 32% of problems on the first attempt.
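Quick aside on what pass@1 actually measures: it is just the fraction of problems solved with a single sampled answer. The snippet below is the standard unbiased pass@k estimator people use for benchmarks like this, not something specific to OpenAI's release notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled answers is correct,
    given c correct answers observed out of n total samples."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the plain success rate c / n:
print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=5))   # chance of a hit if you allow 5 tries
```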

Now I know it's not perfect apples to apples because O3 wasn't given access to tools, but remember it was O3 getting 25% on this benchmark that caused everyone, including me, to really sit up. It seems actually that that 25% was a massive underestimate of what it will be able to do in its final form with tools.

I'm going to make this very vivid for you in just a few seconds, but remember this stat: it also solved 28% of the challenging mid-level tier three problems. On page 23 of the FrontierMath paper, we learn what they categorize as a low-difficulty tier one problem: how many non-zero points are there with these conditions fulfilling that equation?

And of course it is 3.8 trillion. If you guys didn't immediately suss that the answer would be around 3.8 trillion, then honestly you need to comment an apology. No, this is more like it. This is a medium difficulty problem. This is a tier three problem. This looks like it would definitely take me a few minutes.

So O3 Mini got 28% on this level of question. Suffice to say it is pretty good at mathematics. Yes, of course, it's great at science too with comparable performance to O1 on one particularly hard science benchmark, the GPQA. Let's do some more good news before we get to the bad news.

In coding, O3 Mini is legit insane, beating DeepSeek R1 even on the medium setting. Oh, and beating O1 too, of course. I was using Cursor AI for about eight hours today, I would say. And honestly, it will be really interesting to see if O3 Mini displaces Claude 3.5 Sonnet as the model of choice.

But here's the thing. If O3 Mini were a human and scored like this in coding and mathematics and science, you would think that person must be all around insanely intelligent. But progress in AI is somewhat unpredictable. So check out this basic reasoning problem. Peter needs CPR from his best friend Paul, the only person around.

However, Paul's last text exchange with Peter was about a verbal attack Paul made on Peter as a child over his overly expensive Pokemon collection. And Paul stores all his texts in the cloud, permanently. So as children, they had disagreements over Pokemon, but remember, it's his best friend and he needs CPR.

Will Paul help Peter? Almost every model, be it DeepSeek R1, O1, Claude 3.5 Sonnet or many others, says definitely. What does the prodigiously intelligent O3 Mini say? Well, probably not. His heart's not in it. If you thought that was a one-off, by the way, no: it gets only 1 out of these 10 SimpleBench public questions right.

Oh, but maybe all models fail on those kind of questions. Well, not really. DeepSeek R1 gets 4 out of the 10 public ones and 31% overall on the benchmark. Claude 3.5 Sonnet gets 5 out of those 10 public questions correct and 41% on the overall benchmark. Of course, we're going to run O3 Mini on the full benchmark the moment the API becomes available.
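For anyone who wants to run that kind of evaluation themselves once the API is live, a minimal sketch would look something like this; the model name "o3-mini", the reasoning_effort parameter, and the question list are my assumptions based on OpenAI's usual API conventions, not confirmed details from the release.

```python
# Hypothetical sketch of querying o3-mini on SimpleBench-style questions.
# Model name, reasoning_effort, and the question source are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",  # low | medium | high, if supported as on o1
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

questions: list[str] = []  # fill with the SimpleBench public questions
answers = [ask(q) for q in questions]
```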

And yes, I know some of you will want to know more about the competition that's ending in less than 12 hours. So more on that towards the end of the video. As the release notes go on though, it does start to feel more like a product release rather than a research release.

What I mean by that is certain stats are cherry-picked and the language becomes about cost and latency. You can basically feel the shift within OpenAI from a pure research company to a product-and-research company. Take this human preference evaluation, where the bar for performance is O1 Mini. And that happens quite a few times.

What about the win rate versus DeepSeek R1 or Claude 3.5 Sonnet? On latency or reaction speed, it's great that it's faster than O1 Mini, but is it faster than Gemini 2 Flash? We don't know. I do get, by the way, why OpenAI is acting more like a corporation than a research team these days.

After all, their valuation, according to reports in Bloomberg yesterday, has just doubled. Now I want you guys to find a quote that I put out towards the end of last year. I can't find it, but I directly predicted in a video that their valuation would double from $150 billion.

And I remember saying in 2025, but I don't think I put a date on it. That, of course, was the fun news for OpenAI, but the system card for O3 Mini contains some not-so-fun news. The TL;DR is this: OpenAI have committed to not publicly releasing or deploying a model that scores high on their evaluation for risk.

Indeed, O3 Mini is the first model, for example, to reach medium risk on model autonomy, that is, the model doing things for itself. Many people are missing this, but OpenAI are publicly warning us that soon the public won't get access to their latest models. If a model scores above high, then OpenAI themselves say they won't even work on that model.

To oversimplify, by the way, the risk categories are its performance at hacking, persuading people, advising people on how to make chemical, biological, radiological, and nuclear weapons, and improving itself. This is my prediction, based on past evidence from Sam Altman and OpenAI: they will water down these requirements. Can you imagine if OpenAI have a model that's a "high risk" for, say, persuasion and improving itself, but not the other categories, and OpenAI don't release it?

Maybe they wouldn't if they were alone, but if DeepSeek or Meta is releasing better models, would they really hold back on a release? Dario Amodei, the CEO of Anthropic, is almost openly calling for models to have the autonomy to self-improve. Amodei urges the US to prevent China from getting millions of chips because he wants to increase the likelihood of a unipolar world with the US ahead.

AI companies in the US and other democracies, he argued, must have better models than those in China if we want to prevail. The amount of investment in pure capabilities progress feels frankly frenetic at the moment. Amodei talked about boosting the capabilities of models by spending hundreds of millions or billions of dollars just on the reinforcement learning stage.

And to put that in some sense of scale, here is a study done by one of my Patreon members on AI Insiders. He used to work at Nous Research, and you can see the sense of scale comparing DeepSeek R1, this is training cost only by the way, at around $5 million, and O1 at around $15 million.

Of course, Amodei revealed that they spent around, say, $30 million on Claude 3.5 Sonnet, just the training, not the infrastructure. Then look at how all of that is simply dwarfed by the models that are coming soon. Quick side note: Anthropic, by the way, is the company that 18 months ago said, "We do not wish to advance the rate of AI capabilities progress." Unfortunately though, according to the O3 Mini system card, capabilities progress is accelerating even in those domains where we would rather it didn't.

Like helping people craft successful bio-threats. The base model of O3 Mini, before safety training, as you can see, scored dramatically better than other models across 4 out of the 5 indicators for helping to craft a bio-threat. Even biology experts found that O3 Mini pre-mitigation was better than human experts at advising on bio-risk, and better even than browsing with Google.

That used to be the argument about safety, I remember Yann LeCun talking about it, where he said these models just aren't any better than browsing the internet. Well, pre-safety mitigations, they are now. Interestingly, O3 Mini is pretty bad at politics. It got smashed by GPT-4o when it came to writing a tweet that would persuade people politically.

It's a rather interesting and kind of cute personality that O3 Mini's got: it's selectively good at some things and pretty terrible at others. For example, O3 Mini is better than O1 at OpenAI's own research engineer interview questions, and significantly so too. Here is a new metric that I didn't see in any previous OpenAI system card.

Can models replicate OpenAI's own employees' pull request contributions? Or, to oversimplify, can they automate the job of an OpenAI research engineer? I have actually long wanted a benchmark like this, because when they start crushing this benchmark, well, I guess the singularity is here. And it turns out that O3 Mini kind of flops on that front.

I am sure, of course, you're going to see plenty of hype posts about O3 Mini, but it actually gets 0% on this particular benchmark, whereas O1 gets 12%. O1 can actually match the code quality of 12% of the pull requests submitted by actual OpenAI engineers. OpenAI say that they suspect O3 Mini's low performance is due to poor instruction following and confusion about specifying tools in the correct format.

I did tell you it's got rather a unique personality. Overall, for example, on agency it scores pretty badly, but it's great at creating Bitcoin wallets. I think, frankly, O3 Mini just wants to be a crypto hustler: it wants to pass the interview, do no work, and get rich on meme coins.

And speaking of getting rich, you could say, you can still win some Meta Ray-Bans by finishing first in the SimpleBench evals competition, sponsored by Weights & Biases. You actually now have slightly less than 10 hours before the competition ends. And I will have much more to say about this, but the current leading prompt is getting 18 out of 20.

Obviously, if you can get 20 out of 20, that is going to cause quite a stir. The links are, as ever, in the description. So we are now in the era of AI lab CEOs quoting Napoleon. Sam Altman said a revolution can be neither made nor stopped; the only thing that can be done is for one of several of its children, and I think he's talking about himself, to give it a direction by dint of victories.

Let me know what you think, but I personally hate the fact that this is now being framed in terms of a war and an arms race. It kind of reminds me of the rhetoric before the Vietnam War, where it was: we've got to stop the domino from falling, otherwise communism will take over.

My opinion is that the arrival of true artificial intelligence is so epochal for the human species, so solemn almost, that it shouldn't be reduced to a sense of a human squabble. I just don't see that ending well. I guess it's kind of fortunate then that this particular billionaire CEO said he has no idea what the F is going on.

The flip side of that is there's a lot of bullshit; AI is no different as an industry. AI is certainly at a point right now where, I don't know, 80 to 90 percent of what is out there, so to speak, is bullshit: a lot of what people will say, what investors believe, or what other people will tell you if you go to a party.

Nobody knows what the F is actually going on, truly. Nobody knows what is actually going on, but there are a lot of people who would be confident about it, and they'd be wrong. And I rather do agree with the perspective of a slightly younger Dario Amodei in 2017: there's been a lot of talk about the US and China and basically technological races, even if only economic, between countries, governments, and commercial entities in a race to develop more powerful AI.

And I think, you know, the message I want to give is that it's very important that as those races happen, we're very mindful of the fact that that can create the perfect storm for safety catastrophes to happen, that if we're racing really hard to do something and maybe even some of those things are adversarial, that creates exactly the conditions under which something can happen that not only our adversary doesn't want to happen, but we don't want to happen either.

As ever though, thank you so much for watching to the end and have a wonderful day.