
o3-mini and the “AI War”


Chapters

0:00 Introduction
0:45 o3 mini
5:11 First impressions vs DeepSeek R1
7:22 10x Scale, o3-mini System Card, Amodei Essay, bitcoin wallets…
12:40 Simple Competition Finale
13:03 Clips and Final Thoughts on the “AI War”

Whisper Transcript

00:00:00.000 | O3 Mini is here and it's Mini in name but is it Mini in performance?
00:00:05.520 | Well, it kind of depends on whether you want coding and mathematics help or for your model
00:00:11.380 | to feel smart in a conversation.
00:00:14.060 | But is it just me, or does everything in AI feel more hectic now after DeepSeek R1? Releases
00:00:20.720 | being brought forward, according to Sam Altman, OpenAI's CEO.
00:00:24.880 | AI models smarter than all humans within 20 to 30 months according to the Anthropic CEO
00:00:32.940 | Dario Amodei.
00:00:34.040 | And no less than an AI war, according to Scale AI's CEO Alexandr Wang, somewhat irresponsibly.
00:00:41.420 | The dude's in his mid-twenties, he needs to chill.
00:00:44.520 | But yes, in the first 90 minutes after its release, I've read the 37-page system card
00:00:49.980 | report on O3 Mini and the full release notes.
00:00:53.760 | And by the way, I have tested it against DeepSeek R1 to get my first impressions at least.
00:00:58.520 | So let's get started with the key highlights.
00:01:01.220 | The first thing that you should know is that if you're using ChatGPT for free, you will
00:01:05.880 | get access to O3 Mini.
00:01:07.880 | Just select reason after you type your prompt in ChatGPT.
00:01:11.200 | O3 Mini doesn't support vision, so you can't send images, but apparently it pushes the
00:01:16.420 | frontier of cost-effective reasoning.
00:01:19.000 | But with DeepSeek R1 as a competitor, I kind of want to see the evidence of that.
00:01:24.360 | Because yes, it is cheap and fairly smart, but the DeepSeek R1 reasoning model, I would
00:01:29.960 | say is smarter overall and significantly cheaper.
00:01:34.020 | For those who use the API, input tokens are $1.10 per million for O3 Mini versus $0.14
00:01:42.640 | per million for DeepSeek R1.
00:01:44.640 | For output tokens, O3 Mini is $4.40 and for DeepSeek R1 $2.19.
00:01:50.400 | By my rough mathematics, O3 Mini would have to be roughly twice as smart at least to be
00:01:57.240 | pushing the cost-effective frontier forward.
00:01:59.800 | We'll see in a moment why I am slightly skeptical of that, albeit with one big caveat.
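To make that back-of-envelope comparison concrete, here is a rough sketch of the arithmetic, using only the list prices quoted above; the 3:1 input-to-output token mix is my own illustrative assumption, not a figure from either pricing page.

```python
# List prices (USD per million tokens) as quoted above; the usage mix is an assumption.
PRICES = {
    "o3-mini":     {"input": 1.10, "output": 4.40},
    "deepseek-r1": {"input": 0.14, "output": 2.19},
}

def blended_cost(model: str, input_share: float = 0.75) -> float:
    """Blended cost per million tokens for a given input/output split."""
    p = PRICES[model]
    return input_share * p["input"] + (1 - input_share) * p["output"]

for model in PRICES:
    print(f"{model}: ${blended_cost(model):.2f} per million tokens (blended)")

ratio = blended_cost("o3-mini") / blended_cost("deepseek-r1")
print(f"o3-mini costs roughly {ratio:.1f}x as much per token as DeepSeek R1")
```

With that assumed mix, the ratio comes out at roughly 3x, which is where the "roughly twice as smart at least" intuition comes from.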
00:02:04.240 | Okay, but what are you actually going to do with those 150 messages you now get on the
00:02:08.640 | plus tier or the $20 tier of ChatGPT?
00:02:11.640 | Well, you may be interested in competition mathematics in which it performs really well,
00:02:17.240 | better than O1 on the high setting.
00:02:20.360 | If you're someone who found this particular chart impressive for O3 Mini, then wait for
00:02:26.080 | a stat that I think many, many people are going to skim over and miss in these release
00:02:30.800 | notes.
00:02:31.800 | I literally saw the stat and then did a double take and had to do some research around it.
00:02:35.960 | And here is that stat and it pertains to frontier math, which is a notoriously difficult benchmark.
00:02:41.440 | Co-written by Terence Tao, arguably one of the smartest men on the planet.
00:02:45.680 | At first glance, you look at O3 Mini on high setting, you go, eh, not that great.
00:02:50.120 | But then you remember that wait, this is pass at one, pass first time with your first answer.
00:02:55.320 | That 9.2% performance is comparable to O3, or at least O3 as it was when it was announced
00:03:01.600 | in December.
00:03:02.600 | I'm sure they have improved it further.
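For anyone unfamiliar with the notation: pass@1 means the model's very first sampled answer is the one that gets graded, with no retries. The standard unbiased pass@k estimator (from OpenAI's Codex paper, not something specific to these release notes) is a one-liner; this is just a sketch for context, with names of my own choosing.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem:
    n samples drawn, c of them correct, pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws, so at least one must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the plain fraction of samples that were correct.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```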
00:03:04.220 | But the crazy thing is, that isn't even the stat that caused me to do a double
00:03:08.820 | take.
00:03:09.820 | In this one on frontier math, when prompted to use a Python tool, O3 Mini with high reasoning
00:03:15.160 | effort solves over 32% of problems on the first attempt.
00:03:20.480 | Now I know it's not perfect apples to apples because O3 wasn't given access to tools, but
00:03:24.600 | remember it was O3 getting 25% on this benchmark that caused everyone, including me, to really
00:03:30.480 | sit up.
00:03:31.480 | It seems actually that that 25% was a massive underestimate of what it will be able to do
00:03:36.240 | in its final form with tools.
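If you want to poke at those same settings yourself once you have API access, here is a minimal sketch, assuming the standard OpenAI Python SDK and the reasoning effort setting the release notes describe; the prompt and the idea of asking for Python scratch work are purely illustrative, not the actual FrontierMath harness.

```python
# Minimal sketch: query o3-mini at high reasoning effort via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment; the prompt below is illustrative only.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # accepted values: "low", "medium", "high"
    messages=[
        {
            "role": "user",
            "content": (
                "Solve the following problem. You may write Python as scratch work "
                "and show your reasoning before giving a final answer: ..."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```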
00:03:38.320 | I'm going to make this very vivid for you in just a few seconds, but remember this stat,
00:03:42.960 | it also got 28% of the mid-level tier three challenging problems.
00:03:48.280 | On page 23 of the frontier math paper, we learn what they categorize as a low difficulty
00:03:55.840 | tier one problem.
00:03:57.240 | How many non-zero points are there with these conditions fulfilling that equation?
00:04:01.920 | And of course it is 3.8 trillion.
00:04:04.640 | If you guys didn't immediately suss that the answer would be around 3.8 trillion, then
00:04:08.760 | honestly you need to comment an apology.
00:04:11.480 | No, this is more like it.
00:04:13.160 | This is a medium difficulty problem.
00:04:15.280 | This is a tier three problem.
00:04:17.120 | This looks like it would definitely take me a few minutes.
00:04:20.680 | So O3 Mini got 28% on this level of question.
00:04:27.240 | Suffice to say it is pretty good at mathematics.
00:04:29.960 | Yes, of course, it's great at science too with comparable performance to O1 on one particularly
00:04:36.000 | hard science benchmark, the GPQA.
00:04:38.740 | Let's do some more good news before we get to the bad news.
00:04:42.200 | In coding, O3 Mini is, legit, insane, beating DeepSeek R1 even on medium settings.
00:04:47.800 | Oh, and beating O1 too, of course.
00:04:50.640 | I was using Cursor AI for about eight hours today, I would say.
00:04:54.920 | And honestly, it will be really interesting to see if O3 Mini displaces Claude 3.5 Sonnet
00:04:59.720 | as the model of choice.
00:05:00.720 | But here's the thing.
00:05:01.720 | If O3 Mini were a human and scored like this in coding and mathematics and science, you
00:05:07.600 | would think that person must be all around insanely intelligent.
00:05:11.520 | But progress in AI is somewhat unpredictable.
00:05:15.160 | So check out this basic reasoning problem.
00:05:18.120 | Peter needs CPR from his best friend Paul, the only person around.
00:05:22.840 | However, Paul's last text exchange with Peter was about a verbal attack Paul made on Peter
00:05:29.520 | as a child over his overly expensive Pokemon collection.
00:05:33.640 | And Paul stores all his texts in the cloud, permanently.
00:05:36.920 | So as children, they had disagreements over Pokemon, but remember, it's his best friend
00:05:42.080 | and he needs CPR.
00:05:43.980 | Will Paul help Peter?
00:05:46.040 | Almost every model, be it DeepSeek R1, O1, Claude 3.5 Sonnet or many others, says definitely.
00:05:52.520 | What does the prodigiously intelligent O3 Mini say?
00:05:56.560 | Well, probably not.
00:05:59.000 | His heart's not in it.
00:06:00.700 | If you thought that was a one-off, by the way, no, it gets only 1 out of these 10 SimpleBench
00:06:06.960 | public questions.
00:06:07.960 | Oh, but maybe all models fail on those kind of questions.
00:06:10.800 | Well, not really.
00:06:11.960 | DeepSeek R1 gets 4 out of the 10 public ones and 31% overall on the benchmark.
00:06:17.680 | Claude 3.5 Sonnet gets 5 out of those 10 public questions correct and 41% on the overall benchmark.
00:06:24.520 | Of course, we're going to run O3 Mini on the full benchmark the moment the API becomes
00:06:29.020 | available.
00:06:30.020 | And yes, I know some of you will want to know more about the competition that's ending
00:06:33.680 | in less than 12 hours.
00:06:35.240 | So more on that towards the end of the video.
00:06:37.160 | As the release notes go on though, it does start to feel more like a product release
00:06:41.800 | rather than a research release.
00:06:43.640 | What I mean by that is certain stats are cherry-picked and the language becomes about cost and latency.
00:06:49.680 | You can basically feel the shift within OpenAI from a fully research company to a product
00:06:55.920 | and research company.
00:06:57.240 | Take this human preference evaluation, where the bar for comparison is O1 Mini.
00:07:02.440 | And that happens quite a few times.
00:07:03.840 | What about the win rate versus DeepSeek R1 or Claude 3.5 Sonnet?
00:07:08.200 | On latency or reaction speed, it's great that it's faster than O1 Mini, but is it faster
00:07:13.440 | than Gemini 2.0 Flash?
00:07:14.760 | We don't know.
00:07:15.760 | I do get, by the way, why OpenAI is acting more like a corporation than a research team
00:07:21.440 | these days.
00:07:22.440 | After all, their valuation, according to reports in Bloomberg yesterday, has just doubled.
00:07:27.360 | Now I want you guys to find a quote that I put out towards the end of last year.
00:07:31.720 | I can't find it, but I directly predicted in a video that their valuation would double
00:07:36.360 | from $150 billion.
00:07:38.080 | And I remember saying in 2025, but I don't think I put a date on it.
00:07:42.200 | That, of course, was the fun news for OpenAI, but the system card for O3 Mini contains some
00:07:48.980 | not so fun news.
00:07:50.320 | The TLDR is this, that OpenAI have committed to not releasing publicly or deploying a model
00:07:56.240 | that scores high on their evaluation for risk.
00:07:59.640 | Indeed, O3 Mini is the first model, for example, to reach medium risk on model autonomy.
00:08:05.120 | The model doing things for itself.
00:08:06.880 | Many people are missing this, but OpenAI are publicly warning us that soon the public won't
00:08:12.600 | get access to their latest models.
00:08:14.700 | If a model scores above high, then OpenAI themselves say they won't even work on that
00:08:21.080 | model.
00:08:22.080 | To oversimplify, by the way, the risks are its performance in hacking, persuading people,
00:08:27.400 | advising people on how to make chemical, biological, radiological, nuclear weapons, and improving
00:08:32.120 | itself.
00:08:33.120 | This is my prediction, based on past evidence from Sam Altman and OpenAI.
00:08:37.400 | They will water down these requirements.
00:08:39.920 | Can you imagine if OpenAI have a model that's a "high risk" for persuasion, say, and
00:08:46.640 | improving itself, but not the other categories, and OpenAI don't release it?
00:08:51.520 | Maybe they wouldn't if they were alone, but if DeepSeek is releasing better models
00:08:55.760 | or Meta, would they really hold back on a release?
00:08:58.640 | Dario Amodei, the CEO of Anthropic, is almost openly calling for models to have the autonomy
00:09:04.680 | to self-improve.
00:09:06.040 | Amodei urges the US to prevent China from getting millions of chips because he wants
00:09:10.400 | to increase the likelihood of a unipolar world with the US ahead.
00:09:14.360 | AI companies, he argued, in the US and other democracies must have better models than those
00:09:19.680 | in China if we want to prevail.
00:09:22.220 | The amount of investment in pure capabilities progress feels frankly frenetic at the moment.
00:09:27.880 | Amodei talked about boosting capabilities of models by spending hundreds of millions
00:09:32.040 | or billions just on the reinforcement learning stage.
00:09:35.760 | And to put that in some sense of scale, here is a study done by one of my Patreon members
00:09:40.360 | on AI Insiders.
00:09:41.640 | He used to work at Nous Research, and you can see the sense of scale comparing DeepSeek
00:09:46.400 | R1, this is training cost only by the way, 5 million, and O1 around 15 million.
00:09:52.260 | Of course, Amodei revealed that they spent around, say, 30 million on Claude 3.5 Sonnet,
00:09:57.980 | just the training, not the infrastructure.
00:10:00.220 | Then look at how all of that is simply dwarfed by the models that are coming soon.
00:10:05.020 | Quick side note, Anthropic, by the way, is the company that 18 months ago said, "We
00:10:09.180 | do not wish to advance the rate of AI capabilities progress."
00:10:13.140 | Unfortunately though, according to the O3 mini system card, capabilities progress is
00:10:17.560 | accelerating even in those domains we would rather not it accelerate.
00:10:22.080 | Like helping people craft successful bio-threats.
00:10:25.400 | The base model of O3 mini before safety training, as you can see, scored dramatically better
00:10:30.940 | than other models, across 4 out of the 5 indicators for helping to craft a bio-threat.
00:10:37.280 | Even biology experts found that O3 mini pre-mitigation was better than human experts advising on
00:10:43.700 | bio-risk and better even than browsing with Google.
00:10:46.940 | That used to be the argument about safety, I remember Yann LeCun talking about it, where
00:10:50.580 | he said these models aren't any better than browsing the internet.
00:10:53.560 | Well, pre-safety mitigations, they are now.
00:10:56.620 | Interestingly, O3 mini is pretty bad at politics.
00:11:00.320 | It got smashed by GPT-4o when it came to writing a tweet that would persuade people
00:11:07.100 | politically.
00:11:08.100 | It's a rather interesting and kind of cute personality that O3 mini's got: it's selectively
00:11:12.320 | good at some things and pretty terrible at others.
00:11:14.980 | For example, O3 mini is better than O1 at OpenAI's own research engineer interview
00:11:20.700 | questions.
00:11:21.700 | And significantly so too.
00:11:23.100 | Here is a new metric that I didn't see in any previous OpenAI system card.
00:11:28.140 | Can models replicate OpenAI's own employees' pull request contributions?
00:11:33.300 | Or to oversimplify, can they automate the job of an OpenAI research engineer?
00:11:38.140 | I have actually long wanted a benchmark like this because when they start crushing this
00:11:42.640 | benchmark you know, well I guess the singularity is here.
00:11:47.320 | And it turns out that O3 mini kind of flops on that front.
00:11:51.260 | I am sure of course you're going to see plenty of hype posts about O3 mini, but it
00:11:55.740 | actually gets 0% at this particular benchmark, whereas O1 gets 12%.
00:12:01.500 | O1 can actually match the code quality of 12% of the pull requests submitted by actual
00:12:07.420 | OpenAI engineers.
00:12:08.620 | OpenAI say they suspect O3 mini's low performance is due to poor instruction following
00:12:14.340 | and confusion about specifying tools in the correct format.
00:12:17.700 | I did tell you it's got rather a unique personality.
00:12:21.180 | Overall, for example on agentic tasks, it scores pretty badly, but it's great at creating Bitcoin
00:12:27.300 | wallets.
00:12:28.300 | I think frankly O3 mini just wants to be a crypto hustler.
00:12:31.140 | It wants to pass the interview, do no work and get rich on meme coins.
00:12:35.660 | And speaking of getting rich, you could say, you can still win some Meta Ray-Bans by finishing
00:12:40.480 | first in the SimpleBench evals competition, sponsored by Weights & Biases.
00:12:44.980 | You actually now have slightly less than 10 hours before the competition ends.
00:12:49.500 | And I will have much more to say about this, but the current leading prompt is getting
00:12:54.140 | 18 out of 20.
00:12:56.060 | Obviously if you can get 20 out of 20 that is going to cause quite a stir.
00:13:00.140 | The links are as ever in the description.
00:13:02.700 | So we are now in the era of AI lab CEOs quoting Napoleon.
00:13:08.020 | Sam Altman said a revolution can be neither made nor stopped.
00:13:12.300 | The only thing that can be done is for one of several of its children, I think he's
00:13:16.300 | talking about himself, to give it a direction by dint of victories.
00:13:20.900 | Let me know what you think, but I personally hate the fact that this is now being framed
00:13:25.720 | in terms of a war and an arms race.
00:13:28.260 | Kind of reminds me of the rhetoric before the Vietnam war where it's like we've got
00:13:32.460 | to stop the domino from falling otherwise communism will take over.
00:13:36.300 | My opinion is that the arrival of true artificial intelligence is so epochal for the human species,
00:13:43.060 | so solemn almost, that it shouldn't be reduced to a sense of a human squabble.
00:13:47.460 | I just don't see that ending well.
00:13:50.060 | I guess it's kind of fortunate then that this particular billionaire CEO said he has
00:13:54.340 | no idea what the F is going on.
00:13:56.460 | The flip side of that is there's a lot of bullshit, so AI is no different as an industry.
00:14:03.500 | AI is certainly at a point right now where like, I don't know, 80 to 90 percent of what
00:14:08.500 | is out there, so to speak, is bullshit, and a lot of what people will say or what investors
00:14:15.700 | believe or what other people, if you go to a party, will tell you.
00:14:20.980 | Nobody knows what the F is actually going on, truly.
00:14:24.060 | Nobody knows what is actually going on, but there's a lot of people who would be confident
00:14:28.820 | it's wrong.
00:14:29.820 | And I rather do agree with the perspective of a slightly younger Dario Amodei in 2017.
00:14:37.460 | There's been a lot of talk about the US and China and basically technological races, even
00:14:43.140 | if only economic, between countries, governments, commercial entities in a race to develop more
00:14:50.620 | powerful AI.
00:14:51.620 | And I think, you know, the message I want to give is that it's very important that as
00:14:55.980 | those races happen, we're very mindful of the fact that that can create the perfect
00:15:00.000 | storm for safety catastrophes to happen, that if we're racing really hard to do something
00:15:05.660 | and maybe even some of those things are adversarial, that creates exactly the conditions under
00:15:10.620 | which something can happen that not only our adversary doesn't want to happen, but we don't
00:15:14.860 | want to happen either.
00:15:16.740 | As ever though, thank you so much for watching to the end and have a wonderful day.