Apple’s ‘AI Can’t Reason’ Claim Seen By 13M+, What You Need to Know

Chapters
0:00 Introduction
0:57 Viral Post + Headlines
1:42 Apple Paper Analysis
8:34 But they do Hallucinate
10:43 Not Supercomputers
11:18 o3 Pro and Recommendations
Almost no one has the time to investigate headlines like this one, seen by tens of 00:00:04.820 |
millions of people: "The AI models don't actually reason at all. They just memorize patterns. AGI 00:00:09.860 |
is mostly hype." And even the underlying Apple paper quoted says it's an "illusion of thinking." 00:00:15.780 |
This was picked up in mainstream outlets like The Guardian, which quoted it as being a pretty 00:00:21.400 |
devastating Apple paper. So what are people supposed to believe when half the headlines 00:00:25.800 |
are about an imminent AI job apocalypse, and the other half are about LLMs all being fake? 00:00:31.960 |
Well, hopefully you'll find that I'm not trying to sell a narrative. I'll just say what I found, 00:00:36.520 |
having read the 30-page paper in full and the surrounding analyses. I'll also end with a 00:00:42.560 |
recommendation on which model you should use, and yes, touch on the brand new o3 Pro from OpenAI. 00:00:49.580 |
Although I would say that the $200 price per month to access that model is not for the unwashed 00:00:55.400 |
masses like you guys. Some very quick context on why a post like this one gets tens of millions of 00:01:01.080 |
views and coverage in the mainstream media, and no, it's not just because of the unnecessarily 00:01:06.160 |
frantic "breaking" at the start. It's also because people hear the claims made by the CEOs of these companies: 00:01:14.760 |
"Humanity is close to building digital superintelligence. We're past the event horizon. 00:01:19.800 |
The takeoff has started." While the definitions of those terms are deliberately vague, you can understand 00:01:25.160 |
people paying attention. People can see for themselves how quickly large language models 00:01:29.020 |
are improving, and they can read the headlines generated by the CEO of Anthropic saying, 00:01:34.580 |
there is a white-collar bloodbath coming. It's almost every week now that we get headlines like 00:01:38.800 |
this one in the New York Times, so it's no wonder people are paying attention. Now, some would say 00:01:43.080 |
cynically that Apple seem to be producing more papers, quote, debunking AI than actually improving AI, 00:01:49.100 |
but let's set that cynicism aside. The paper essentially claimed that large language models don't follow 00:01:54.420 |
explicit algorithms and struggle with puzzles when there are sufficient degrees of complexity. 00:02:00.460 |
Puzzles like the Tower of Hanoi challenge, where you've got to move a tower of discs from one place 00:02:06.440 |
to another, but never place a larger disc atop a smaller one. They also tested the models on games 00:02:11.740 |
like checkers, where you've got to move the blue tokens all the way to the right and the red tokens to 00:02:16.940 |
the left, following the rules of checkers. And games like River Crossing, which might be more familiar to you 00:02:21.820 |
as the fox and chicken challenge, where you've got to go to the other side of the river without leaving 00:02:25.900 |
the fox with the chicken. All of these games, of course, can be and were scaled up in complexity the 00:02:31.400 |
more pieces you introduce. If models were a pre-programmed set of algorithms like a calculator, 00:02:37.260 |
then it shouldn't matter how many discs or checkers or blocks you have, performance should be 100% 00:02:42.360 |
all the time. Shocker, the paper showed that they're not that and performance dropped off noticeably the 00:02:48.160 |
more complex the task got. But this has been known for years now about large language models. They're 00:02:53.200 |
not traditional software, where the same input always leads to the same output. Nor, of course, 00:02:58.340 |
are they fully randomised either, otherwise they couldn't pass a single benchmark. They are probabilistic 00:03:03.780 |
neural networks, somewhere in between the two extremes. And the perfect example comes with multiplication. 00:03:09.400 |
Again, I could have added breaking to the title of this video, but this has been known about for 00:03:15.280 |
several years now. If you don't give models access to any tools and ask them to perform a multiplication, 00:03:20.620 |
then the moment the digits of the multiplication get too large, they start to fail dramatically. 00:03:26.880 |
Not occasionally getting it right, just never getting the sum right. If the number of digits is 00:03:31.980 |
small enough, the models can reason their way to the correct answer. And as you can see in the difference 00:03:36.820 |
between o1-mini from OpenAI and o3-mini, performance is incrementally improving. In other words, it takes 00:03:43.480 |
a bigger number of digits to flummox the latest models. But again, it must be emphasised that even 00:03:49.540 |
with the very latest, the very best models you can access, if you don't give them tools, they will 00:03:54.860 |
eventually reach a point where they just simply can't multiply two numbers. But this will always be the 00:03:59.680 |
case because these models aren't designed to be fully predictable. They're designed to be generative. 00:04:04.420 |
They're not designed to be software, they're designed to use software. They want to produce 00:04:09.180 |
plausible outputs, which is why they'll hallucinate when you ask them questions they can't handle. 00:04:14.320 |
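If you want to see that drop-off for yourself, here is a rough sketch of the kind of harness you could use. To be clear, ask_model below is a hypothetical placeholder, not a real API call; you would wire it up to whichever chat model you have access to, with tools switched off:

```python
import random

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder for a real chat-model call with tools disabled."""
    raise NotImplementedError("wire this up to your chosen chat API")

def probe(digits: int, trials: int = 20) -> float:
    """Fraction of n-digit multiplications the model gets exactly right."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = ask_model(f"What is {a} * {b}? Reply with only the number.")
        # Python's integers are exact, so a * b is the ground truth.
        if reply.strip().replace(",", "") == str(a * b):
            correct += 1
    return correct / trials

# Expect near-100% at 2-3 digits and a sharp collapse somewhere past ~8-10 digits.
# for d in range(2, 16):
#     print(d, probe(d))
```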
Here, for example, I gave a calculation to Claude 4 Opus, the latest model from Anthropic, 00:04:18.760 |
and Gemini 2.5 Pro, the latest model from Google DeepMind, but I didn't give them access to tools. 00:04:23.780 |
They were never going to get this right, but rather than say, I don't know, they just hallucinated the 00:04:29.020 |
answer in both cases. The funny thing was that these answers were plausible in that they ended 00:04:33.980 |
in 2s and began with 6-7, which the correct answer does. These models are, after all, very convincing 00:04:39.840 |
BSers. But what the paper ignored is that these models can use tools and use them very effectively. 00:04:45.400 |
Here's that same Claude 4 Opus, but this time allowed to use code. It got the answer right, 00:04:50.960 |
and notice I didn't even say use code or use a tool. It knew to do so. 00:04:57.220 |
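For what it's worth, the reason "use code" works is mundane: the snippet the model writes runs as ordinary software, and software does exact arithmetic. A minimal sketch of the idea, with made-up numbers rather than the ones from my prompt:

```python
# The kind of code a model might hand off to a Python tool: ordinary software,
# and Python integers are arbitrary-precision, so the product is exact
# no matter how many digits are involved.
# (Operands invented purely for illustration.)
a = 987_654_321_234_567
b = 876_543_219_876_543
print(a * b)  # exact every time, unlike a free-form text guess
```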
So for me, what was surprising is that this Apple paper found it surprising that large reasoning models, as they call them, 00:05:02.840 |
can't perform exact computation. We know they can't. Now, several other people before me have pointed 00:05:07.920 |
out another fatal weakness with the paper, which is that they describe accuracy ultimately collapsing 00:05:12.960 |
towards zero beyond a certain level of complexity. That's because models are constrained in how many tokens, 00:05:18.420 |
or parts of a word, if you like, that they can output in one go. In the case of the Claude model 00:05:23.640 |
from Anthropic that was tested, that token limit is 128,000 tokens. But some of the questions tested 00:05:29.300 |
required more than that number of tokens. So even if the models were trained to be calculators, 00:05:35.280 |
which they're not, they weren't given enough space to output the requisite number of tokens. 00:05:40.640 |
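To make that concrete, here is a small sketch of the arithmetic. The optimal Tower of Hanoi solution needs 2^n - 1 moves; the tokens-per-move figure below is my own rough assumption for illustration, not a number from the paper:

```python
def hanoi_moves(n: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n discs."""
    return 2 ** n - 1

TOKENS_PER_MOVE = 10      # rough assumption, purely for illustration
TOKEN_CAP = 128_000       # the output limit quoted above for the tested Claude model

for n in range(8, 21):
    tokens = hanoi_moves(n) * TOKENS_PER_MOVE
    verdict = "fits" if tokens <= TOKEN_CAP else "over the cap"
    print(f"{n} discs: {hanoi_moves(n):>9,} moves ~ {tokens:>9,} tokens -> {verdict}")

# Under these assumptions, somewhere around 14 discs the move list alone
# would blow past the output limit, before any reasoning is written down.
```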
For me then, it's to the credit of the models that they recognised their own output limits, 00:05:45.520 |
and then outputted what the paper calls shorter traces, basically giving up, because they, quote, 00:05:50.200 |
knew they wouldn't have the space to output the required answer. Instead, the models would output 00:05:55.540 |
things like, here is the algorithm you need to use, or the tool you need to use, which I think is 00:06:01.260 |
reasonable. One quick detail that I think many people missed is the paper actually admits that 00:06:05.100 |
it originally wanted to compare thinking versus non-thinking models. You know, the ones that output 00:06:09.740 |
long chains of thought versus those that don't, on math benchmarks. Because the results didn't quite 00:06:14.860 |
conform to the narrative they were expecting, and thinking models did indeed outperform non-thinking 00:06:20.260 |
models with the same compute budget, they actually abandoned the math benchmark and then resorted to 00:06:25.300 |
the puzzles. I guess what I'm saying is I slightly feel like the authors came to testing the thinking 00:06:31.000 |
models with a preconceived notion about their lack of ability. Another learning moment for us all from the 00:06:36.720 |
paper comes from the Apple authors' surprise that when they provide the algorithm in the prompt, 00:06:44.580 |
the algorithm to solve these puzzles, the models still often fail. They're surprised by this and deem it 00:06:50.680 |
noteworthy because they say surely finding the solution requires more computation than merely executing a given 00:06:58.640 |
algorithm. But you guys have twigged this all by now. These are not calculators. They're not designed for 00:07:04.560 |
executing algorithms. Because they are instead neural networks that are probabilistic, even if there is a 00:07:10.380 |
99.9% chance that they output the correct next step, when there's millions of steps involved, they will 00:07:16.440 |
eventually make a mistake. Remember multiplication, where of course the language models know, quote, the 00:07:21.640 |
algorithm to perform a multiplication step. Indeed, the models themselves are built on matrix multiplication, but that 00:07:27.740 |
does not mean that, given enough required steps, they won't start making mistakes. 00:07:33.560 |
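To put a rough number on that compounding risk, here is a back-of-the-envelope sketch; the 99.9% figure is just the illustrative per-step accuracy mentioned above, not a measured value:

```python
# Even at 99.9% per-step accuracy, a long enough chain almost guarantees a slip.
p_step = 0.999
for steps in (100, 1_000, 10_000, 1_000_000):
    p_perfect = p_step ** steps
    print(f"{steps:>9,} steps -> P(no mistakes) = {p_perfect:.3e}")
# ~0.905 at 100 steps, ~0.368 at 1,000, ~4.5e-05 at 10,000, effectively 0 at 1,000,000
```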
The conclusion of the paper then teed things up for the headline writers because they say, "we may be encountering fundamental 00:07:39.080 |
barriers to generalizable reasoning." Now do forgive me for pointing this out, but that limitation to 00:07:44.960 |
generalizable reasoning has been pointed out by experts like Professor Rao, who I interviewed back in 00:07:51.020 |
December of 2023 on my Patreon. This is not a "breaking news" type of situation. You may also find it 00:07:57.700 |
interesting that one researcher used Claude 4 Opus and named that model as a co-author in a paper pointing 00:08:05.220 |
out all the flaws of the Apple paper. Flaws even I missed, like certain of the questions being logically 00:08:11.640 |
impossible to solve as posed. So no, to quote an article featured in The Guardian from Gary 00:08:18.920 |
Marcus, the tech world is not reeling from a paper that shows the powers of the new generation of AI have been 00:08:24.440 |
wildly oversold. I would go as far as to say that there isn't a single serious AI researcher that would 00:08:29.880 |
have been surprised by the results of that paper. That is not of course to say that these models don't 00:08:35.020 |
make basic reasoning mistakes in simple scenarios, or at least semi-simple scenarios. I'm the author of a 00:08:41.480 |
benchmark called SimpleBench designed to test models on such scenarios. For example, I tested the brand new 00:08:48.020 |
o3 Pro on this scenario in which models tend not to spot that the glove would just fall onto the road. This is 00:08:55.780 |
despite, by the way, thinking for 18 minutes. If you want to learn more about SimpleBench, by the way, the link is in the 00:09:01.180 |
description. And I'll end this video with my recommendation of which model you should check out if you're just used to, say, 00:09:07.460 |
the free ChatGPT. The o3 Pro API from OpenAI failed, by the way, which is why we don't have a result yet for that model. 00:09:15.060 |
Of course, the failure modes go far beyond the simple scenarios featured in SimpleBench. Here's one quirk many of you may not know about. 00:09:21.860 |
This is the brand new Veo 3 from Google Gemini. And I said, "Output a London scene with absolutely zero lampposts, 00:09:29.460 |
not a single lamppost in sight." Of course, if you lean in to the hallucinations of generative models, 00:09:37.060 |
you can get creative outputs like this one from Veo 3. 00:09:46.740 |
Now obviously this is a quirky advert, but think of the immense amount of money that the company "saved" 00:09:53.700 |
by not using actors, sets, or props for this advert. That's why I don't want you guys to be as shocked 00:09:59.620 |
as this Sky News presenter in Britain, and this has got hundreds of thousands of views, where he noticed 00:10:05.220 |
ChatGPT hallucinating an answer, in this case a transcript. This generated several news segments, 00:10:11.220 |
as well as this article saying, "Can we trust ChatGPT despite it hallucinating answers?" 00:10:16.180 |
This all then comes back to us being able to hold two thoughts in our head at the same time, 00:10:20.740 |
which is that LLMs are swiftly catching up on human performance across almost all text-based domains. 00:10:26.740 |
But they have almost no hesitation in generating mistruths, you could say like many humans. 00:10:34.020 |
So if human performance is your yardstick, they are catching up fast and can BS like the best of us. 00:10:40.580 |
But language models like ChatGPT, Gemini, and Claude are not supercomputers, they're not the kind of 00:10:46.020 |
AI that can, for example, predict the weather. Their real breakthroughs, as with human breakthroughs, 00:10:50.820 |
come when they use tools in an environment that corrects their BS for them. That can lead to genuine 00:10:57.220 |
scientific advance, and if you want to learn more about that, check out my AlphaEvolve video. 00:11:01.460 |
Frankly, having made that video, I was quite surprised to hear Sam Altman say it's going to be 2026 when we 00:11:07.540 |
see the arrival of systems that can figure out novel insights. As far as I'm concerned, we have that 00:11:12.580 |
now. Again, not LLMs on their own, but LLMs in combination with symbolic systems. 00:11:18.660 |
So, while language models can't yet solo superintelligence, which one should you use? 00:11:23.620 |
Well, let me give you one cautionary word on benchmarks and a little bit of advice. 00:11:28.580 |
In just the last 48 hours, we got o3 Pro from OpenAI, at the $200 tier. I'm sure though that it will 00:11:35.540 |
eventually filter down to the $20 tier. And of course, the benchmark results were pretty impressive. 00:11:41.300 |
On competition level mathematics, 93%, on really hard PhD level science questions, 84%, and competitive 00:11:48.820 |
coding, you can see the Elo ranking here. My cautionary note though comes from the results you can see below 00:11:54.740 |
for the o3 model, not o3 Pro, but the o3 model that OpenAI showcased on day 12 of its '12 Days of OpenAI' event in December 00:12:02.660 |
2024. As you can see, today's o3 Pro mostly underperforms that system teased back in December. 00:12:09.620 |
So that's the cautionary note that you often have to look beyond the headline benchmark results to see 00:12:15.140 |
how these models perform on your use case. The word of advice is that when you're looking at benchmarks, 00:12:20.020 |
companies will often either not compare to other model providers at all, as in the case of OpenAI these 00:12:26.180 |
days, or like Anthropic with their Claude series of models, they will show you multiple benchmarks, 00:12:31.380 |
but not be terribly clear about the multiple parallel attempts they took to get their record high scores, 00:12:37.220 |
or about the serious usage limitations they have for their bigger model, or the massively elevated price 00:12:43.780 |
for that model. Which brings me to my current recommendation if you just want to use a model for 00:12:48.580 |
free, albeit with caps of course, and that would be Google's Gemini 2.5 Pro. Yes, I am slightly 00:12:54.580 |
influenced by its top score on SimpleBench, and the fact you get a few uses of the Veo video generator 00:13:00.900 |
model. An honorary mention goes to DeepSeek R1, which is very cheap via the API, and at least comes with a 00:13:07.780 |
technical report that we can all read through. Many of you commented this month that you saw a pretty 00:13:12.820 |
noticeable boost in production quality for my DeepSeek documentary, and there's more to come 00:13:18.580 |
where that came from. But that boost was in no small part due to my video editor choosing Storyblocks, 00:13:24.980 |
the sponsors of today's video. We picked them actually before any sponsorship, partly due to the unlimited 00:13:31.620 |
downloads of varied high quality media at their set subscription cost, but partly due to the clear-cut 00:13:38.420 |
licensing, wherein anything we downloaded with Storyblocks was 100% royalty-free. If you want to 00:13:45.140 |
get started with unlimited stock media downloads, head to storyblocks.com/aiexplained, link in the description. 00:13:52.500 |
I hope that helped give you some signal amongst the noise, but either way, I hope you have a very wonderful day.