
Apple’s ‘AI Can’t Reason’ Claim Seen By 13M+, What You Need to Know


Chapters

0:00 Introduction
0:57 Viral Post + Headlines
1:42 Apple Paper Analysis
8:34 But they do Hallucinate
10:43 Not Supercomputers
11:18 o3 Pro and Recommendations


00:00:00.000 | Almost no one has the time to investigate headlines like this one, seen by tens of
00:00:04.820 | millions of people. The AI models don't actually reason at all. They just memorize patterns. AGI
00:00:09.860 | is mostly hype, and even the underlying Apple paper quoted says it's an illusion of thinking.
00:00:15.780 | This was picked up in mainstream outlets like The Guardian, which quoted it as being a pretty
00:00:21.400 | devastating Apple paper. So what are people supposed to believe when half the headlines
00:00:25.800 | are about an imminent AI job apocalypse, and the other half are about LLMs all being fake?
00:00:31.960 | Well, hopefully you'll find that I'm not trying to sell a narrative. I'll just say what I found,
00:00:36.520 | having read the 30-page paper in full and the surrounding analyses. I'll also end with a
00:00:42.560 | recommendation on which model you should use, and yes, touch on the brand new O3 Pro from OpenAI.
00:00:49.580 | Although I would say that the $200 price per month to access that model is not for the unwashed
00:00:55.400 | masses like you guys. Some very quick context on why a post like this one gets tens of millions of
00:01:01.080 | views and coverage in the mainstream media, and no, it's not just because of the unnecessarily
00:01:06.160 | frantic "breaking" at the start. It's also because people hear the claims made by the CEOs of these
00:01:12.220 | AI labs, like Sam Altman yesterday, posting,
00:01:14.760 | "Humanity is close to building digital superintelligence. We're past the event horizon.
00:01:19.800 | The takeoff has started." While the definitions of those terms are deliberately vague, you can understand
00:01:25.160 | people paying attention. People can see for themselves how quickly large language models
00:01:29.020 | are improving, and they can read the headlines generated by the CEO of Anthropic saying,
00:01:34.580 | there is a white-collar bloodbath coming. It's almost every week now that we get headlines like
00:01:38.800 | this one in the New York Times, so it's no wonder people are paying attention. Now, some would say
00:01:43.080 | cynically that Apple seem to be producing more papers, quote, debunking AI than actually improving AI,
00:01:49.100 | but let's set that cynicism aside. The paper essentially claimed that large language models don't follow
00:01:54.420 | explicit algorithms and struggle with puzzles when there are sufficient degrees of complexity.
00:02:00.460 | Puzzles like the Tower of Hanoi challenge, where you've got to move a tower of discs from one place
00:02:06.440 | to another, but never place a larger disc atop a smaller one. They also tested the models on games
00:02:11.740 | like checkers, where you've got to move the blue tokens all the way to the right and the red tokens to
00:02:16.940 | the left, following the rules of checkers. And games like River Crossing, which might be more familiar to you
00:02:21.820 | as the fox and chicken challenge, where you've got to go to the other side of the river without leaving
00:02:25.900 | the fox with the chicken. All of these games, of course, can be and were scaled up in complexity the
00:02:31.400 | more pieces you introduce. If models were a pre-programmed set of algorithms like a calculator,
00:02:37.260 | then it shouldn't matter how many discs or checkers or blocks you have: performance should be 100%
00:02:42.360 | all the time. Shocker, the paper showed that they're not that and performance dropped off noticeably the
00:02:48.160 | more complex the task got. But this has been known for years now about large language models. They're
00:02:53.200 | not traditional software, where the same input always leads to the same output. Nor, of course,
00:02:58.340 | are they fully randomised either, otherwise they couldn't pass a single benchmark. They are probabilistic
00:03:03.780 | neural networks, somewhere in between the two extremes.
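To make "probabilistic, somewhere in between the two extremes" concrete, here is a minimal, illustrative sketch of a single next-token step. The candidate tokens and logit scores are invented purely for illustration; they are not from the Apple paper or any real model.

```python
import math, random

# Toy version of one next-token step. The candidate tokens and scores below
# are made up for illustration; a real model scores tens of thousands of
# tokens with learned weights.
logits = {"42": 3.2, "41": 1.1, "43": 0.9, "banana": -2.0}

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

probs = softmax(logits)
print(probs)  # "42" is overwhelmingly likely, but not guaranteed

# Because the next token is *sampled*, the same input can give different
# outputs on different runs: neither fully deterministic nor fully random.
print([random.choices(list(probs), weights=list(probs.values()))[0] for _ in range(10)])
```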
00:03:09.400 | And the perfect example comes with multiplication. Again, I could have added "breaking" to the title of this video, but this has been known about for
00:03:15.280 | several years now. If you don't give models access to any tools and ask them to perform a multiplication,
00:03:20.620 | then the moment the digits of the multiplication get too large, they start to fail dramatically. Not occasionally getting it right, just never getting the sum right.
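If you want to see that drop-off for yourself, here is a minimal sketch of the usual measurement. The `ask_model` helper is a hypothetical placeholder standing in for whichever chat API you use, called with tools disabled; the digit sizes and trial counts are arbitrary choices for illustration, not the setup from any particular paper.

```python
import random

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder: replace with a real chat-completion call, tools disabled.
    Returning an empty string here just keeps the sketch runnable end to end."""
    return ""

def accuracy_at(n_digits: int, trials: int = 20) -> float:
    """Fraction of trials where the model reproduces the exact product of two n-digit numbers."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    correct = 0
    for _ in range(trials):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        reply = ask_model(f"What is {a} * {b}? Reply with the number only.")
        correct += int(reply.strip() == str(a * b))  # Python ints are exact at any size
    return correct / trials

for n in (2, 4, 8, 12, 16):
    print(n, "digits:", accuracy_at(n))  # expect accuracy to fall sharply as n grows
```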
00:03:26.880 | If the number of digits is
00:03:31.980 | small enough, the models can reason their way to the correct answer. And as you can see in the difference
00:03:36.820 | between o1-mini from OpenAI and o3-mini, performance is incrementally improving. In other words, it takes
00:03:43.480 | a bigger number of digits to flummox the latest models. But again, it must be emphasised that even
00:03:49.540 | with the very latest, the very best models you can access, if you don't give them tools, they will
00:03:54.860 | eventually reach a point where they just simply can't multiply two numbers. But this will always be the
00:03:59.680 | case because these models aren't designed to be fully predictable. They're designed to be generative.
00:04:04.420 | They're not designed to be software, they're designed to use software. They want to produce
00:04:09.180 | plausible outputs, which is why they'll hallucinate when you ask them questions they can't handle.
00:04:14.320 | Here, for example, I gave a calculation to Claude 4 Opus, the latest model from Anthropic,
00:04:18.760 | and Gemini 2.5 Pro, the latest model from Google DeepMind, but I didn't give them access to tools.
00:04:23.780 | They were never going to get this right, but rather than say, I don't know, they just hallucinated the
00:04:29.020 | answer in both cases. The funny thing was that these answers were plausible in that they ended
00:04:33.980 | in 2s and began with 6-7, which the correct answer does. These models are, after all, very convincing
00:04:39.840 | BSers. But what the paper ignored is that these models can use tools and use them very effectively.
00:04:45.400 | Here's that same Claude 4 Opus, but this time allowed to use code. It got the answer right,
00:04:50.960 | and notice I didn't even say use code or use a tool. It knew to do so. So for me, what was surprising
00:04:57.220 | is that this Apple paper found it surprising that large reasoning models, they call them,
00:05:02.840 | can't perform exact computation. We know they can't. Now, several other people before me have pointed
00:05:07.920 | out another fatal weakness with the paper, which is that they describe accuracy ultimately collapsing
00:05:12.960 | towards zero beyond a certain level of complexity. The reason is that models are constrained in how many tokens,
00:05:18.420 | or parts of a word, if you like, they can output in one go. In the case of the Claude model
00:05:23.640 | from Anthropic that was tested, that token limit is 128,000 tokens. But some of the questions tested
00:05:29.300 | required more than that number of tokens. So even if the models were trained to be calculators,
00:05:35.280 | which they're not, they weren't given enough space to output the requisite number of tokens.
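To see why the token budget bites, here is a sketch using the textbook recursive Tower of Hanoi solution: the minimum number of moves is 2^n - 1, so the full move list grows exponentially with the number of discs. The tokens-per-move figure below is my own rough assumption for a sense of scale, not a number from the paper.

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Textbook recursive Tower of Hanoi; returns the full list of moves."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
        return moves
    hanoi(n - 1, src, aux, dst, moves)  # shift the n-1 smaller discs out of the way
    moves.append((src, dst))            # move the largest disc
    hanoi(n - 1, aux, dst, src, moves)  # stack the smaller discs back on top
    return moves

TOKENS_PER_MOVE = 5  # assumed rough figure, purely to get a sense of scale
for n in (5, 10, 15, 20):
    m = len(hanoi(n))  # always 2**n - 1
    print(f"{n} discs: {m:,} moves, roughly {m * TOKENS_PER_MOVE:,} output tokens just to list them")
```

Under that rough assumption, the move list alone for 15 discs is already past a 128,000-token output cap, before the model has written a single word of reasoning.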
00:05:40.640 | For me then, it's to the credit of the models that they recognised their own output limits,
00:05:45.520 | and then outputted what the paper calls shorter traces, basically giving up, because they, quote,
00:05:50.200 | knew they wouldn't have the space to output the required answer. Instead, the models would output
00:05:55.540 | things like, here is the algorithm you need to use, or the tool you need to use, which I think is
00:06:01.260 | reasonable. One quick detail that I think many people missed is the paper actually admits that
00:06:05.100 | it originally wanted to compare thinking versus non-thinking models. You know, the ones that output
00:06:09.740 | long chains of thought versus those that don't, on math benchmarks. Because the results didn't quite
00:06:14.860 | conform to the narrative they were expecting, and thinking models did indeed outperform non-thinking
00:06:20.260 | models with the same compute budget, they actually abandoned the math benchmark and then resorted to
00:06:25.300 | the puzzles. I guess what I'm saying is I slightly feel like the authors came to testing the thinking
00:06:31.000 | models with a preconceived notion about their lack of ability. Another learning moment for us all from the
00:06:36.720 | paper comes from their surprise, the Apple authors, that when they provide the algorithm in the prompt,
00:06:44.580 | the algorithm to solve these puzzles, the models still often fail. They're surprised by this and deem it
00:06:50.680 | noteworthy because they say surely finding the solution requires more computation than merely executing a given
00:06:58.640 | algorithm. But you guys have twigged this all by now. These are not calculators. They're not designed for
00:07:04.560 | executing algorithms. Because they are instead neural networks that are probabilistic, even if there is a
00:07:10.380 | 99.9% chance that they output the correct next step, when there's millions of steps involved, they will
00:07:16.440 | eventually make a mistake. Remember multiplication, where of course the language models know the, quote,
00:07:21.640 | "algorithm" to perform a multiplication step. Indeed, the models are derived through matrix multiplication, but that
00:07:27.740 | does not mean that, given enough steps, they won't start making mistakes.
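A quick back-of-envelope calculation makes the point; the per-step success rate and step counts below are illustrative, not measured values.

```python
# Chance that a process which is right 99.9% of the time per step gets every
# single step right, assuming the steps are independent.
p_step = 0.999
for steps in (10, 100, 1_000, 10_000, 100_000):
    print(f"{steps:>7} steps: {p_step ** steps:.3g} chance of a flawless run")
# Roughly 0.37 by 1,000 steps, and effectively zero by 100,000 steps.
```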
00:07:33.560 | The conclusion of the paper then teed things up for the headline writers because they say, we may be encountering fundamental
00:07:39.080 | barriers to generalizable reasoning. Now do forgive me for pointing this out, but that quoted limitation to
00:07:44.960 | generalizable reasoning has been pointed out by experts like Professor Rao, who I interviewed back in
00:07:51.020 | December of 2023 on my Patreon. This is not a quote breaking news type of situation. You may also find it
00:07:57.700 | interesting that one researcher used Claude 4 Opus and named that model as a co-author in a paper pointing
00:08:05.220 | out all the flaws of the Apple paper. Flaws even I missed, like certain of the questions being logically
00:08:11.640 | impossible to solve. So no, to quote an article featured in The Guardian from Gary
00:08:18.920 | Marcus, the tech world is not reeling from a paper that shows the powers of the new generation of AI have been
00:08:24.440 | wildly oversold. I would go as far as to say that there isn't a single serious AI researcher that would
00:08:29.880 | have been surprised by the results of that paper. That is not of course to say that these models don't
00:08:35.020 | make basic reasoning mistakes in simple scenarios, or at least semi-simple scenarios. I'm the author of a
00:08:41.480 | benchmark called SimpleBench designed to test models on such scenarios. For example, I tested the brand new
00:08:48.020 | O3 Pro on this scenario in which models tend not to spot that the glove would just fall onto the road. This is
00:08:55.780 | despite, by the way, thinking for 18 minutes. If you want to learn more about SimpleBench, by the way, the link is in the
00:09:01.180 | description. And I'll end this video with my recommendation of which model you should check out if you're just used to, say, the
00:09:07.460 | free ChatGPT. The O3 Pro API from OpenAI failed, by the way, which is why we don't have a result yet for that model.
00:09:15.060 | Of course, the failure modes go far beyond the simple scenarios featured in SimpleBench. Here's one quirk many of you may not know about.
00:09:21.860 | This is the brand new Veo 3 from Google Gemini. And I said, "Output a London scene with absolutely zero lampposts,
00:09:29.460 | not a single lamppost in sight." Of course, if you lean into the hallucinations of generative models,
00:09:37.060 | you can get creative outputs like this one from Veo 3.
00:09:46.740 | Now obviously this is a quirky advert, but think of the immense amount of money that the company "saved"
00:09:53.700 | by not using actors, sets, or props for this advert. That's why I don't want you guys to be as shocked
00:09:59.620 | as this Sky News presenter in Britain, and this has got hundreds of thousands of views, where he noticed
00:10:05.220 | ChatGPT hallucinating an answer, in this case a transcript. This generated several news segments,
00:10:11.220 | as well as this article saying, "Can we trust ChatGPT despite it hallucinating answers?"
00:10:16.180 | This all then comes back to us being able to hold two thoughts in our head at the same time,
00:10:20.740 | which is that LLMs are swiftly catching up with human performance across almost all text-based domains.
00:10:26.740 | But they have almost no hesitation in generating mistruths, you could say like many humans.
00:10:34.020 | So if human performance is your yardstick, they are catching up fast and can BS like the best of us.
00:10:40.580 | But language models like ChatGPT, Gemini, and Claude are not supercomputers; they're not the kind of
00:10:46.020 | AI that can, for example, predict the weather. Their real breakthroughs, as with human breakthroughs,
00:10:50.820 | come when they use tools in an environment that corrects their BS for them. That can lead to genuine
00:10:57.220 | scientific advance, and if you want to learn more about that, check out my Alpha Evolve video.
00:11:01.460 | Frankly, having made that video, I was quite surprised to hear Sam Altman say it's going to be 2026 when we
00:11:07.540 | see the arrival of systems that can figure out novel insights. As far as I'm concerned, we have that
00:11:12.580 | now. Again, not LLMs on their own, but LLMs in combination with symbolic systems.
00:11:18.660 | So, while language models can't yet solo superintelligence, which one should you use?
00:11:23.620 | Well, let me give you one cautionary word on benchmarks and a little bit of advice.
00:11:28.580 | In just the last 48 hours, we got O3 Pro from OpenAI, at the $200 tier. I'm sure though that will
00:11:35.540 | eventually filter down to the $20 tier. And of course, the benchmark results were pretty impressive.
00:11:41.300 | On competition-level mathematics, 93%, on really hard PhD-level science questions, 84%, and competitive
00:11:48.820 | coding, you can see the Elo ranking here. My cautionary note though comes from the results you can see below
00:11:54.740 | for the O3 model, not O3 Pro, but the O3 model that OpenAI showcased on day 12 of Christmas in December
00:12:02.660 | 2024. As you can see, today's O3 Pro mostly underperforms that system teased back in December.
00:12:09.620 | So that's the cautionary note that you often have to look beyond the headline benchmark results to see
00:12:15.140 | how these models perform on your use case. The word of advice is that when you're looking at benchmarks,
00:12:20.020 | companies will often either not compare to other model providers at all, as in the case of OpenAI these
00:12:26.180 | days, or like Anthropic with their Claude series of models, they will show you multiple benchmarks,
00:12:31.380 | but not be terribly clear about the multiple parallel attempts they took to get their record high scores,
00:12:37.220 | or about the serious usage limitations they have for their bigger model, or the massively elevated price
00:12:43.780 | for that model. Which brings me to my current recommendation if you just want to use a model for
00:12:48.580 | free, albeit with caps of course, and that would be Google's Gemini 2.5 Pro. Yes, I am slightly
00:12:54.580 | influenced by its top score on SimpleBench, and the fact you get a few uses of the Veo video generator
00:13:00.900 | model. An honourable mention goes to DeepSeek R1, which is very cheap via the API, and at least comes with a
00:13:07.780 | technical report that we can all read through. Many of you commented this month that you saw a pretty
00:13:12.820 | noticeable boost in production quality for my DeepSeek documentary, and there's more to come
00:13:18.580 | where that came from. But that boost was in no small part due to my video editor choosing Storyblocks,
00:13:24.980 | the sponsors of today's video. We picked them actually before any sponsorship, partly due to the unlimited
00:13:31.620 | downloads of varied high quality media at their set subscription cost, but partly due to the clear-cut
00:13:38.420 | licensing, wherein anything we downloaded with Storyblocks was 100% royalty-free. If you want to
00:13:45.140 | get started with unlimited stock media downloads, head to storyblocks.com/aiexplained, link in the description.
00:13:52.500 | I hope that helped give you some signal amongst the noise, but either way, I hope you have a very wonderful day.