Apple’s ‘AI Can’t Reason’ Claim Seen By 13M+, What You Need to Know

Chapters
0:00 Introduction
0:57 Viral Post + Headlines
1:42 Apple Paper Analysis
8:34 But they do Hallucinate
10:43 Not Supercomputers
11:18 o3 Pro and Recommendations
Almost no one has the time to investigate headlines like this one, seen by tens of 00:00:04.820 |
millions of people: "The AI models don't actually reason at all. They just memorize patterns. AGI 00:00:09.860 |
is mostly hype." And even the underlying Apple paper quoted says it's an "illusion of thinking." 00:00:15.780 |
This was picked up in mainstream outlets like The Guardian, which quoted it as being a pretty 00:00:21.400 |
devastating Apple paper. So what are people supposed to believe when half the headlines 00:00:25.800 |
are about an imminent AI job apocalypse, and the other half are about LLMs all being fake? 00:00:31.960 |
Well, hopefully you'll find that I'm not trying to sell a narrative. I'll just say what I found, 00:00:36.520 |
having read the 30-page paper in full and the surrounding analyses. I'll also end with a 00:00:42.560 |
recommendation on which model you should use, and yes, touch on the brand new o3 Pro from OpenAI. 00:00:49.580 |
Although I would say that the $200 price per month to access that model is not for the unwashed 00:00:55.400 |
masses like you guys. Some very quick context on why a post like this one gets tens of millions of 00:01:01.080 |
views and coverage in the mainstream media, and no, it's not just because of the unnecessarily 00:01:06.160 |
frantic "breaking" at the start. It's also because people hear the claims made by the CEOs of these companies: 00:01:14.760 |
"Humanity is close to building digital superintelligence. We're past the event horizon. 00:01:19.800 |
The takeoff has started." While the definitions of those terms are deliberately vague, you can understand 00:01:25.160 |
people paying attention. People can see for themselves how quickly large language models 00:01:29.020 |
are improving, and they can read the headlines generated by the CEO of Anthropic saying, 00:01:34.580 |
there is a white-collar bloodbath coming. It's almost every week now that we get headlines like 00:01:38.800 |
this one in the New York Times, so it's no wonder people are paying attention. Now, some would say 00:01:43.080 |
cynically that Apple seem to be producing more papers, quote, debunking AI than actually improving AI, 00:01:49.100 |
but let's set that cynicism aside. The paper essentially claimed that large language models don't follow 00:01:54.420 |
explicit algorithms and struggle with puzzles when there are sufficient degrees of complexity. 00:02:00.460 |
Puzzles like the Tower of Hanoi challenge, where you've got to move a tower of discs from one place 00:02:06.440 |
to another, but never place a larger disc atop a smaller one. They also tested the models on games 00:02:11.740 |
like checkers, where you've got to move the blue tokens all the way to the right and the red tokens to 00:02:16.940 |
the left, following the rules of checkers. And games like River Crossing, which might be more familiar to you 00:02:21.820 |
as the fox and chicken challenge, where you've got to go to the other side of the river without leaving 00:02:25.900 |
the fox with the chicken. All of these games, of course, can be and were scaled up in complexity the 00:02:31.400 |
more pieces you introduce. If models were a pre-programmed set of algorithms like a calculator, 00:02:37.260 |
then it shouldn't matter how many discs or checkers or blocks you have, performance should be 100% 00:02:42.360 |
all the time. Shocker, the paper showed that they're not that and performance dropped off noticeably the 00:02:48.160 |
more complex the task got. But this has been known for years now about large language models. They're 00:02:53.200 |
not traditional software, where the same input always leads to the same output. Nor, of course, 00:02:58.340 |
are they fully randomised either, otherwise they couldn't pass a single benchmark. They are probabilistic 00:03:03.780 |
neural networks, somewhere in between the two extremes. And the perfect example comes with multiplication. 00:03:09.400 |
Again, I could have added breaking to the title of this video, but this has been known about for 00:03:15.280 |
several years now. If you don't give models access to any tools and ask them to perform a multiplication, 00:03:20.620 |
then the moment the digits of the multiplication get too large, they start to fail dramatically. 00:03:26.880 |
Not occasionally getting it right, just never getting the sum right. If the number of digits is 00:03:31.980 |
small enough, the models can reason their way to the correct answer. And as you can see in the difference 00:03:36.820 |
between o1-mini from OpenAI and o3-mini, performance is incrementally improving. In other words, it takes 00:03:43.480 |
a bigger number of digits to flummox the latest models. But again, it must be emphasised that even 00:03:49.540 |
with the very latest, the very best models you can access, if you don't give them tools, they will 00:03:54.860 |
eventually reach a point where they just simply can't multiply two numbers. But this will always be the 00:03:59.680 |
case because these models aren't designed to be fully predictable. They're designed to be generative. 00:04:04.420 |
They're not designed to be software, they're designed to use software. They want to produce 00:04:09.180 |
plausible outputs, which is why they'll hallucinate when you ask them questions they can't handle. 00:04:14.320 |
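If you want to see that drop-off for yourself, here is a rough sketch of the kind of harness you could use. To be clear, ask_model below is a hypothetical placeholder, not a real API call; you would wire it up to whichever chat model you have access to, with tools switched off:

```python
import random

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder for a real chat-model call with tools disabled."""
    raise NotImplementedError("wire this up to your chosen chat API")

def probe(digits: int, trials: int = 20) -> float:
    """Fraction of n-digit multiplications the model gets exactly right."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = ask_model(f"What is {a} * {b}? Reply with only the number.")
        # Python's integers are exact, so a * b is the ground truth.
        if reply.strip().replace(",", "") == str(a * b):
            correct += 1
    return correct / trials

# Expect near-100% at 2-3 digits and a sharp collapse somewhere past ~8-10 digits.
# for d in range(2, 16):
#     print(d, probe(d))
```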
Here, for example, I gave a calculation to Claude 4 Opus, the latest model from Anthropic, 00:04:18.760 |
and Gemini 2.5 Pro, the latest model from Google DeepMind, but I didn't give them access to tools. 00:04:23.780 |
They were never going to get this right, but rather than say, I don't know, they just hallucinated the 00:04:29.020 |
answer in both cases. The funny thing was that these answers were plausible in that they ended 00:04:33.980 |
in 2s and began with 6-7, which the correct answer does. These models are, after all, very convincing 00:04:39.840 |
BSers. But what the paper ignored is that these models can use tools and use them very effectively. 00:04:45.400 |
Here's that same Claude 4 Opus, but this time allowed to use code. It got the answer right, 00:04:50.960 |
and notice I didn't even say use code or use a tool. It knew to do so. 00:04:57.220 |
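For what it's worth, the reason "use code" works is mundane: the snippet the model writes runs as ordinary software, and software does exact arithmetic. A minimal sketch of the idea, with made-up numbers rather than the ones from my prompt:

```python
# The kind of code a model might hand off to a Python tool: ordinary software,
# and Python integers are arbitrary-precision, so the product is exact
# no matter how many digits are involved.
# (Operands invented purely for illustration.)
a = 987_654_321_234_567
b = 876_543_219_876_543
print(a * b)  # exact every time, unlike a free-form text guess
```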
So for me, what was surprising is that this Apple paper found it surprising that large reasoning models, as they call them, 00:05:02.840 |
can't perform exact computation. We know they can't. Now, several other people before me have pointed 00:05:07.920 |
out another fatal weakness with the paper, which is that they describe accuracy ultimately collapsing 00:05:12.960 |
towards zero beyond a certain level of complexity. That's because models are constrained in how many tokens, 00:05:18.420 |
or parts of a word, if you like, that they can output in one go. In the case of the Claude model 00:05:23.640 |
from Anthropic that was tested, that token limit is 128,000 tokens. But some of the questions tested 00:05:29.300 |
required more than that number of tokens. So even if the models were trained to be calculators, 00:05:35.280 |
which they're not, they weren't given enough space to output the requisite number of tokens. 00:05:40.640 |
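To make that concrete, here is a small sketch of the arithmetic. The optimal Tower of Hanoi solution needs 2^n - 1 moves; the tokens-per-move figure below is my own rough assumption for illustration, not a number from the paper:

```python
def hanoi_moves(n: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n discs."""
    return 2 ** n - 1

TOKENS_PER_MOVE = 10      # rough assumption, purely for illustration
TOKEN_CAP = 128_000       # the output limit quoted above for the tested Claude model

for n in range(8, 21):
    tokens = hanoi_moves(n) * TOKENS_PER_MOVE
    verdict = "fits" if tokens <= TOKEN_CAP else "over the cap"
    print(f"{n} discs: {hanoi_moves(n):>9,} moves ~ {tokens:>9,} tokens -> {verdict}")

# Under these assumptions, somewhere around 14 discs the move list alone
# would blow past the output limit, before any reasoning is written down.
```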
For me then, it's to the credit of the models that they recognised their own output limits, 00:05:45.520 |
and then outputted what the paper calls shorter traces, basically giving up, because they, quote, 00:05:50.200 |
knew they wouldn't have the space to output the required answer. Instead, the models would output 00:05:55.540 |
things like, here is the algorithm you need to use, or the tool you need to use, which I think is 00:06:01.260 |
reasonable. One quick detail that I think many people missed is the paper actually admits that 00:06:05.100 |
it originally wanted to compare thinking versus non-thinking models. You know, the ones that output 00:06:09.740 |
long chains of thought versus those that don't, on math benchmarks. Because the results didn't quite 00:06:14.860 |
conform to the narrative they were expecting, and thinking models did indeed outperform non-thinking 00:06:20.260 |
models with the same compute budget, they actually abandoned the math benchmark and then resorted to 00:06:25.300 |
the puzzles. I guess what I'm saying is I slightly feel like the authors came to testing the thinking 00:06:31.000 |
models with a preconceived notion about their lack of ability. Another learning moment for us all from the 00:06:36.720 |
paper comes from the Apple authors' surprise that when they provide the algorithm in the prompt, 00:06:44.580 |
the algorithm to solve these puzzles, the models still often fail. They're surprised by this and deem it 00:06:50.680 |
noteworthy because they say surely finding the solution requires more computation than merely executing a given 00:06:58.640 |
algorithm. But you guys have twigged this all by now. These are not calculators. They're not designed for 00:07:04.560 |
executing algorithms. Because they are instead neural networks that are probabilistic, even if there is a 00:07:10.380 |
99.9% chance that they output the correct next step, when there's millions of steps involved, they will 00:07:16.440 |
eventually make a mistake. Remember multiplication, where of course the language models know, quote, the 00:07:21.640 |
algorithm to perform a multiplication step. Indeed, the models themselves are built on matrix multiplication, but that 00:07:27.740 |
does not mean that, given enough required steps, they won't start making mistakes. 00:07:33.560 |
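To put a rough number on that compounding risk, here is a back-of-the-envelope sketch; the 99.9% figure is just the illustrative per-step accuracy mentioned above, not a measured value:

```python
# Even at 99.9% per-step accuracy, a long enough chain almost guarantees a slip.
p_step = 0.999
for steps in (100, 1_000, 10_000, 1_000_000):
    p_perfect = p_step ** steps
    print(f"{steps:>9,} steps -> P(no mistakes) = {p_perfect:.3e}")
# ~0.905 at 100 steps, ~0.368 at 1,000, ~4.5e-05 at 10,000, effectively 0 at 1,000,000
```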
The conclusion of the paper then teed things up for the headline writers because they say, "we may be encountering fundamental 00:07:39.080 |
barriers to generalizable reasoning." Now do forgive me for pointing this out, but that limitation to 00:07:44.960 |
generalizable reasoning has been pointed out by experts like Professor Rao, who I interviewed back in 00:07:51.020 |
December of 2023 on my Patreon. This is not a "breaking news" type of situation. You may also find it 00:07:57.700 |
interesting that one researcher used Claude 4 Opus and named that model as a co-author in a paper pointing 00:08:05.220 |
out all the flaws of the Apple paper. Flaws even I missed, like certain of the questions being logically 00:08:11.640 |
impossible to solve as posed. So no, to quote an article featured in The Guardian from Gary 00:08:18.920 |
Marcus, the tech world is not reeling from a paper that shows the powers of the new generation of AI have been 00:08:24.440 |
wildly oversold. I would go as far as to say that there isn't a single serious AI researcher that would 00:08:29.880 |
have been surprised by the results of that paper. That is not of course to say that these models don't 00:08:35.020 |
make basic reasoning mistakes in simple scenarios, or at least semi-simple scenarios. I'm the author of a 00:08:41.480 |
benchmark called SimpleBench designed to test models on such scenarios. For example, I tested the brand new 00:08:48.020 |
o3 Pro on this scenario in which models tend not to spot that the glove would just fall onto the road. This is 00:08:55.780 |
despite, by the way, thinking for 18 minutes. If you want to learn more about SimpleBench, by the way, the link is in the 00:09:01.180 |
description. And I'll end this video with my recommendation of which model you should check out if you're just used to, say, 00:09:07.460 |
the free ChatGPT. The o3 Pro API from OpenAI failed, by the way, which is why we don't have a result yet for that model. 00:09:15.060 |
Of course, the failure modes go far beyond the simple scenarios featured in SimpleBench. Here's one quirk many of you may not know about. 00:09:21.860 |
This is the brand new Veo 3 from Google Gemini. And I said, "Output a London scene with absolutely zero lampposts, 00:09:29.460 |
not a single lamppost in sight." Of course, if you lean in to the hallucinations of generative models, 00:09:37.060 |
you can get creative outputs like this one from Veo 3. 00:09:46.740 |
Now obviously this is a quirky advert, but think of the immense amount of money that the company "saved" 00:09:53.700 |
by not using actors, sets, or props for this advert. That's why I don't want you guys to be as shocked 00:09:59.620 |
as this Sky News presenter in Britain, and this has got hundreds of thousands of views, where he noticed 00:10:05.220 |
ChatGPT hallucinating an answer, in this case a transcript. This generated several news segments, 00:10:11.220 |
as well as this article saying, "Can we trust ChatGPT despite it hallucinating answers?" 00:10:16.180 |
This all then comes back to us being able to hold two thoughts in our head at the same time, 00:10:20.740 |
which is that LLMs are swiftly catching up on human performance across almost all text-based domains. 00:10:26.740 |
But they have almost no hesitation in generating mistruths, you could say like many humans. 00:10:34.020 |
So if human performance is your yardstick, they are catching up fast and can BS like the best of us. 00:10:40.580 |
But language models like ChatGPT, Gemini, and Claude are not supercomputers, they're not the kind of 00:10:46.020 |
AI that can, for example, predict the weather. Their real breakthroughs, as with human breakthroughs, 00:10:50.820 |
come when they use tools in an environment that corrects their BS for them. That can lead to genuine 00:10:57.220 |
scientific advance, and if you want to learn more about that, check out my AlphaEvolve video. 00:11:01.460 |
Frankly, having made that video, I was quite surprised to hear Sam Altman say it's going to be 2026 when we 00:11:07.540 |
see the arrival of systems that can figure out novel insights. As far as I'm concerned, we have that 00:11:12.580 |
now. Again, not LLMs on their own, but LLMs in combination with symbolic systems. 00:11:18.660 |
So, while language models can't yet solo superintelligence, which one should you use? 00:11:23.620 |
Well, let me give you one cautionary word on benchmarks and a little bit of advice. 00:11:28.580 |
In just the last 48 hours, we got o3 Pro from OpenAI, at the $200 tier. I'm sure though that it will 00:11:35.540 |
eventually filter down to the $20 tier. And of course, the benchmark results were pretty impressive. 00:11:41.300 |
On competition level mathematics, 93%, on really hard PhD level science questions, 84%, and competitive 00:11:48.820 |
coding, you can see the Elo ranking here. My cautionary note though comes from the results you can see below 00:11:54.740 |
for the o3 model, not o3 Pro, but the o3 model that OpenAI showcased on day 12 of its '12 Days of OpenAI' event in December 00:12:02.660 |
2024. As you can see, today's o3 Pro mostly underperforms that system teased back in December. 00:12:09.620 |
So that's the cautionary note that you often have to look beyond the headline benchmark results to see 00:12:15.140 |
how these models perform on your use case. The word of advice is that when you're looking at benchmarks, 00:12:20.020 |
companies will often either not compare to other model providers at all, as in the case of OpenAI these 00:12:26.180 |
days, or like Anthropic with their Claude series of models, they will show you multiple benchmarks, 00:12:31.380 |
but not be terribly clear about the multiple parallel attempts they took to get their record high scores, 00:12:37.220 |
or about the serious usage limitations they have for their bigger model, or the massively elevated price 00:12:43.780 |
for that model. Which brings me to my current recommendation if you just want to use a model for 00:12:48.580 |
free, albeit with caps of course, and that would be Google's Gemini 2.5 Pro. Yes, I am slightly 00:12:54.580 |
influenced by its top score on SimpleBench, and the fact you get a few uses of the Veo video generator 00:13:00.900 |
model. An honorary mention goes to DeepSeek R1, which is very cheap via the API, and at least comes with a 00:13:07.780 |
technical report that we can all read through. Many of you commented this month that you saw a pretty 00:13:12.820 |
noticeable boost in production quality for my DeepSeek documentary, and there's more to come 00:13:18.580 |
where that came from. But that boost was in no small part due to my video editor choosing Storyblocks, 00:13:24.980 |
the sponsors of today's video. We picked them actually before any sponsorship, partly due to the unlimited 00:13:31.620 |
downloads of varied high quality media at their set subscription cost, but partly due to the clear-cut 00:13:38.420 |
licensing, wherein anything we downloaded with Storyblocks was 100% royalty-free. If you want to 00:13:45.140 |
get started with unlimited stock media downloads, head to storyblocks.com/aiexplained, link in the description. 00:13:52.500 |
I hope that helped give you some signal amongst the noise, but either way, I hope you have a very wonderful day.