Almost no one has the time to investigate headlines like this one, seen by tens of millions of people: the AI models don't actually reason at all, they just memorize patterns, AGI is mostly hype, and even the underlying Apple paper quoted says it's an "illusion of thinking". This was picked up in mainstream outlets like The Guardian, which called it a pretty devastating Apple paper.
So what are people supposed to believe when half the headlines are about an imminent AI job apocalypse, and the other half are about LLMs all being fake? Well, hopefully you'll find that I'm not trying to sell a narrative. I'll just say what I found, having read the 30-page paper in full and the surrounding analyses.
I'll also end with a recommendation on which model you should use, and yes, touch on the brand new o3 Pro from OpenAI, although I would say that the $200-per-month price to access that model is not for the unwashed masses like you guys. Some very quick context on why a post like this one gets tens of millions of views and coverage in the mainstream media, and no, it's not just because of the unnecessarily frantic "BREAKING" at the start.
It's also because people hear the claims made by the CEOs of these AI labs, like Sam Altman yesterday, posting: "Humanity is close to building digital superintelligence. We're past the event horizon. The takeoff has started." While the definitions of those terms are deliberately vague, you can understand people paying attention.
People can see for themselves how quickly large language models are improving, and they can read the headlines generated by the CEO of Anthropic saying there is a "white-collar bloodbath" coming. It's almost every week now that we get headlines like this one in the New York Times, so it's no wonder people are paying attention.
Now, some would say cynically that Apple seem to be producing more papers, quote, "debunking" AI than actually improving AI, but let's set that cynicism aside. The paper essentially claimed that large language models don't follow explicit algorithms and struggle with puzzles once they reach a sufficient degree of complexity. Puzzles like the Tower of Hanoi challenge, where you've got to move a tower of discs from one place to another, but never place a larger disc atop a smaller one.
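For anyone who hasn't seen it, the explicit algorithm in question is tiny. Here's a minimal recursive sketch in Python, my own illustration rather than code from the paper, just to show what faithfully executing it would involve:

```python
def hanoi(n, source, target, spare, moves):
    """Append the moves needed to shift n discs from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the smaller discs out of the way
    moves.append((source, target))               # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller discs on top of it

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 1023 moves: always 2**n - 1
```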
They also tested the models on games like checkers, where you've got to move the blue tokens all the way to the right and the red tokens to the left, following the rules of checkers. And games like River Crossing, which might be more familiar to you as the fox and chicken challenge, where you've got to go to the other side of the river without leaving the fox with the chicken.
All of these games, of course, can be, and were, scaled up in complexity by introducing more pieces. If models were a pre-programmed set of algorithms like a calculator, then it shouldn't matter how many discs or checkers or blocks you have; performance should be 100% all the time. Shocker: the paper showed that they're not, and performance dropped off noticeably the more complex the task got.
But this has been known for years now about large language models. They're not traditional software, where the same input always leads to the same output. Nor, of course, are they fully randomised either, otherwise they couldn't pass a single benchmark. They are probabilistic neural networks, somewhere in between the two extremes.
And the perfect example comes with multiplication. Again, I could have added "breaking" to the title of this video, but this has been known about for several years now. If you don't give models access to any tools and ask them to perform a multiplication, then the moment the numbers involved get too many digits, they start to fail dramatically.

Not occasionally getting it right, just never getting the sum right. If the number of digits is small enough, the models can reason their way to the correct answer. And as you can see in the difference between o1-mini from OpenAI and o3-mini, performance is incrementally improving. In other words, it takes a bigger number of digits to flummox the latest models.
But again, it must be emphasised that even with the very latest, the very best models you can access, if you don't give them tools, they will eventually reach a point where they simply can't multiply two numbers. But this will always be the case, because these models aren't designed to be fully predictable.
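If you want to see why the digit count matters, here's the schoolbook long-multiplication procedure written out in Python, again my own illustration and not something from the paper. Every extra digit adds another round of exact carry-and-add steps, and each one is a chance for a probabilistic next-token predictor to slip:

```python
def schoolbook_multiply(a: int, b: int) -> int:
    """Long multiplication, digit by digit -- the procedure models can describe,
    but have to execute perfectly at every single step."""
    total = 0
    for i, da in enumerate(reversed(str(a))):        # each digit of a, least significant first
        carry, partial = 0, []
        for db in reversed(str(b)):                  # multiplied by each digit of b
            carry, d = divmod(int(da) * int(db) + carry, 10)
            partial.append(str(d))
        if carry:
            partial.append(str(carry))
        total += int("".join(reversed(partial))) * 10**i
    return total

assert schoolbook_multiply(987654321, 123456789) == 987654321 * 123456789
```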
They're designed to be generative. They're not designed to be software, they're designed to use software. They want to produce plausible outputs, which is why they'll hallucinate when you ask them questions they can't handle. Here, for example, I gave a calculation to Claude 4 Opus, the latest model from Anthropic, and Gemini 2.5 Pro, the latest model from Google DeepMind, but I didn't give them access to tools.
They were never going to get this right, but rather than say, I don't know, they just hallucinated the answer in both cases. The funny thing was that these answers were plausible in that they ended in 2s and began with 6-7, which the correct answer does. These models are, after all, very convincing BSers.
But what the paper ignored is that these models can use tools and use them very effectively. Here's that same Claude 4 Opus, but this time allowed to use code. It got the answer right, and notice I didn't even say use code or use a tool. It knew to do so.
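For the curious, the loop behind "letting the model use code" looks roughly like this. It's only a sketch: call_model is a hypothetical stand-in for whichever provider's API you use, and a real version would sandbox the execution rather than calling eval:

```python
import re

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real chat API call (OpenAI, Anthropic, etc.).
    Here it just returns a canned reply so the sketch runs end to end."""
    return "Too many digits to do in my head: <code>987654321 * 123456789</code>"

def answer_with_code(question: str) -> str:
    # Ask the model to hand back a Python expression instead of a guessed number.
    reply = call_model(
        question + "\nIf exact arithmetic is needed, reply only with a "
        "Python expression inside <code></code> tags."
    )
    match = re.search(r"<code>(.*?)</code>", reply, re.DOTALL)
    if match:
        # The exact computation happens in the Python runtime, not in the model's weights.
        return str(eval(match.group(1), {"__builtins__": {}}))
    return reply  # the model answered directly

print(answer_with_code("What is 987654321 multiplied by 123456789?"))
# 121932631112635269
```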
So for me, what was surprising is that this Apple paper found it surprising that large reasoning models, as they call them, can't perform exact computation. We know they can't. Now, several other people before me have pointed out another fatal weakness of the paper: the accuracy it describes as ultimately collapsing towards zero beyond a certain level of complexity collapses in large part because models are constrained in how many tokens, or parts of a word if you like, they can output in one go.

In the case of the Claude model from Anthropic that was tested, that token limit is 128,000 tokens. But some of the questions tested required more than that number of tokens.
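To put rough numbers on that: the optimal Tower of Hanoi solution takes 2^n minus 1 moves, so the required output grows exponentially with the number of discs. Here's a back-of-the-envelope check, where the ten-tokens-per-move figure is my own assumption for illustration, not a measured value:

```python
# Rough budget check: optimal Hanoi solutions need 2**n - 1 moves. Assuming
# (hypothetically) ~10 output tokens per written-out move, the full move list
# outgrows a 128,000-token window at around 14 discs.
TOKENS_PER_MOVE = 10      # assumption for illustration only
LIMIT = 128_000

for n in range(5, 21):
    tokens = (2**n - 1) * TOKENS_PER_MOVE
    if tokens > LIMIT:
        print(f"{n} discs -> ~{tokens:,} tokens, over the {LIMIT:,} limit")
        break
```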
So even if the models were trained to be calculators, which they're not, they weren't given enough space to output the requisite number of tokens. For me then, it's to the credit of the models that they recognised their own output limits and then outputted what the paper calls "shorter traces", basically giving up, because they, quote, knew they wouldn't have the space to output the required answer.
Instead, the models would output things like "here is the algorithm you need to use" or "here is the tool you need to use", which I think is reasonable. One quick detail that I think many people missed is that the paper actually admits it originally wanted to compare thinking versus non-thinking models.
You know, the ones that output long chains of thought versus those that don't, on math benchmarks. Because the results didn't quite conform to the narrative they were expecting, and thinking models did indeed outperform non-thinking models with the same compute budget, they actually abandoned the math benchmarks and resorted to the puzzles.
I guess what I'm saying is I slightly feel like the authors came to testing the thinking models with a preconceived notion about their lack of ability. Another learning moment for us all comes from the Apple authors' surprise that when they provide the algorithm in the prompt, the algorithm to solve these puzzles, the models still often fail.
They're surprised by this and deem it noteworthy because, they say, surely finding the solution requires more computation than merely executing a given algorithm. But you guys have all twigged this by now: these are not calculators, and they're not designed for executing algorithms. Because they are instead probabilistic neural networks, even if there is a 99.9% chance that they output the correct next step, when there are millions of steps involved they will eventually make a mistake.
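That compounding is easy to quantify. Treating each step as an independent event with a 99.9% success rate is a toy model rather than a measurement, but it makes the point:

```python
# Chance of an entirely error-free run if each step independently
# succeeds with probability p (a toy model of per-step reliability).
p = 0.999
for steps in (100, 1_000, 10_000):
    print(f"{steps:>6} steps -> {p**steps:.4%} chance of a flawless run")
# 100 steps -> ~90.5%, 1,000 -> ~36.8%, 10,000 -> ~0.0045%
```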
Remember multiplication, where of course the language models know the, quote, "algorithm" to perform a multiplication step. Indeed, the models themselves are derived through matrix multiplication, but that does not mean that, given enough required steps, they won't start making mistakes. The conclusion of the paper then teed things up for the headline writers, because they say "we may be encountering fundamental barriers to generalizable reasoning".
Now do forgive me for pointing this out, but that, quote, limitation to generalizable reasoning has been pointed out by experts like Professor Rao, who I interviewed back in December of 2023 on my Patreon. This is not a "breaking news" type of situation. You may also find it interesting that one researcher used Claude 4 Opus and named that model as a co-author in a paper pointing out all the flaws of the Apple paper.
Flaws even I missed, like some of the questions being logically impossible to solve. So no, to quote an article featured in The Guardian from Gary Marcus, the tech world is not "reeling from a paper that shows the powers of the new generation of AI have been wildly oversold".
I would go as far as to say that there isn't a single serious AI researcher that would have been surprised by the results of that paper. That is not of course to say that these models don't make basic reasoning mistakes in simple scenarios, or at least semi-simple scenarios. I'm the author of a benchmark called SimpleBench designed to test models on such scenarios.
For example, I tested the brand new o3 Pro on this scenario, in which models tend not to spot that the glove would just fall onto the road. That's despite the model thinking for 18 minutes. If you want to learn more about SimpleBench, by the way, the link is in the description.
And I'll end this video with my recommendation of which model you should check out if you're just used to, say, the free ChatGPT. The o3 Pro API from OpenAI failed, by the way, which is why we don't have a SimpleBench result yet for that model. Of course, the failure modes go far beyond the simple scenarios featured in SimpleBench.
Here's one quirk many of you may not know about. This is the brand new Veo 3 from Google. And I said, "Output a London scene with absolutely zero lampposts, not a single lamppost in sight." Of course, if you lean in to the hallucinations of generative models, you can get creative outputs like this one from Veo 3.
Now obviously this is a quirky advert, but think of the immense amount of money that the company "saved" by not using actors, sets, or props. That's also why I don't want you guys to be as shocked as this Sky News presenter in Britain, in a clip that has got hundreds of thousands of views, when he noticed ChatGPT hallucinating an answer, in this case a transcript.
This generated several news segments, as well as this article saying, "Can we trust ChatGPT despite it hallucinating answers?" This all then comes back to us being able to hold two thoughts in our head at the same time, which is that LLMs are swiftly catching up on human performance across almost all text-based domains.
But they have almost no hesitation in generating mistruths, much like many humans, you could say. So if human performance is your yardstick, they are catching up fast and can BS like the best of us. But language models like ChatGPT, Gemini, and Claude are not supercomputers; they're not the kind of AI that can, for example, predict the weather.
Their real breakthroughs, as with human breakthroughs, come when they use tools in an environment that corrects their BS for them. That can lead to genuine scientific advances, and if you want to learn more about that, check out my AlphaEvolve video. Frankly, having made that video, I was quite surprised to hear Sam Altman say it's going to be 2026 when we see the arrival of systems that can figure out novel insights.
As far as I'm concerned, we have that now. Again, not LLMs on their own, but LLMs in combination with symbolic systems. So, while language models can't yet solo superintelligence, which one should you use? Well, let me give you one cautionary word on benchmarks and a little bit of advice.
In just the last 48 hours, we got o3 Pro from OpenAI, at the $200 tier. I'm sure, though, that it will eventually filter down to the $20 tier. And of course, the benchmark results were pretty impressive: on competition-level mathematics, 93%; on really hard PhD-level science questions, 84%; and for competitive coding, you can see the Elo rating here.
My cautionary note, though, comes from the results you can see below for the o3 model, not o3 Pro, but the o3 model that OpenAI showcased on day 12 of its "12 Days of OpenAI" event in December 2024. As you can see, today's o3 Pro mostly underperforms that system teased back in December. So that's the cautionary note: you often have to look beyond the headline benchmark results to see how these models perform on your use case.
The word of advice is that when you're looking at benchmarks, companies will often either not compare to other model providers at all, as is the case with OpenAI these days, or, like Anthropic with their Claude series of models, show you multiple benchmarks without being terribly clear about the multiple parallel attempts taken to get their record-high scores, about the serious usage limits on their bigger model, or about that model's massively elevated price.
Which brings me to my current recommendation if you just want to use a model for free, albeit with caps of course, and that would be Google's Gemini 2.5 Pro. Yes, I am slightly influenced by its top score on SimpleBench, and by the fact you get a few uses of the Veo video generation model.
An honourable mention goes to DeepSeek R1, which is very cheap via the API, and at least comes with a technical report that we can all read through. Many of you commented this month that you saw a pretty noticeable boost in production quality for my DeepSeek documentary, and there's more where that came from.
But that boost was in no small part down to my video editor choosing Storyblocks, the sponsors of today's video. We picked them, actually, before any sponsorship, partly due to the unlimited downloads of varied high-quality media at a set subscription cost, but also due to the clear-cut licensing, wherein anything we downloaded with Storyblocks was 100% royalty-free.
If you want to get started with unlimited stock media downloads, head to storyblocks.com/aiexplained, link in the description. I hope that helped give you some signal amongst the noise, but either way, I hope you have a very wonderful day.