
AI Won't Be AGI, Until It Can At Least Do This (plus 6 key ways LLMs are being upgraded)


Transcript

Can you spot the following pattern? If you look at this grid here and see how it's been transformed into this grid, and likewise how this grid on the left has been transformed into this grid, can you spot that pattern? Could you, in other words, predict what would happen to this grid here in the test example?

Well, it might shock you to learn that language models like GPT-4 can't do this. They are terrible at noticing that the little squares are being filled in with a darker shade of blue. This specific abstract reasoning challenge was not in the training data set of GPT-4 and that model cannot generalize from what it has seen to solve the challenge.
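To make the format concrete, here is a minimal, purely illustrative sketch of what an ARC-style task boils down to as data; the grids and the candidate rule are invented for illustration and are not taken from the real ARC dataset.

```python
# A purely illustrative ARC-style task as data: a few input/output grid pairs plus a
# test input. The grids and the rule below are made up, not real ARC tasks.
train_pairs = [
    # each cell is a colour index; here 1 = light blue, 8 = darker blue, 0 = black
    ([[0, 1, 0],
      [1, 1, 1],
      [0, 1, 0]],
     [[0, 8, 0],
      [8, 8, 8],
      [0, 8, 0]]),
    ([[1, 1],
      [0, 1]],
     [[8, 8],
      [0, 8]]),
]
test_input = [[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]]

def candidate_rule(grid):
    """One hypothesis a solver might form: recolour every light-blue cell a darker blue."""
    return [[8 if cell == 1 else cell for cell in row] for row in grid]

# The whole game: infer the rule from the handful of train pairs, then apply it unseen.
assert all(candidate_rule(inp) == out for inp, out in train_pairs)
print(candidate_rule(test_input))
```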

It's not generally intelligent, it's not an artificial general intelligence. Now, you might think that's a minor quibble, but it gets to the heart of why current generation AI is not AGI and frankly isn't even close. And no, neither will this problem be solved by a simple naive scaling up of our models.

But this video isn't just about picking out one flaw, albeit a critical one, in our current LLMs. It's more my attempt to address this swirling debate about whether AI is overhyped or underhyped. For many, AI is nothing but hype and is a giant bubble, while for others, AGI has already arrived or is just months away.

But you don't need an opinion, what does the evidence show? This video will draw upon dozens of papers and reports to give you the best snapshot I can on LLMs and AI more broadly. I'm going to start, I'm afraid, with so much of what's wrong with the current landscape, from delayed releases, overpromises, and my biggest current concern, a tragedy of the commons from AI slop.

But then I will caution those who might throw the baby out with the bathwater, giving six detailed evidence-based pathways, from the LLMs we have today, to substantially more powerful and useful models. Including systems that, yes, perform decently, even on that abstract reasoning challenge I showed you earlier. But let's start with the dodgy stuff in "AI".

And no, I'm not referring to the fact that the creators and funders of this ARC AGI challenge are so confident that current generation LLMs cannot succeed that they've put up a prize pool of over a million dollars. If GPT-4o were a pure reasoning engine, well then why is its performance negligible on this challenge?

But no, I'm actually referring to the landscape of over-promising and under-delivering. You might remember Demis Hassabis referring to the original Gemini model when it was launched as being "as good as human experts". But Google has had to roll back its LLM-powered AI overview feature because there were simply far too many mistakes.

If Gemini was as good as human experts, as some benchmarks were claiming to show, then why wouldn't it be better than a random Google search? But surely the newly announced Apple Intelligence will be far better though? Well, aside from the fact that we can't actually test it, no, Tim Cook admitted that it still hallucinates all the time.

Now to some, as we'll see, that's actually the point of LLMs, we want them to be creative. But to others, it smells more like BS. This particular paper, aside from being quite funny, gave one clear take-home message. Language models aren't actually designed to be correct. They aren't designed to transmit information.

So don't be surprised when their assertions turn out to be false. Of course, people are working on that and the paper even makes reference to the Let's Verify Step-by-Step paper, more on that later, but the point remains. So we have the hallucinations in AI and then the hallucinations in the marketing about AI.

And the fact that we have AI-powered toothbrushes isn't even my primary concern. And nor is it actually that we occasionally have those wildly overhyped products like the Rabbit R1 and the Humane AI Pin. Then there's the possible breaches in privacy that features like Microsoft's Recall seem to inevitably invite.

I don't know about you, but it was never appealing to me to have an LLM analyse screenshots taken of my desktop every few seconds. Imagine the poor LLM trying to sift through all of the creations that some are likely to come up with using OpenAI's tools. You might think I've reached the end of my dodgy list, but I'm not actually even close.

What about the increase in academics using LLMs to write or polish papers? You can see the recent and dramatic increase in the use of the word 'delve' on papers on PubMed. For me, as soon as I suspect an article I'm reading is LLM generated, I just discount it heavily.

Then we get the delayed releases. Now this one is arguably a bit more forgivable, but we were promised GPT-4o within a few weeks. I think we'd all prefer a tradition in which features are announced only once they're actually available. But I will come back to GPT-4o in a moment.

Now you might be different, but for me the number one concern at the moment is just AI generated slop. Take this tool where on LinkedIn you can imitate the writing of someone in your field or industry. Are they going viral while you get little to no engagement? Copy their tone and address the same topic.

And I would call this a kind of tragedy of the commons. For the individuals using this, and I'm not meaning to pick on one individual tool, it's probably pretty helpful. It probably does boost engagement and help smooth out any language issues. But as we've seen of late on Facebook, it just leads to this general AI-generated miasma.

Bots engaging with bots. Gullible people drawn in and fooled. A landscape where increasingly you can't trust what you see or even hear. And that's before even getting to the topic of deepfakes, which of course can acutely affect the individuals concerned. You can, of course, let me know if you agree.

But for me, this is at the moment the number one concern and it's hard to see how it would be stopped. At this point, you're probably thinking, damn, this is pretty negative on AI. But as you can see, the video has plenty of time to go. But at this point, my rejoinder to some is that it's all too easy to fall into diametrically opposed camps.

Just as we saw it last year with accelerationists and doomers, we're seeing it this year with those who say AI is nothing but hype and those who say AGI is imminent. But I think the world is just so much more complex than that. I reckon that if I were blind, I would forgive the occasional hallucination with GPT-4o and be grateful to have a model that can tell me about the world around me.

Oh, and interactively and in real time. Try and tell me exactly what they're doing right now, please. Um, right now, the ducks are gently gliding across the water. They're moving in a fairly relaxed manner, not in a hurry. Occasionally, one of them will dip its head under the water, probably looking for food, and then pop back up.

I even know when a taxi is coming with its orange light on. I think I'll hail it to get home. Yes, I spotted one just now. It's heading your way on the left side of the road. Get ready to wave it down. Great job hailing that taxi. It looks like you're all set to go.

That's a good dog right there, leading the way into the taxi. Safe travels. And remember, too, that as Project Astra from Google demonstrated, models are getting better at ingesting more and more tokens, more and more context. They are increasingly able to help users of any kind locate things not just in text, but in the real world.

Now, we'll get back to LLMs in a moment, but of course, it's worth remembering that there is far more to neural networks than just LLMs. A new study in Nature showed how you could use GANs, Generative Adversarial Networks, to predict the effects of untested chemicals on mice. Gen AI, in this case, was able to simulate a virtual animal experiment to generate profiles similar to those obtained from traditional animal studies.

The TL;DR is not just that the Gen AI predictions had less error, but they were produced much more quickly. And as the BBC reported, this is at least one tentative step toward an end to animal testing. And yes, that study was different to this one released a few days ago with Harvard and Google DeepMind, where they built a virtual rodent powered by AI.

Again, a highly realistic simulation, but this time down to the level of neural activity inside those real rats. Then of course, we have the good old convolutional neural networks for image analysis. Their use in the Brainomix eStroke system is, as we speak, helping stroke victims in the NHS. Essentially, it has enabled diagnoses to be made by clinicians much more quickly, which in the case of strokes is super important, of course, and that has tripled the number of patients recovering.

Now, the only tiny, tiny complaint I would make is that, as ever, the title just uses the phrase AI. It took me a disturbing amount of digging to track down the actual techniques used. And even then, of course, for commercial reasons, they don't say everything. But now I want to return to large language models, the central focus of this channel.

Now, I have discussed in the past on this channel some of the reasoning gaps that you can find in current generation large language models. But the ARC (Abstraction and Reasoning Corpus) challenge and prize, publicized this week by the legendary Francois Chollet, is a great opportunity to clarify what exactly our current models miss and what is being done to rectify that gap.

Hopefully, at least, the following will explain in part why our models can sometimes be shockingly dumb and shockingly smart. I'm going to leave in the background for a minute an example of the kind of challenge that models like GPT-4 fail at consistently. So here is, in 60 seconds, what the current issue is.

If language models haven't seen a solution to something in their training data, they won't be able to give you a solution when you test them. That's why models fail at this challenge. They simply haven't seen these tests before. Moreover, the models aren't generally intelligent. You can train them on millions of these kinds of examples, and people have tried, and they'll still fail on a new, fresh one.

Again, if that new, fresh example isn't in the training dataset, they will fail. If "the mother of Gabriel Macht" appears somewhere in the dataset, they will output the correct answer. If, however, "the son of Suzanne Victoria Pullia" does not appear in the dataset, the model will not know. It doesn't reason its way to the answer based on other parts of its training data.

So how can models do so well on certain benchmarks, like the MATH benchmark? Well, they can "recall" from their training dataset certain reasoning chains that they've seen before. That's enough in certain circumstances to get the answer right. So hang on a second: they can recall certain reasoning procedures, let's call them programs, but they can't create them.

Yes, it's a good news, bad news kind of situation. But for the remainder of this video, indeed for the remainder of LLMs, it will be important to remember that distinction. Recalling reasoning procedures or programs versus doing fresh reasoning itself. If it's seen it before, great. If it hasn't seen it before, not so great.

Okay, so we're almost ready for those six ways which might drag LLMs closer to that artificial general intelligence. But first, I want to quickly focus on one thought you may have had in reaction to this framework. Why not train language models on every reasoning procedure out there? Feed it data on any scenario it might encounter and wouldn't that be AGI?

Well, here's Francois Chollet, the author of the ARC challenge. He'll explain why memorization isn't enough because we don't see the whole map. If the world, if your life were a static distribution, then sure you could just brute force the space of possible behaviors. You can think of intelligence as a pathfinding algorithm in future situation space.

Like, I don't know if you're familiar with game development, like RTS game development, but you have a map, right? It's like a 2D map and you have partial information about it. There is some fog of war on your map, there are areas that you haven't explored yet, and you know nothing about them.

And then there are areas that you've explored, but you only know what they were like in the past. And now instead of thinking about a 2D map, think about the space of possible future situations that you might encounter and how they're connected to each other. Intelligence is a pathfinding algorithm.

So once you set a goal, it will tell you how to get there optimally. But of course, it's constrained by the information you have. It cannot pathfind in an area that you know nothing about. If you had complete information about the map, then you could solve the pathfinding problem by simply memorizing every possible path, every mapping from point A to point B.

Solve the problem with pure memory. But the reason you cannot do that in real life is because you don't actually know what's going to happen in the future. So if AI will be encountering novel situations, it will need to adapt on the fly. What would make me change my mind about that is basically if I start seeing a critical mass of cases where you show the model something it has not seen before.

A task that's actually novel from the perspective of its training data, something that's not in the training data, and if it can actually adapt on the fly. And that might be more possible than you think. Noam Brown of OpenAI, for one, is optimistic that LLMs will crack it. But again, it won't be a naive scaling up of data alone that gets us there.

Without examples or zero shot, as we've seen, models don't generalize from what they've seen to what they haven't. This paper demonstrates that in the visual domain, as we've already seen it in the text domain. No matter what neural network architecture or parameter scale they tested, models were data hungry.

Unlike a human child, they didn't learn in a sample efficient manner. Remember that a child might be shown one image of a camel with the caption camel and retain that term for life. However, midway through the paper, there was some tentative evidence that with enough scale, you can get decent results on rarely found concepts.

Those were represented by the "let it wag" dataset referring to the "long tail" of a distribution. Anyway, as you can see, models that performed at over 80% on this traditional, well-known ImageNet accuracy test, did perform fairly well on this "long tail" dataset. As they note, the gap in performance does seem to be decreasing for higher capacity models.

In other words, more data definitely helps, especially exponentially more data, but it definitely won't be enough. But even this paper pointed to some ways that this challenge might be overcome. They note possibilities for not only retrieval mechanisms, but compositional generalization. In other words, piecing together concepts that have been found in the training dataset to be able to recognize more complex ones.

And if you think that's impossible, I've got news for you in a moment. Just before I do though, I want to get to one other approach that I don't think will work. You may or may not have heard on the Twittersphere about the "Situational Awareness" 165-page report put out by a former OpenAI researcher.

I've done a full 45-minute breakdown on my Patreon, AI Insiders, but one key takeaway is this. He thinks the straight march to AGI will be achieved by scaling up the parameters and data of our current LLMs. But again, I think that's far too simplistic. Just throwing on more parameters and more data wouldn't resolve the kind of issues you've seen today.

Leopold Aschenbrenner also made some other somewhat crazier claims, but that's for this video. So it's time to get to the first of those six methods that I mentioned that might drag LLMs closer to something we might call a general intelligence. And this paper published late last year in Nature is about that compositionality I mentioned just a moment ago.

Perhaps models can't reason, but if they can better compose reasoning blocks into something more complex, might that be enough? Well, the authors prove that point, in principle at least, on just a 1.4 million parameter transformer-based model. Boldly, they claim, "Our results show how a standard neural network architecture, optimized for its compositional skills, can mimic human systematic generalization in a head-to-head comparison." And as the lead author of the paper retweeted, "The answer to better AI is probably not just more training data, but rather diversifying training strategies." TLDR, send the robot to algebra class every so often.

And the challenge, as you can see in the bottom left, was this. It might even remind you a little bit of the ARC challenge. The challenge was to work out what this made-up language fragment actually meant, based on these rules. Given enough time, humans tend to do fairly well at this challenge, like the ARC challenge, but models like GPT-4, as we'll see, flop hard.

Remember, GPT-4 reportedly has around 1.8 trillion parameters, versus the model they trained at 1.4 million. Anyway, with enough time in your case, and training in the case of their transformer model, it worked out that, for example, "Hu" means "two", whereas "Sa" means "green" and "Ri" means "light blue". There are other rules to work out, of course, and then those rules have to be composed together to work out the answer to the question.
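To make that set-up concrete, here is a toy sketch of the kind of compositional mapping involved; the word meanings and rules below are simplified stand-ins, not the paper's actual grammar.

```python
# Toy stand-in for the compositional task: primitive word meanings plus a repetition
# rule, inferred from examples and then applied to a phrase never seen in training.
PRIMITIVES = {"sa": "GREEN", "ri": "LIGHT_BLUE", "zup": "RED"}   # colour words (illustrative)
MODIFIERS = {"hu": 2, "fep": 3}                                  # "hu" = two, etc. (illustrative)

def interpret(phrase: str) -> list[str]:
    """'sa hu' -> ['GREEN', 'GREEN']; 'ri' -> ['LIGHT_BLUE'] under these made-up rules."""
    tokens = phrase.split()
    colour = PRIMITIVES[tokens[0]]
    count = MODIFIERS[tokens[1]] if len(tokens) > 1 else 1
    return [colour] * count

# Generalisation test: a combination of known pieces that never appeared during training.
print(interpret("zup fep"))   # ['RED', 'RED', 'RED']
```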

Now, remember, when they were tested, they were tested on a new configuration of words. They had to "understand" the made-up rules of a new language and apply them to phrases it hadn't been trained on. It was able to do so, in other words showing the first tentative hints of true reasoning.

It's only a very small step towards AGI, though, because when they tested these flickers of compositional reasoning on a new task, this tiny model failed. Okay, time for the next approach. As we've seen, language models have these reasoning chains, or programs, within them to solve challenges; they're just hard to find.

What about methods that improve our ability to find those programs within language models? That's what, in part, verifiers are about. A string of papers came out this week about using verifiers and Monte Carlo tree search to improve the mathematical reasoning of language models. Of course, covering all of them would be a video in and of itself, but in a nutshell, it's about this.

You can train a model to recognize faulty steps in a reasoning chain to pick out the bad programs. With Let's Verify Step-by-Step, which I've talked about many times before on this channel, that required human annotation, but Google DeepMind came up with an approach which was done in an automated fashion.

By automatically collecting solutions that led to a correct answer, and contrasting them with those that led to an incorrect one, they trained a process reward model. Think of it like a supervisor analyzing each step of a language model's outputs. Or, to use the analogy we've been using throughout this video, the process reward model could alert the language model the moment it's calling a faulty or inappropriate program, in theory.
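As a rough illustration of how such a verifier gets used, here is a minimal best-of-N sketch; the per-step scoring function is a dummy placeholder standing in for a real, trained process reward model.

```python
# Best-of-N selection with a step-level verifier. prm_score_step is a dummy stand-in
# for a trained process reward model that would return P(this step is correct).
from math import prod

def prm_score_step(step: str) -> float:
    # Placeholder heuristic purely for illustration; a real PRM is a trained model.
    return 0.9 if "=" in step else 0.6

def chain_score(steps: list[str]) -> float:
    # One common aggregation: the product of per-step correctness scores.
    return prod(prm_score_step(s) for s in steps)

def best_of_n(candidate_chains: list[list[str]]) -> list[str]:
    # The LLM proposes many reasoning chains; the verifier, not the LLM, picks one.
    return max(candidate_chains, key=chain_score)

candidates = [
    ["x + 3 = 7", "so x is ten"],                  # plausible-looking but faulty step
    ["x + 3 = 7", "therefore x = 7 - 3 = 4"],      # the chain a good PRM should prefer
]
print(best_of_n(candidates))
```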

But at least in this paper, there is a limit to this approach. Even when analyzing and deciding amongst over a hundred solutions, the performance started to plateau. For sure, it's still a huge boost on this math benchmark from 50% to around 70%, but there's a limit. Is that because it's not using human annotation like with Let's Verify?

Or is it a fundamental limitation with the approach? Perhaps if a language model doesn't have the requisite program in its training dataset to solve a math problem, it can't do so no matter what supervision it's given. Nevertheless, the principle is clearly established. We don't have to rely on the language model itself locating the correct and necessary program to solve a challenge.

We can at least help it along the way. And we know that there are even better verifiers out there, namely simulations and the real world. As I discussed with Jason Ma, the lead author of the Dr. Eureka paper. It's a feature because that means the LLM can sample, let's say, a hundred different solutions.

And your simulator in this case serves as an external verifier to see which ones are good. If your LLM doesn't have the ability to hallucinate, it's always deterministic, then the method will actually not work because you would only be able to generate one candidate per iteration, right? So it becomes very slow.

Every time the model generates any response, it's technically hallucinating. It's a sample from the probability distribution, right? And it's only a hallucination if it doesn't agree with what you think it should output, right? But in my case, I don't care if it agrees with what I think is a good reward.

I have something external to verify. So it's great the model can output a hundred different things. That makes the iterative evolutionary process much faster. So I think that's actually a blessing that, if you only think about applications like chatbots or agents, you may underappreciate. But if you think about the use case for large language models, or any foundation model, for discovery tasks or scientific discoveries, what you want is the model to be able to propose 10 different solutions.
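To sketch the shape of that loop in code, here is a minimal, hypothetical version; the proposal and simulator functions are placeholders, and only the sample, evaluate, back-prompt structure is the point.

```python
# The "LLM proposes, simulator verifies" loop in schematic form. Both functions below
# are hypothetical placeholders; only the structure of the loop matters here.
import random

def llm_propose_candidates(n: int, feedback: str) -> list[str]:
    # Placeholder for sampling n diverse candidate programs (e.g. reward functions).
    return [f"candidate_{random.randint(0, 10**6)}" for _ in range(n)]

def run_in_simulator(candidate: str) -> float:
    # Placeholder for training/evaluating a policy under this candidate and scoring it.
    return random.random()

best, feedback = None, "initial task description"
for generation in range(3):                       # a few evolutionary rounds
    candidates = llm_propose_candidates(n=16, feedback=feedback)
    best = max(candidates, key=run_in_simulator)
    feedback = f"this generation's winner: {best}"   # fold the winner back into the next prompt
print(best)
```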

Is it possible that we can turn hallucinations from a weakness to a strength? NVIDIA and others are, of course, working hard to find out. That second approach then can be summed up as using verifiers and other approaches to better locate the requisite programs within a large language model. And I will quickly mention another method for locating that latent knowledge.

Many-shot. Give a model tons and tons of examples of the kind of task you want it to perform, and it can better learn how to do so. It seems obvious, but it can lead to significant performance gains. But what about teaching language models new programs on the fly? This is what Francois Chollet calls active inference, and it's responsible for the current state-of-the-art score of 34% on the ARC AGI prize.
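Before we get into that, here is a quick toy sketch of what the many-shot idea amounts to in practice; the example format and the commented-out model call are assumptions for illustration.

```python
# A many-shot prompt is just a very long context of worked examples plus one new query.
def build_many_shot_prompt(solved_examples, new_problem, k=500):
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in solved_examples[:k])
    return f"{shots}\n\nProblem: {new_problem}\nSolution:"

solved = [("2 + 2", "4"), ("10 - 3", "7")] * 250     # stand-in for hundreds of real examples
prompt = build_many_shot_prompt(solved, "17 * 3")
# answer = call_model(prompt)   # hypothetical call to whichever long-context model you use
print(len(prompt), "characters of context")
```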

There are many facets to this test-time approach, but I'm going to summarize heavily. The key insight is to use test-time fine-tuning. When the model sees three examples like those on the left, that isn't enough to teach it the way to solve the fourth one. It's too minuscule a signal amongst all its many parameters.

But one of the things that Jack Cole and co did is augment these three examples with many, many synthetic examples that mimic the style. They then fine-tune the model on those augmented examples. I think of it as prioritizing, as humans do, the thing right in front of its face.
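Here is a schematic sketch of that augment-then-fine-tune loop; the tiny "model" below just memorises rather than doing real gradient updates, and the augmentations are simplified, but the per-task specialisation shape is the idea.

```python
# Schematic test-time fine-tuning: augment the few demos, briefly specialise a copy
# of the model on them, then predict. TinyModel is a toy memoriser standing in for a
# real small LLM that would take actual gradient steps.
import copy

class TinyModel:
    """Stand-in for a small LLM; here it just memorises (real systems fine-tune weights)."""
    def __init__(self):
        self.memory = {}
    def training_step(self, inp, out):
        self.memory[str(inp)] = out
    def predict(self, inp):
        return self.memory.get(str(inp), None)

def augment(pair):
    """Label-preserving variants of one demo: here just a 90-degree rotation and a flip."""
    inp, out = pair
    rot = lambda g: [list(r) for r in zip(*g[::-1])]
    return [(inp, out), (rot(inp), rot(out)), (inp[::-1], out[::-1])]

def test_time_finetune(base_model, demo_pairs, steps=5):
    model = copy.deepcopy(base_model)              # specialise a fresh copy per task
    synthetic = [aug for pair in demo_pairs for aug in augment(pair)]
    for _ in range(steps):
        for inp, out in synthetic:
            model.training_step(inp, out)          # a real system would do gradient steps here
    return model

demos = [([[1, 0], [0, 1]], [[8, 0], [0, 8]])]
specialised = test_time_finetune(TinyModel(), demos)
print(specialised.predict([[1, 0], [0, 1]]))
```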

You can almost think of it like a language model concentrating. Its parameters are adjusted to focus on the task at hand, a bit like a human getting into a flow state. By the way, they also used GPT-4 to generate many of the synthetic riddles to train their system. Here's how Francois Chollet describes the approach.

Most of the time when you're using an LLM, it's just doing static inference. The model is frozen and you're just prompting it and then you're getting an answer. So the model is not actually learning anything on the fly. Its state is not adapting to the task at hand. And what Jack Cole is actually doing is that for every test problem, he's on the fly, he's fine-tuning a version of the LLM for that task.

And that's really what's unlocking performance. If you don't do that, you get like 1%, 2%. So basically something completely negligible. And if you do test time fine-tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers. So I think what he's doing is trying to address one of the key limitations of LLMs today, which is the lack of active inference.

It's actually adding active inference to LLMs and that's working extremely well, actually. So that's fascinating to me. In short, even on the Achilles heel of large language models, abstract reasoning, Jack Cole says this, "What is clear from our work and from others is that the upper limits of the capabilities of LLMs, even small ones, have not yet been discovered." By the way, his model is just 240 million parameters.

"There is clear evidence that more generally capable models are possible." Again, sticking with the program metaphor, as Francois Chollet says, "This is like doing program synthesis." Another definition we can use is reasoning is the ability to, when you're faced with a puzzle, given that you don't have already a program in memory to solve it, you must synthesize on the fly a new program based on bits of pieces of existing programs that you have.

You have to do on-the-fly program synthesis. And it's actually dramatically harder than just fetching the right memorized program and reapplying it. Now, at this point in the video, as we get to the fourth approach, I'm going to make a confession. I have about 20 tabs left on my screen going over other relevant papers on how to improve LLMs, but I am beginning to worry that this video might be getting a little bit too long.

So for the last few approaches, I'm going to be much more brief. I hope I'm still conveying that central message, though: LLMs currently suck at abstract reasoning, but that need not be a death sentence. Nothing in the literature indicates that AGI is at all imminent, but neither is AI all hype.

Here is a paper from the arch LLM skeptic Professor Rao, whom I interviewed almost a year ago. It's a position paper from this week: "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks." We had a great discussion almost a year ago, which I'll hopefully get to in a different video, but the summary of this paper is this.

Earlier work by Professor Rao and co had shown that even models like GPT-4 can't come up with coherent plans. They fail in this domain of blocks world. Essentially blocks world is like some of the other reasoning challenges you've seen today in that you have to come up with a coherent plan, unstacking blocks and restacking them to meet the required objective.

Definitely not a memorization test, as the MMLU can sometimes be. Indeed, if you throw off the model and use mysterious words instead of household objects, the models perform even worse. Zero-shot, GPT-4 gets one out of 600 challenges. However, this fourth approach asks: why do we have to use LLMs alone?

Why can't we use them with traditional symbolic systems? Maybe that combination of neural networks and traditional symbolic, hard-coded, programmatic systems is better than either alone. Professor Rao is a friend of Yann LeCun, a famed LLM skeptic, but the paper that he led said this: there is unwarranted pessimism about the roles LLMs can play in planning/reasoning tasks.

The key insight is that LLMs can act as idea generators. Those grounded symbolic systems can then check those plans. LLMs as the ideas man, with symbolic systems as the kind of accountants. LLMs are great at guessing candidate plans to solve blocks world challenges. And those ideas, or you could say retrieved programs, aren't all bad.

Even after three or four rounds of feedback from the symbolic system, 50% of the final plan retains elements of the initial large language model plan. With that feedback from the symbolic system, you back-prompt the LLM and it comes up with, hopefully, a better plan. Not even hopefully, though: the results, which come right after the sketch below, are clear.
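Here is a schematic sketch of that generate, verify, back-prompt loop; the proposal and verification functions are hypothetical stand-ins rather than anything from the paper.

```python
# Schematic LLM-Modulo loop: the LLM guesses, a sound symbolic checker verifies, and the
# critique is folded back into the next prompt. Both functions are hypothetical stand-ins.
def llm_propose_plan(task: str, feedback: str) -> list[str]:
    # Placeholder for a model call that returns a candidate sequence of actions.
    return ["unstack(B, A)", "put-down(B)", "stack(A, B)"]

def symbolic_verify(plan: list[str], task: str) -> tuple[bool, str]:
    # Placeholder for a plan validator that checks preconditions and goal satisfaction.
    ok = bool(plan) and plan[-1].startswith("stack")
    return ok, "" if ok else "goal not reached by final action"

task, feedback, valid, plan = "blocksworld: get A onto B", "", False, []
for attempt in range(10):                        # bounded back-prompting loop
    plan = llm_propose_plan(task, feedback)
    valid, critique = symbolic_verify(plan, task)
    if valid:
        break
    feedback = f"your last plan failed: {critique}"   # back-prompt with the critique
print(plan if valid else "no valid plan found")
```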

GPT-4 can score 82% with this approach. It still struggles with mysterious languages, but can solve five of them. This all reminded me of AlphaGeometry, which I have done a separate video on, so I won't focus on it today, but it's that same combination of a neural network and a symbolic system.

For geometry problems specifically, in the International Math Olympiad, it scored at almost gold-medal level. Those are incredibly hard reasoning challenges. Very quickly now, the fifth approach. Instead of calling a separate system, how about jointly training on its knowledge? For time purposes, here is a very quick summary. They trained a separate neural network, not a language model, in this case a graph neural network.

It learnt specialized algorithms. Then they embedded that fixed, optimized know-how and had a language model train with access to those embeddings. In other words, a language model fluent in the language of both text and algorithms, the kinds of programs you might need, for example, for sorting. A rough sketch of what that pairing might look like follows below.
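In this minimal PyTorch sketch, the module names, sizes, and wiring are my own illustrative assumptions, not the architecture from the paper: a small text model simply cross-attends to embeddings produced by a frozen, separately trained graph network.

```python
# Hypothetical sketch of the "LLM + frozen algorithmic GNN" idea: text tokens
# cross-attend to embeddings from a graph network pre-trained on algorithmic tasks.
import torch
import torch.nn as nn

class FrozenAlgorithmicGNN(nn.Module):
    """Stand-in for a GNN pre-trained on algorithm execution traces (e.g. sorting)."""
    def __init__(self, node_dim=64, hidden=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(node_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden))
    def forward(self, node_features):          # (num_nodes, node_dim)
        return self.encode(node_features)      # (num_nodes, hidden)

class TextWithAlgorithmicMemory(nn.Module):
    """A toy transformer block whose tokens cross-attend to the frozen GNN embeddings."""
    def __init__(self, vocab=1000, d_model=128, gnn_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, 4, kdim=gnn_dim,
                                                vdim=gnn_dim, batch_first=True)
        self.out = nn.Linear(d_model, vocab)
    def forward(self, tokens, gnn_embeddings):
        x = self.embed(tokens)                               # (B, T, d_model)
        x, _ = self.self_attn(x, x, x)
        mem = gnn_embeddings.unsqueeze(0).expand(x.size(0), -1, -1)
        x, _ = self.cross_attn(x, mem, mem)                  # text queries algorithmic memory
        return self.out(x)

# The GNN is trained first on algorithmic tasks, then frozen; only the text side learns.
gnn = FrozenAlgorithmicGNN()
for p in gnn.parameters():
    p.requires_grad = False
model = TextWithAlgorithmicMemory()
logits = model(torch.randint(0, 1000, (2, 16)), gnn(torch.randn(10, 64)))
print(logits.shape)
```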

Sixth, and finally for this video, there's tacit data. So much of what humans do and how humans reason is not written down. Let's hear from Terence Tao, arguably the smartest man on the planet. He said, "So much knowledge is somehow trapped in the head of individual mathematicians, and only a tiny fraction is made explicit. A lot of the intuition of mathematicians is not captured in the printed papers in journals, but in conversations among mathematicians, in lectures, and in the way we advise students.

People only publish the success stories. The data that are really precious are from when someone tries something and it doesn't quite work, but they know how to fix it. But they only publish the successful thing, not the process." And all of this, he says, simultaneously points to a dramatic way to improve AI, but also why we shouldn't expect an intelligence explosion imminently.

Training on that tacit data would unlock notable progress, but human mathematicians, he says, would just, in his view, move to a higher type of mathematics. As we speak, OpenAI, and I'm sure many others, are trying to make that tacit knowledge as explicit as possible. I'm sure hundreds of PhDs are writing down their methodologies as they solve problems.

Millions, if not billions, of hours of YouTube video are being ingested in the hopes that AI models pick up some of that implicit reasoning. While you may see this as the most promising approach of the six I've mentioned so far, it wouldn't yield immediate explosive results. It would be reliant on us and other human experts to write down our reasoning with fidelity.

Less a remorseless, faceless shoggoth solving the universe, and more a student imitating its teachers. Of course, you will have likely seen the somewhat ambiguous clip from Mira Murati, the CTO of OpenAI, saying they don't have any giant breakthrough behind the scenes. Inside the labs, we have these capable models, and, you know, they're not that far ahead from what the public has access to for free.

And that's a completely different trajectory for bringing technology into the world than what we've seen historically. But as I've hopefully shown you today, it doesn't have to be all or nothing: AGI imminently, or all hype. As even François Chollet says, it could be a combination of approaches that solves, for example, ARC.

That indeed might be the path to AGI. The people who are going to be winning the ARC competition, and who are going to be making the most progress towards near-term AGI, are going to be those who manage to merge the deep learning paradigm and the discrete program search paradigm in one elegant way.

So much more to get to that I couldn't get to today, but I hope this video has helped you to navigate the current AI landscape. As ever, the world is more complex than it seems. Thank you so much, as ever, for watching and have a wonderful day.