Back to Index

What's Left Before AGI? PaLM-E, 'GPT 4' and Multi-Modality


Transcript

PaLM-E was released less than a week ago, and for some people it may already be old news. Sure, it can understand and manipulate language, images and even the physical world. The E at the end of PaLM-E, by the way, stands for embodied. But soon, apparently, we're going to get the rebranded GPT-4, which many people think will surely do better and be publicly accessible.

But the multimodal advancements released just this week left me with a question: what tasks are left before we can call a model artificial general intelligence, or AGI, something at or beyond human-level intelligence? I didn't want hype or get-rich-quick schemes. I just wanted clear research about what exactly comes before AGI. Let's start with this four-day-old statement from Anthropic, a four-billion-dollar startup founded by people who left OpenAI over safety concerns.

They outlined that in 2019 it seemed possible that multimodality (of the kind PaLM-E now shows), logical reasoning, speed of learning, transfer learning across tasks, and long-term memory might be walls that would slow or halt the progress of AI.

In the years since, several of these walls, such as multimodality and logical reasoning, have fallen. What this means is that the different modes of PaLM-E and Microsoft's new Visual ChatGPT, text, image, video, aren't just cool tricks. They are major milestones. PaLM-E can look at an image and answer questions about what it shows.

It can also predict what will happen next. Check out this robot that's about to fall down. That's just an image, but ask PaLM-E what the robot will do next and it says fall. It knows what's going to happen just from an image. It can also read faces and answer natural-language questions about them.

Check out Kobe Bryant over here. It recognises him from an image, and you can ask questions about his career. This example at the bottom I think is especially impressive. PaLM-E is actually doing the math from this hastily sketched chalkboard image, solving those classic math problems that we all got at school, just from an image.

Now think about this. PaLM-E is an advancement on Gato, which at the time the lead scientist at DeepMind, Nando de Freitas, called game over in the search for AGI. Someone had written an article fearing that we would never achieve AGI, and he said the game is over: all we need now are bigger models, more compute efficiency, smarter memory, more modalities and so on.

And that was Gato, not PaLM-E. Of course, you may have noticed that neither he nor I have fully defined AGI. That's because there are multiple definitions, none of which satisfies everyone. But a broad one for our purposes is that AGI is a model that is at or above the human level on a majority of economic tasks currently done by humans.

You can read here some of the tests of what might constitute AGI. But that's enough about definitions and multimodality; time to get to my central question: what is left before AGI? Well, what about learning and reasoning? This piece from Wired magazine in late 2019 argued that robust machine reading was a distant prospect.

It gives the challenge of a children's book that has a cute and quite puzzling series of interactions. It then states that a good reading system would be able to answer questions like these, and gives some natural questions about the passage. I will say these questions do require a degree of logic and common-sense reasoning about the world.

So you can guess what I did. I put them straight into Bing. We're only three and a half years on from this article. And look what happened. I pasted in the exact questions from the article. And as you might have guessed Bing got them all right pretty much instantly.

So clearly my quest to find the tasks that are left before AGI would have to continue. Just quickly, before we move on from Bing and Microsoft products: what about GPT-4 specifically? How will it be different from Bing, or is it already inside Bing, as many people think? The much-quoted CTO of Microsoft Germany actually didn't confirm that GPT-4 will be multimodal.

He only said that at the Microsoft event this week we will have multimodal models. That's different from saying GPT-4 will be multimodal. I have a video on the eight more certain upgrades inside GPT-4, so do check that out. But even with those upgrades inside GPT-4, the key question remains: if such models can already read so well, what exactly is left before AGI?

So I dove deep into the literature and found this graph from the original PaLM model, which PaLM-E is based on. Look to the right. These are a bunch of tasks on which the average human rater, at least those who work for Amazon Mechanical Turk, could beat PaLM in 2022.

And remember, these were just the average raters, not the best. The caption doesn't specify what the tasks are, so I looked deep in the appendix and found them. Here is the list of tasks that humans did far better on than PaLM, and here is that appendix; it doesn't make much sense when you first look at it.

So what I did was go into the BIG-bench dataset and find each of these exact tasks. Remember, these are the tasks that average human raters do much better at than PaLM. I wanted to know exactly what they entailed. Looking at the names, they all seem a bit weird, and you're going to be surprised at what some of them are.

Take the first one: MNIST ASCII, basically representing and recognising ASCII numerals. Hmm. Now, I can confirm that Bing is still pretty bad at this, both with numerals and with letters, but I'm just not sure how great an accomplishment for humanity this one is.
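To make the task concrete, here is a rough Python sketch of the kind of prompt involved. The hand-drawn digit and wording below are my own illustration, not actual BIG-bench mnist_ascii items.

```python
# A toy illustration of an "ASCII numeral" prompt: the model sees a digit
# rendered in characters and must name it. The drawing is hand-made for
# illustration; the real BIG-bench task uses down-sampled MNIST images.
ascii_digit = """\
  ####
      #
   ###
      #
  ####"""

prompt = (
    "Which digit from 0 to 9 is drawn below?\n"
    f"{ascii_digit}\n"
    "Answer with a single digit."
)
print(prompt)  # a human reads this as a 3 instantly; Bing, per my tests, still struggles
```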

So I went to the next one, which was sequences. As you can see below, this is keeping track of time in a series of events. This is an interesting one, perhaps linked to GPT models' struggles with this kind of reasoning. I tried the same question multiple times with Bing and ChatGPT, and only once out of about a dozen attempts did it get the question right.

You can pause the video and try it yourself, but essentially it's only between 4 and 5 that they could have been at the swimming pool. You can see here the kind of convoluted logic that Bing goes into. So, really interesting: this is a task that the models can't yet do.
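The exact benchmark wording isn't reproduced here, so take the snippet below as a toy reconstruction of the underlying logic: given a day of known appointments, the only free gap is the window in which the person could have been at the pool. The schedule is made up for illustration.

```python
# Toy reconstruction of the "sequences"-style reasoning; the schedule is
# invented, but the logic (find the only free gap) mirrors the task.
busy = [          # (start_hour, end_hour), 24-hour clock
    (9, 12),      # e.g. seen at work
    (12, 13),     # e.g. at lunch with a friend
    (13, 16),     # e.g. in meetings
    (17, 20),     # e.g. seen at a restaurant
]

day_start, day_end = 9, 20
free, current = [], day_start
for start, end in sorted(busy):
    if start > current:
        free.append((current, start))
    current = max(current, end)
if current < day_end:
    free.append((current, day_end))

print("Possible swimming-pool window(s):", free)  # -> [(16, 17)], i.e. between 4 and 5
```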

Again, I was expecting something a bit more complex, a bit more profound, but let's move on to the next one.

Simple text editing of characters, words and sentences. That seemed strange: what does text editing mean, and can't Bing do that? I gave Bing many of these text-editing challenges and it did indeed fail most of them. It was able to replace the letter T with the letter P, so it did okay with characters, but it really doesn't seem to know where in a sentence a given word sits.
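For contrast, here is a quick sketch of how trivially these edits fall out of ordinary string handling. The sentence and edits are my own examples, not the benchmark's.

```python
# The kinds of edits the text-editing task asks for, done deterministically.
sentence = "the quick brown fox jumps over the lazy dog"

# Character-level: replace every 't' with 'p' (the sort of edit Bing managed).
char_edit = sentence.replace("t", "p")

# Word-level: replace the 4th word, which requires knowing word positions,
# the part the chat models seemed to trip over.
words = sentence.split()
words[3] = "cat"
word_edit = " ".join(words)

print(char_edit)  # "phe quick brown fox jumps over phe lazy dog"
print(word_edit)  # "the quick brown cat jumps over the lazy dog"
```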

You can let me know in the comments what you think of these kinds of errors and why Bing and ChatGPT keep making them. The next task that humans did much better on was hyperbaton, or intuitive adjective order. It's questions like: which sentence has the correct adjective order? "An old-fashioned circular leather exercise car" sounds okay, versus "a circular exercise old-fashioned leather car".
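The convention being tested is the usual English ordering of adjectives: opinion, size, age, shape, colour, origin, material, purpose. Here is a minimal sketch of a checker for that ordering; the category labels are hand-assigned for this one example, not drawn from the benchmark.

```python
# Minimal check of conventional English adjective order.
ORDER = ["opinion", "size", "age", "shape", "colour", "origin", "material", "purpose"]
CATEGORY = {          # hand-assigned for this single example
    "old-fashioned": "age",
    "circular": "shape",
    "leather": "material",
    "exercise": "purpose",
}

def is_conventional(adjectives):
    ranks = [ORDER.index(CATEGORY[a]) for a in adjectives]
    return ranks == sorted(ranks)

print(is_conventional(["old-fashioned", "circular", "leather", "exercise"]))  # True
print(is_conventional(["circular", "exercise", "old-fashioned", "leather"]))  # False
```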

What I found interesting, though, is that even the current version of ChatGPT could now get this right. On other tests it gets it a little off, but I think we might as well tick this one off the list. The final task that I wanted to focus on from that PaLM appendix is a little more worrying.

It's Triple H. Not the wrestler, the need to be helpful, honest and harmless. It's kind of worrying that that's the thing it's currently failing at. I think this is closely linked to hallucination and the fact that we cannot fully control the outputs of large language models. At this point if you've learnt anything please do let me know in the comments or leave a like it really does encourage me to do more such videos.

All of the papers and pages in this video will be linked in the description. Anyway, hallucinations brought me back to the Anthropic safety statement and their top priority of mechanistic interpretability, which is a fancy way of saying understanding what exactly is going on inside the machine. One of the stated challenges is to recognise whether a model is deceptively aligned, playing along with even the tests designed to tempt a system into revealing its own misalignment.

This is very much linked to the Triple H failures we saw a moment ago. Fine, so honesty is still a big challenge, but I wanted to know what single, significant and quantifiable task AI was not yet close to achieving. Some thought that task might be storing long-term memories, as it says here, but I knew that milestone had already been passed.

This paper from January described augmenting PaLM with read-write memory so that it can remember everything and process arbitrarily long inputs. Just imagine a Bing Chat equivalent knowing every email at your company, every customer record, sale, invoice, the minutes of every meeting and so on. The paper goes on to describe a universal Turing machine, which, to the best of my understanding, is a Turing machine that can simulate any other.

It is a machine that can mimic any computation, a universal computer if you will. Indeed, the authors state in the conclusion of this paper that the results show that large language models are already computationally universal as they exist currently, provided only that they have access to an unbounded external memory.
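As a rough sketch of the idea, picture the model running in a loop and issuing read and write commands against an unbounded external store. The `call_llm` function and the WRITE/READ/HALT command format below are hypothetical stand-ins of my own; the paper itself uses a specific prompt scheme that this does not reproduce.

```python
# Minimal sketch of a read-write memory loop around a language model.
# `call_llm` and the command format are hypothetical stand-ins.
def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API returning the model's next command."""
    raise NotImplementedError

def run_with_memory(task: str, max_steps: int = 100) -> str:
    memory: dict[str, str] = {}        # unbounded external read-write memory
    observation = task
    for _ in range(max_steps):
        command = call_llm(observation).strip()
        if command.startswith("WRITE"):
            _, key, value = command.split(maxsplit=2)
            memory[key] = value         # the model stores intermediate state
            observation = f"stored {key}"
        elif command.startswith("READ"):
            _, key = command.split(maxsplit=1)
            observation = memory.get(key, "<empty>")
        elif command.startswith("HALT"):
            return command.removeprefix("HALT").strip()
    return "<step limit reached>"
```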

What I found fascinating was that Anthropic are so concerned by this accelerating progress that they don't publish capabilities research, because we do not wish to advance the rate of AI capabilities. And I must say that Anthropic do know a thing or two about language models, having delayed the public deployment of Claude, which you can see on screen, until it was no longer state of the art.

They had this model earlier but delayed the deployment. Claude, by the way, is much better than ChatGPT at writing jokes. Moving on to data, though: in my video on GPT-5, which I do recommend you check out, I talk about how important data is to the improvement of models. One graph I left out of that video is this one.

That graph, though, suggests there may be some limits to this straight-line improvement in the performance of models. What you're seeing on screen is a paper released in ancient times, which is to say two weeks ago, on Meta's new LLaMA model. Essentially, it shows performance improving as the model is trained on more tokens.

By tokens, think scraped web text. But notice how the gains level off after a certain point, so not every graph you're going to see today is exponential. The Y axis is different for each task, and some of the questions it still struggles with are interesting. Take SIQA, which is social interaction question answering.

It peaks out at about 50-52%. That's on questions like these, where most humans could easily understand what's going on and find the right answer. Models really struggle with that, even when they're given trillions of tokens. Or what about Natural Questions, where the model is still struggling about a third of the time?

So I dug deep into the literature to find exactly who proposed Natural Questions as a test and found this document. It's a paper published by Google in 2019, and it gives lots of examples of natural questions.

Essentially, they're human-like questions where it's not always clear exactly what we're referring to. Now, you could say that's on us to be clearer with our questions, but let's see how Bing does with some of these. I asked: "The guy who plays Mandalorian also did what drugs TV show?" I deliberately phrased it in a very natural, vague way.

Interestingly, it gets it wrong in the first sentence but then gets it right in the second. I tried dozens of these questions. You can see another one here: "Author of L-O-T-R surname origin." That's a very naturally phrased question. It surmised that I meant Tolkien, the author of The Lord of the Rings, and that I wanted the origin of his surname.

And it gave it to me. Another example was: "Big Ben City first bomb landed WW2." It knew I meant London and while it didn't give me the first bomb that landed in London during World War 2, it gave me a bomb that was named Big Ben. So not bad.

Overall, I found it was about 50/50, just like the Meta LLaMA model, maybe a little better. Going back to the graph, we can see that data helps a lot, but it isn't everything. However, Anthropic's theory is that compute can be a rough proxy for further progress, and this was a somewhat eye-opening passage.

We know that the capability jump from GPT-2 to GPT-3 resulted mostly from about a 250-fold increase in compute. We would guess that another roughly 50-fold increase separates the original GPT-3 model and state-of-the-art models in 2023. Think Claude or Bing. Over the next five years we might expect around a 1,000-fold increase in the computation used to train the largest models, based on trends in compute cost and spending.

If the scaling laws hold this would result in a capability jump that is significantly larger than the jump from GPT-2 to GPT-3 or GPT-3 to Claude. And it ends with: "At Anthropic we're deeply familiar with the capabilities of these systems. And a jump that is this much larger feels to many of us like it could result in human level performance across most tasks." That's AGI.
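The arithmetic behind that passage is worth making explicit. Here is a quick back-of-envelope check, using only the multipliers quoted above (they are the cited figures, not my own estimates), of why the projected jump is larger on a log scale than either previous one.

```python
# Back-of-envelope check of the quoted compute jumps.
import math

jumps = {
    "GPT-2 -> GPT-3": 250,           # ~250x more training compute
    "GPT-3 -> 2023 SOTA": 50,        # guessed ~50x more
    "2023 SOTA -> +5 years": 1000,   # projected ~1,000x more
}

for name, factor in jumps.items():
    print(f"{name}: {factor}x (~{math.log10(factor):.1f} orders of magnitude)")

# The projected ~3 orders of magnitude exceed the ~2.4 that separated
# GPT-2 from GPT-3, which is the sense in which the next jump is "larger".
print(f"Total implied since GPT-2: ~{250 * 50 * 1000:,}x")  # 12,500,000x
```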

And five years is not a long timeline. This made me think of Sam Altman's AGI statement, where he said: "At some point, it may be important to get independent review before starting to train future systems, and for the most advanced efforts to agree to limit the rate of growth of compute used for creating new models."

Like a compute truce if you will. Even Sam Altman thinks we might need to slow down a bit. My question is though, would Microsoft or Tesla or Amazon agree with this truce and go along with it? Maybe, maybe not. But remember that 5 year timeline that Anthropic laid out?

That chimes with this assessment from the alignment startup Conjecture: "AGI is happening soon. Significant probability of it happening in less than 5 years." And it gives plenty of examples, many of which I have already covered. Others, of course, give much more distant timelines, and as we've seen, AGI is not a well-defined concept.

In fact, it's so loosely defined that some people actually argue it's already here. This article, for example, says 2022 was the year AGI arrived; just don't call it that. This graph, originally from Wait But Why, is quite funny, but it points to how short a gap there might be between being better than the average human and being better than Einstein.

I don't necessarily agree with this but it does remind me of another graph I saw recently. It was this one on the number of academic papers being published on machine learning and AI in a paper about exponential knowledge growth. The link to this paper like all the others is in the description.

And it does point to how hard it will be for me and others just to keep up with the latest papers on AI advancements. But the answer here turns out to be a bit more complicated. At this point you may have noticed that I haven't given a definitive answer to my original question, which was to find the task that is left before AGI.

I do think there will be tasks such as physically plumbing a house that even an AGI, a generally intelligent entity, couldn't immediately accomplish simply because it doesn't have the tools. It might be smarter than a human but can't use a hammer. But my other theory to end on is that before AGI there will be a deeper, more complex and more subjective debate.

Take the benchmarks on reading comprehension. This graph shows how improvement is being made. But I have aced most reading comprehension tests such as the GRE so why is the highest human rater labelled at 80%? Could it be that progress stalls when we get to the outer edge of ability?

When test examples of sufficient quality get so rare in the dataset that language models simply cannot perform well on them? Take this difficult LSAT example. I won't read it out because by definition it's quite long and convoluted. And yes, Bing fails it. Is this the near term future? Where only obscure feats of logic, deeply subjective analyses of difficult texts and niche areas of mathematics and science remain out of reach?

Where essentially most people perceive AGI to have already occurred, but for a few outlying tests? Indeed, is the ultimate CAPTCHA the ability to deliver a laugh-out-loud joke, or to deeply understand the plight of Oliver Twist? Anyway, thank you for watching to the end of the video. I'm going to leave you with some bleeding-edge text-to-image generations from Midjourney version 5.

Whatever happens next with large language models, this is the news story of the century in my opinion, and I do look forward to covering it.