What's Left Before AGI? PaLM-E, 'GPT-4' and Multi-Modality
00:00:00.000 |
PaLM-E was released less than a week ago, and for some people it may already be old news. 00:00:06.180 |
Sure, it can understand and manipulate language, images and even the physical world. 00:00:11.680 |
The E at the end of PaLM-E, by the way, stands for embodied. 00:00:15.080 |
But soon apparently we're going to get the rebranded GPT-4 00:00:19.620 |
which many people think surely will do better and be publicly accessible. 00:00:23.020 |
But the multimodal advancements released just this week left me with a question. 00:00:28.340 |
What tasks are left before we call a model artificial general intelligence or AGI? 00:00:38.840 |
I just wanted clear research about what exactly comes before AGI. 00:00:43.240 |
Let's start with this four day old statement from Anthropic, 00:00:47.280 |
a four billion dollar startup founded by people who left OpenAI over safety concerns. 00:00:53.160 |
They outlined that in 2019 it seemed possible 00:00:57.460 |
that walls such as multi-modality (the kind PaLM-E has now demonstrated), logical reasoning, speed of learning, transfer learning across tasks and long-term memory might slow or halt the progress of AI. 00:01:09.500 |
In the years since several of these walls such as multi-modality and logical reasoning have fallen. 00:01:15.780 |
What this means is that the different modes of PaLM-E and Microsoft's new Visual ChatGPT, text, image, video, aren't just cool tricks. 00:01:24.460 |
They are major milestones. PaLM-E can look at an image and reason about what is happening in it, even anticipating what will happen next. 00:01:30.300 |
Check out this robot who's about to fall down. 00:01:32.780 |
That's just an image, but ask PaLM-E what the robot will do next and it says "fall". 00:01:39.180 |
It knows what's going to happen just from an image. 00:01:41.900 |
It can also read faces and answer natural language questions about them. 00:01:47.340 |
It recognises the person shown from an image, and you can ask questions about his career. 00:01:51.100 |
This example at the bottom I think is especially impressive. 00:01:54.140 |
PaLM-E is actually doing the math from this hastily sketched image. 00:01:59.080 |
It's solving those classic math problems that we all got at school just from an image. 00:02:07.800 |
This reminded me of when the lead scientist at DeepMind, Nando de Freitas, called it "game over" in the search for AGI. 00:02:15.640 |
Someone had written an article fearing that we would never achieve AGI and he said game over. 00:02:20.520 |
All we need now are bigger models, more compute efficiency, smarter memory, more modalities etc. 00:02:28.260 |
Of course you may have noticed that neither he nor I am completely defining AGI. 00:02:33.300 |
That's because there are multiple definitions. 00:02:38.340 |
But a broad one for our purposes is that AGI is a model that is at or above the human level on a majority of economic tasks currently done by humans. 00:02:48.580 |
You can read here some of the tests about what might constitute AGI. 00:02:52.180 |
But that's enough about definitions and multi-modality. 00:03:01.460 |
This piece from Wired Magazine in late 2019 argued that robust machine reading was a distant prospect. 00:03:09.660 |
It gives a challenge of a children's book that has a cute and quite puzzling series of interactions. 00:03:15.180 |
It then states that a good reading system would be able to answer questions like these. 00:03:20.220 |
And then give some natural questions about the passage. 00:03:23.260 |
I will say these questions do require a degree of logic and common sense reasoning about the world. 00:03:31.960 |
We're only three and a half years on from this article. 00:03:35.480 |
I pasted in the exact questions from the article. 00:03:38.800 |
And as you might have guessed Bing got them all right pretty much instantly. 00:03:43.400 |
So clearly my quest to find the tasks that are left before AGI would have to continue. 00:03:49.000 |
Just quickly before we move on from Bing and Microsoft products. 00:03:59.700 |
The much-quoted CTO of Microsoft Germany actually didn't confirm that GPT-4 will be multimodal, 00:04:06.440 |
only saying that at a Microsoft event this week "we will have multimodal models". 00:04:12.800 |
That's different from saying GPT-4 will be multimodal. 00:04:15.480 |
I have a video on the eight more certain upgrades inside GPT-4. 00:04:21.660 |
But even with those upgrades, the key question remains: if such models can already read and reason so well, what tasks are left? 00:04:31.640 |
So I dove deep into the literature and found this graph from the original PaLM paper. 00:04:39.840 |
It shows a bunch of tasks on which the average human rater, at least those working through Amazon's crowdsourcing platform, did far better than the model. 00:04:48.560 |
And remember these were just the average raters not the best. 00:04:52.380 |
The caption doesn't specify what the tasks are, so I looked deep in the appendix and found them. 00:04:58.160 |
And here is the list of tasks that humans did far better on than PaLM. 00:05:01.660 |
Here is that appendix and it doesn't make much sense when you initially look at it. 00:05:06.120 |
So what I did is I went into the big bench data set and found each of these exact tasks. 00:05:12.340 |
Remember, these are the tasks that the average human raters do much better at than PaLM. 00:05:20.100 |
Looking at the names, they all seem a bit weird, and you're going to be surprised at what some of them are. 00:05:28.140 |
The first is basically representing and recognising ASCII numerals. 00:05:32.900 |
Now I can indeed confirm that Bing is still pretty bad at recognising numerals drawn in ASCII. 00:05:41.260 |
But I'm just not sure how great an accomplishment for humanity this one is. 00:05:46.380 |
So I went to the next one which was sequences. 00:05:49.140 |
As you can see below this is keeping track of time in a series of events. 00:05:58.120 |
I tried the same question multiple times with Bing and ChatGPT and only once out of about 00:06:04.880 |
a dozen attempts did it get the question right. 00:06:06.880 |
You can pause the video and try it yourself but essentially it's only between 4 and 5 00:06:10.880 |
that he could have been at the swimming pool. 00:06:12.880 |
You can see here the kind of convoluted logic that Bing goes into. 00:06:18.880 |
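The reasoning this task demands can be done mechanically with interval arithmetic. The times and activities below are hypothetical stand-ins for the benchmark's actual puzzle, just a minimal sketch of how the one "free slot" falls out:

```python
# Hedged sketch of a "temporal sequences" style puzzle: given sightings of a
# person at known times, find when they could have been at the swimming pool.
# All times and places here are invented for illustration.

def free_slots(day, sightings):
    """Subtract the sighting intervals from the waking day to find the
    unaccounted-for hours."""
    slots, cursor = [], day[0]
    for start, end in sorted(sightings):
        if start > cursor:
            slots.append((cursor, start))  # a gap nobody saw them in
        cursor = max(cursor, end)
    if cursor < day[1]:
        slots.append((cursor, day[1]))
    return slots

# Example: awake 9:00-19:00, seen at the library 9-16 and at a cafe 17-19.
print(free_slots((9, 19), [(9, 16), (17, 19)]))  # [(16, 17)]
```

Only the 16:00 to 17:00 gap is unaccounted for, which is the kind of "only between 4 and 5" answer the model kept missing.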
Again, I was expecting something a bit more complex and profound, but let's move on to the next one. 00:06:31.140 |
Simple text editing of characters, words and sentences. 00:06:39.060 |
I gave Bing many of these text editing challenges and it did indeed fail most of them. 00:06:44.600 |
It was able to replace the letter T with the letter P, so it did okay with characters, but 00:06:50.740 |
it really doesn't seem to know which position a given word occupies in a sentence. 00:06:56.860 |
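To make the failure mode concrete, here is a hypothetical reconstruction of the kind of probes described, not the actual benchmark items: a character-level edit, which the models handle, and a word-position edit, which they reportedly fumble.

```python
# Hypothetical examples of the two kinds of text-editing task described.
sentence = "the cat sat on the mat"

# Character-level edit: replace every letter t with p.
print(sentence.replace("t", "p"))  # phe cap sap on phe map

# Word-position edit: replace the third word. This requires knowing
# which word is in which position, which is where models stumble.
words = sentence.split()
words[2] = "stood"
print(" ".join(words))  # the cat stood on the mat
```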
What do you think of these kinds of errors, and why do Bing and ChatGPT keep making them? 00:07:02.240 |
The next task that humans did much better on was hyperbaton, or intuitive adjective order. 00:07:09.240 |
It's questions like which sentence has the correct adjective order? 00:07:13.380 |
"An old-fashioned circular leather exercise car" sounds okay, whereas "a circular exercise old-fashioned leather car" does not. 00:07:21.300 |
What I found interesting, though, is that even the current version of ChatGPT could now get this right. 00:07:27.920 |
On other tests it gets it a little off, but I think we might as well tick this one off the list. 00:07:33.140 |
The final task that I wanted to focus on in that PaLM appendix is a little more worrying. 00:07:39.140 |
It's Triple H. Not the wrestler, the need to be helpful, honest and harmless. 00:07:44.820 |
It's kind of worrying that that's the thing it's currently failing at. 00:07:48.120 |
I think this is closely linked to hallucination and the fact that we cannot fully control what these models say. 00:07:56.820 |
At this point if you've learnt anything please do let me know in the comments or leave a 00:07:59.700 |
like it really does encourage me to do more such videos. 00:08:03.540 |
All of the papers and pages in this video will be linked in the description. 00:08:07.920 |
Anyway hallucinations brought me back to the anthropic safety statement and their top priority 00:08:13.560 |
of mechanistic interpretability which is a fancy way of saying understanding what exactly 00:08:20.440 |
is going on inside the machine and one of the stated challenges is to recognise whether 00:08:26.800 |
a model is deceptively aligned, playing along with even tests designed to tempt a system into revealing its misalignment. 00:08:37.100 |
This is very much linked to the Triple H failures we saw a moment ago. 00:08:40.940 |
Fine, so honesty is still a big challenge but I wanted to know what single significant 00:08:46.400 |
and quantifiable task AI was not close to yet achieving. 00:08:50.960 |
Some thought that that task might be storing long-term memories, as it says here, but I knew that this wall too was already falling. 00:08:59.920 |
This paper from January described augmenting PaLM with read-write memory so that it can 00:09:06.380 |
remember everything and process arbitrarily long inputs. 00:09:11.420 |
Just imagine a bing chat equivalent knowing every email at your company, every customer 00:09:16.740 |
record, sale, invoice, the minutes of every meeting etc. 00:09:20.880 |
The paper goes on to describe a universal Turing machine, which, to the best of my understanding, 00:09:26.760 |
is a machine that can mimic any computation. 00:09:31.300 |
Indeed the authors state in the conclusion of this paper that the results show that large 00:09:36.340 |
language models are already computationally universal as they exist currently provided 00:09:41.680 |
only that they have access to an unbounded external memory. 00:09:45.240 |
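As I understand the paper's argument, the language model itself stays fixed; only the external memory is unbounded. A rough sketch of the idea, with a hard-coded transition table standing in for the frozen model and a Python dict playing the unbounded tape (the specific little machine is my own invention, purely for illustration):

```python
# Sketch: a fixed finite controller plus unbounded read/write memory is
# enough to simulate a Turing machine. The transition table stands in for
# a frozen language model that only ever sees a bounded (state, symbol)
# "prompt" and emits (new state, symbol to write, head move).
from collections import defaultdict

def run_tm(transitions, tape_input, start="q0", halt="halt"):
    tape = defaultdict(lambda: "_", enumerate(tape_input))  # unbounded tape
    state, head = start, 0
    while state != halt:
        symbol = tape[head]                     # read from external memory
        state, write, move = transitions[(state, symbol)]
        tape[head] = write                      # write back to memory
        head += 1 if move == "R" else -1
    return "".join(tape[i] for i in sorted(tape) if tape[i] != "_")

# A trivial machine that inverts every bit, halting at the first blank.
flip = {("q0", "0"): ("q0", "1", "R"),
        ("q0", "1"): ("q0", "0", "R"),
        ("q0", "_"): ("halt", "_", "R")}
print(run_tm(flip, "0110"))  # 1001
```

The controller here is finite and fixed; all the unboundedness lives in the tape, which is the shape of the claim about memory-augmented language models.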
What I found fascinating was that Anthropic are so concerned by this accelerating progress 00:09:50.360 |
that they don't publish capabilities research, because "we do not wish to advance the rate of AI capabilities progress". 00:09:56.740 |
And I must say that Anthropic do know a thing or two about language models, having delayed 00:10:02.400 |
the public deployment of Claude, which you can see on screen, until it was no longer state of the art. 00:10:07.900 |
They had this model earlier but delayed the deployment. 00:10:11.040 |
Claude, by the way, is much better than ChatGPT at writing jokes. 00:10:17.040 |
In my video on GPT-5, which I do recommend you check out, I talk about how important data is. 00:10:25.440 |
One graph I left out of that video, though, suggests that there may be some limits to this straight-line 00:10:31.120 |
improvement in the performance of models. 00:10:33.600 |
What you're seeing on screen is a paper released in ancient times, which in AI terms means only very recently. 00:10:42.100 |
Essentially it shows performance improvements as models are trained on more tokens. 00:10:48.640 |
But notice how the gains level off after a certain point. 00:10:52.380 |
So not every graph you're going to see today is exponential. 00:10:59.580 |
And some of the questions it still struggles with are interesting. 00:11:03.200 |
Take SIQA, which is social interaction question answering, 00:11:13.180 |
where most humans could easily understand what's going on and find the right answer. 00:11:18.740 |
Models really struggle with that, even when they're given trillions of tokens. 00:11:22.220 |
Or what about natural questions, where the model is stuck at getting only about a third of them right? 00:11:27.680 |
And it's not even worth the effort to find the right answer. 00:11:30.680 |
So I dug deep into the literature to find exactly who proposed natural questions as a benchmark. 00:11:36.920 |
This is a paper published by Google in 2019 and it gives lots of examples of natural questions. 00:11:44.500 |
Essentially they're human-like questions where it's not always clear exactly what we're asking. 00:11:50.420 |
Now you could say that's on us to be clearer with our questions. 00:11:53.560 |
But let's see how Bing does with some of these. 00:11:57.660 |
"The guy who plays Mandalorian also did What Drugs TV show?" 00:12:01.720 |
I deliberately phrased it in a very natural, vague way. 00:12:05.660 |
Interestingly, it gets it wrong in the first sentence but then gets it right later in its answer. 00:12:20.500 |
It surmised that I meant Tolkien, the author of Lord of the Rings, and worked out what I wanted the origin of. 00:12:33.160 |
It knew I meant London and while it didn't give me the first bomb that landed in London 00:12:38.180 |
during World War 2, it gave me a bomb that was named Big Ben. 00:12:43.180 |
Overall I found it was about 50/50, just like Meta's LLaMA model. 00:12:48.540 |
Going back to the graph we can see that data does help a lot but it isn't everything. 00:13:03.800 |
We know that the capability jump from GPT-2 to GPT-3 resulted mostly from about a 250-times increase in compute. 00:13:14.700 |
We would guess that another 50-times increase separates the original GPT-3 model and state-of-the-art models today. 00:13:26.600 |
Over the next 5 years we might expect around a 1000 time increase in the computation used 00:13:34.600 |
to train the largest models based on trends in compute cost and spending. 00:13:39.280 |
If the scaling laws hold this would result in a capability jump that is significantly 00:13:45.480 |
larger than the jump from GPT-2 to GPT-3 or GPT-3 to Claude. 00:13:52.600 |
"At Anthropic we're deeply familiar with the capabilities of these systems. 00:13:56.580 |
And a jump that is this much larger feels to many of us like it could result in human-level performance across most tasks." 00:14:10.040 |
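The quoted multipliers imply a steep annual growth rate. A quick back-of-the-envelope check (the 250x and 1000x figures come from the statement; the derived numbers are my own arithmetic):

```python
# Arithmetic behind the quoted compute figures.
gpt2_to_gpt3 = 250   # compute multiplier, per the statement
next_5_years = 1000  # projected multiplier over ~5 years, per the statement

# 1000x over 5 years compounds to roughly 4x per year.
annual = next_5_years ** (1 / 5)
print(f"{annual:.1f}x per year")    # 4.0x per year

# The projected jump is 4x the size of the GPT-2 -> GPT-3 jump in raw compute.
print(next_5_years / gpt2_to_gpt3)  # 4.0
```

That factor-of-four comparison is in raw compute; the quote's point is that, under the scaling laws, the resulting capability jump would be larger still.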
This made me think of Sam Altman's AGI statement where he said: 00:14:14.200 |
"At some point it may be important to get independent review before starting to train 00:14:19.260 |
future systems and for the most advanced efforts to agree to limit the rate of growth of compute 00:14:27.560 |
used for creating new models." 00:14:31.480 |
Even Sam Altman thinks we might need to slow down a bit. 00:14:34.760 |
My question is, though: would Microsoft or Tesla or Amazon agree with this truce and stick to it? 00:14:43.380 |
But remember that 5 year timeline that Anthropic laid out? 00:14:46.700 |
That chimes with this assessment from the alignment startup Conjecture, which sees a 00:14:53.100 |
"significant probability of it happening in less than 5 years", "it" being AGI. 00:14:56.540 |
And it gives plenty of examples, many of which I have already covered. 00:15:00.760 |
Others of course give much more distant timelines, and as we've seen, AGI is not a well-defined term. 00:15:07.100 |
In fact it's so ill-defined that some people actually argue that it's already here. 00:15:12.600 |
This article for example says "2022 was the year AGI arrived." 00:15:20.900 |
It's quite funny, but it points to how short a gap there might be between being better than 00:15:26.520 |
the average human and being better than Einstein. 00:15:29.900 |
I don't necessarily agree with this but it does remind me of another graph I saw recently. 00:15:35.220 |
It was this one on the number of academic papers being published on machine learning 00:15:40.100 |
and AI in a paper about exponential knowledge growth. 00:15:44.080 |
The link to this paper like all the others is in the description. 00:15:47.440 |
And it does point to how hard it will be for me and others just to keep up with the latest developments. 00:15:56.500 |
At this point you may have noticed that I haven't given a definitive answer to my 00:16:00.980 |
original question which was to find the task that is left before AGI. 00:16:06.180 |
I do think there will be tasks, such as physically plumbing a house, that even an AGI, a generally 00:16:12.200 |
intelligent entity, couldn't immediately accomplish, simply because it doesn't have a body. 00:16:17.640 |
It might be smarter than a human but unable to use a hammer. 00:16:20.980 |
But my other theory to end on is that before AGI there will be a deeper, more complex challenge. 00:16:28.840 |
Take the benchmarks on reading comprehension. 00:16:31.840 |
This graph shows how improvement is being made. 00:16:34.620 |
But I have aced most reading comprehension tests, such as the GRE, so why is the highest model score still short of the human ceiling? 00:16:44.340 |
Could it be that progress stalls when we get to the outer edge of ability? 00:16:50.780 |
When test examples of sufficient quality get so rare in the dataset that language models struggle to learn from them? 00:17:00.860 |
I won't read it out because by definition it's quite long and convoluted. 00:17:11.100 |
Where only obscure feats of logic, deeply subjective analyses of difficult texts and 00:17:17.360 |
niche areas of mathematics and science remain out of reach? 00:17:21.020 |
Where essentially most people perceive AGI to have already occurred, but for a few outlying domains? 00:17:27.440 |
Indeed, is the ultimate CAPTCHA the ability to deliver a laugh-out-loud joke, or 00:17:33.940 |
deeply understand the plight of Oliver Twist? 00:17:37.380 |
Anyway thank you for watching to the end of the video. 00:17:40.680 |
I'm going to leave you with some bleeding-edge text-to-image generations from Midjourney. 00:17:46.420 |
Whatever happens next with large language models, this is the news story of the century 00:17:51.260 |
in my opinion and I do look forward to covering it. 00:17:56.420 |