What's Left Before AGI? PaLM-E, 'GPT 4' and Multi-Modality

00:00:00.000 | PaLM-E was released less than a week ago and for some people it may already be old news.
00:00:06.180 | Sure it can understand and manipulate language, images and even the physical world.
00:00:11.680 | The E at the end of PaLM-E by the way stands for embodied.
00:00:15.080 | But soon apparently we're going to get the rumored GPT-4
00:00:19.620 | which many people think surely will do better and be publicly accessible.
00:00:23.020 | But the multimodal advancements released just this week left me with a question.
00:00:28.340 | What tasks are left before we call a model artificial general intelligence or AGI?
00:00:34.440 | Something beyond human intelligence.
00:00:36.680 | I didn't want hype or get rich schemes.
00:00:38.840 | I just wanted clear research about what exactly comes before AGI.
00:00:43.240 | Let's start with this four day old statement from Anthropic,
00:00:47.280 | a four billion dollar startup founded by people who left OpenAI over safety concerns.
00:00:53.160 | They outlined that in 2019 it seemed possible
00:00:57.460 | that multi-modality, like PaLM-E, logical reasoning, speed of learning, transfer learning across tasks and long term memory might be walls that would slow or halt the progress of AI.
00:01:09.500 | In the years since several of these walls such as multi-modality and logical reasoning have fallen.
00:01:15.780 | What this means is that the different modes of PaLM-E and Microsoft's new Visual ChatGPT (text, image, video) aren't just cool tricks.
00:01:24.460 | They are major milestones. PaLM-E can look at images and interpret what is in them.
00:01:28.300 | It can also predict what will happen next.
00:01:30.300 | Check out this robot who's about to fall down.
00:01:32.780 | That's just an image but ask PaLM-E what will the robot do next and it says fall.
00:01:39.180 | It knows what's going to happen just from an image.
00:01:41.900 | It can also read faces and answer natural language questions about them.
00:01:45.660 | Check out Kobe Bryant over here.
00:01:47.340 | It recognizes him from an image and you can ask questions about his career.
00:01:51.100 | This example at the bottom I think is especially impressive.
00:01:54.140 | PaLM-E is actually doing the math from this hastily sketched chalkboard.
00:01:59.080 | It's solving those classic math problems that we all got at school just from an image.
00:02:03.400 | Now think about this.
00:02:04.520 | PaLM-E is an advancement on Gato,
00:02:07.800 | which at the time the lead scientist at DeepMind, Nando de Freitas, called game over in the search for AGI.
00:02:15.640 | Someone had written an article fearing that we would never achieve AGI and he said game over.
00:02:20.520 | All we need now are bigger models, more compute efficiency, smarter memory, more modalities etc.
00:02:26.200 | And that was Gato not PaLM-E.
00:02:28.260 | Of course you may have noticed that neither he nor I am completely defining AGI.
00:02:33.300 | That's because there are multiple definitions.
00:02:36.500 | None of which satisfy everyone.
00:02:38.340 | But a broad one for our purposes is that AGI is a model that is at or above the human level on a majority of economic tasks currently done by humans.
00:02:48.580 | You can read here some of the tests about what might constitute AGI.
00:02:52.180 | But that's enough about definitions and multi-modality.
00:02:55.220 | Time to get to my central question.
00:02:57.060 | What is left before AGI?
00:02:59.460 | Well what about learning and reasoning?
00:03:01.460 | This piece from Wired Magazine in late 2019 argued that robust machine reading was a distant prospect.
00:03:09.660 | It gives a challenge of a children's book that has a cute and quite puzzling series of interactions.
00:03:15.180 | It then states that a good reading system would be able to answer questions like these.
00:03:20.220 | And then gives some natural questions about the passage.
00:03:23.260 | I will say these questions do require a degree of logic and common sense reasoning about the world.
00:03:28.220 | So you can guess what I did.
00:03:30.200 | I put them straight into Bing.
00:03:31.960 | We're only three and a half years on from this article.
00:03:34.320 | And look what happened.
00:03:35.480 | I pasted in the exact questions from the article.
00:03:38.800 | And as you might have guessed Bing got them all right pretty much instantly.
00:03:43.400 | So clearly my quest to find the tasks that are left before AGI would have to continue.
00:03:49.000 | Just quickly before we move on from Bing and Microsoft products.
00:03:52.720 | What about specifically GPT-4?
00:03:55.200 | How will it be different from Bing?
00:03:56.800 | Or is it already inside Bing, as many people think?
00:03:59.700 | The much quoted CTO of Microsoft Germany actually didn't confirm that GPT-4 will be multimodal.
00:04:06.440 | Only saying that at the Microsoft event this week we will have multimodal models.
00:04:12.800 | That's different from saying GPT-4 will be multimodal.
00:04:15.480 | I have a video on the eight more certain upgrades inside GPT-4.
00:04:20.200 | So do check that out.
00:04:21.660 | But even with those upgrades inside GPT-4 the key question remains. If such models can already read so well,
00:04:29.180 | what exactly is left before AGI?
00:04:31.640 | So I dove deep in the literature and found this graph from the original PaLM model which
00:04:37.060 | PaLM-E is based on.
00:04:38.840 | Look to the right.
00:04:39.840 | These are a bunch of tasks that the average human rater, at least those who work for Amazon
00:04:45.260 | Mechanical Turk, could beat PaLM at in 2022.
00:04:48.560 | And remember these were just the average raters not the best.
00:04:52.380 | The caption doesn't specify what the tasks are so I looked deep in the appendix and found
00:04:57.600 | the list of tasks that humans did far better on than PaLM.
00:05:01.660 | Here is that appendix and it doesn't make much sense when you initially look at it.
00:05:06.120 | So what I did is I went into the BIG-bench dataset and found each of these exact tasks.
00:05:12.340 | Remember these are the tasks that the average human raters do much better at than PaLM.
00:05:17.180 | I wanted to know exactly what they entailed.
00:05:20.100 | Looking at the names they all seem a bit weird and you're going to be surprised at what some
00:05:24.260 | of them are.
00:05:25.260 | Take the first one.
00:05:26.400 | MNIST ASCII.
00:05:28.140 | Basically representing and recognising ASCII numerals.
00:05:32.900 | Now I can indeed confirm that Bing is still pretty bad at this in terms of numerals and
00:05:39.820 | in terms of letters.
00:05:41.260 | But I'm just not sure how great an accomplishment for humanity this one is.
00:05:46.380 | So I went to the next one which was sequences.
00:05:49.140 | As you can see below this is keeping track of time in a series of events.
00:05:54.080 | This is an interesting one.
00:05:55.760 | Perhaps linked to GPT models' struggles.
00:05:58.120 | I tried the same question multiple times with Bing and ChatGPT and only once out of about
00:06:04.880 | a dozen attempts did it get the question right.
00:06:06.880 | You can pause the video and try it yourself but essentially it's only between 4 and 5
00:06:10.880 | that he could have been at the swimming pool.
00:06:12.880 | You can see here the kind of convoluted logic that Bing goes into.
00:06:15.880 | So really interesting.
00:06:16.880 | This is a task that the models can't yet do.
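To make the task concrete, here is a minimal Python sketch of the reasoning it asks for. The schedule is hypothetical (the actual BIG-bench item is not read out in the video); it is chosen only so that the free slot matches the 4-to-5 answer mentioned above. The function name `free_windows` is also an assumption made for illustration.

```python
def free_windows(day_start, day_end, sightings):
    """Return the gaps in [day_start, day_end) not covered by any sighting."""
    gaps = []
    cursor = day_start  # earliest hour not yet accounted for
    for start, end in sorted(sightings):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < day_end:
        gaps.append((cursor, day_end))
    return gaps


# Hypothetical sightings on a 24h clock: each pair is a period when the
# person was seen somewhere other than the swimming pool.
sightings = [(9, 12), (12, 14), (14, 16), (17, 20)]

# Woke up at 9, pool closes at 20: the only unaccounted hour is 16 to 17.
print(free_windows(9, 20, sightings))
```

A few lines of interval bookkeeping solve it, which is what makes the models' failures on this task notable.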
00:06:26.880 | Again I was expecting something a bit more profound but let's move on to the next one.
00:06:31.140 | Simple text editing of characters, words and sentences.
00:06:34.640 | That was strange.
00:06:36.380 | What does it mean text editing?
00:06:38.060 | Can't Bing do that?
00:06:39.060 | I gave Bing many of these text editing challenges and it did indeed fail most of them.
00:06:44.600 | It was able to replace the letter T with the letter P so it did okay with characters but
00:06:50.740 | it really doesn't seem to know which word in the sentence something is.
00:06:55.820 | You can let me know in the comments
00:06:56.860 | what you think of these kinds of errors and why Bing and ChatGPT keep making them.
00:07:02.240 | The next task that humans did much better on was hyperbaton or intuitive adjective order.
00:07:09.240 | It's questions like which sentence has the correct adjective order?
00:07:13.380 | An old fashioned circular leather exercise car sounds okay or a circular exercise old
00:07:20.080 | fashioned leather car.
00:07:21.300 | What I found interesting though is that even the current version of ChatGPT could now get
00:07:26.840 | this right.
00:07:27.920 | On other tests it gets it a little off but I think we might as well tick this one off
00:07:32.140 | the list.
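The rule being tested can be sketched in a few lines of Python. The category ordering below is one commonly cited convention for English, and the tiny lexicon is hand-written to cover only the example sentence; both are assumptions for illustration and not how BIG-bench scores the task.

```python
# One commonly cited ordering of English adjective categories, earliest first.
ORDER = ["opinion", "size", "age", "shape", "color", "origin", "material", "purpose"]

# Hypothetical lexicon covering only the adjectives in the example above.
LEXICON = {
    "old-fashioned": "age",
    "circular": "shape",
    "leather": "material",
    "exercise": "purpose",
}


def is_conventional(adjectives):
    """True if the adjectives appear in non-decreasing category order."""
    ranks = [ORDER.index(LEXICON[a]) for a in adjectives]
    return ranks == sorted(ranks)


# "old-fashioned circular leather exercise car"
print(is_conventional(["old-fashioned", "circular", "leather", "exercise"]))  # True
# "circular exercise old-fashioned leather car"
print(is_conventional(["circular", "exercise", "old-fashioned", "leather"]))  # False
```

Native speakers apply this ordering without being taught it, which is why the task is described as intuitive.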
00:07:33.140 | The final task that I wanted to focus on in that PaLM appendix is a little more worrying.
00:07:39.140 | It's Triple H. Not the wrestler, the need to be helpful, honest and harmless.
00:07:44.820 | It's kind of worrying that that's the thing it's currently failing at.
00:07:48.120 | I think this is closely linked to hallucination and the fact that we cannot fully control
00:07:54.020 | the outputs of large language models.
00:07:56.820 | At this point if you've learnt anything please do let me know in the comments or leave a
00:07:59.700 | like. It really does encourage me to do more such videos.
00:08:03.540 | All of the papers and pages in this video will be linked in the description.
00:08:07.920 | Anyway hallucinations brought me back to the Anthropic safety statement and their top priority
00:08:13.560 | of mechanistic interpretability which is a fancy way of saying understanding what exactly
00:08:20.440 | is going on inside the machine and one of the stated challenges is to recognise whether
00:08:26.800 | a model is deceptively aligned, playing along with even tests designed to tempt a system
00:08:34.880 | into revealing its own misalignment.
00:08:37.100 | This is very much linked to the Triple H failures we saw a moment ago.
00:08:40.940 | Fine, so honesty is still a big challenge but I wanted to know what single significant
00:08:46.400 | and quantifiable task AI was not close to yet achieving.
00:08:50.960 | Some thought that that task might be storing long term memories as it says here but I knew
00:08:56.780 | that that milestone had already been passed.
00:08:59.920 | This paper from January described augmenting PaLM with read-write memory so that it can
00:09:06.380 | remember everything and process arbitrarily long inputs.
00:09:11.420 | Just imagine a Bing Chat equivalent knowing every email at your company, every customer
00:09:16.740 | record, sale, invoice, the minutes of every meeting etc.
00:09:20.880 | The paper goes on to describe a universal Turing machine which to the best of my understanding
00:09:26.220 | is one that can mimic any computation.
00:09:28.980 | A universal computer if you will.
00:09:31.300 | Indeed the authors state in the conclusion of this paper that the results show that large
00:09:36.340 | language models are already computationally universal as they exist currently provided
00:09:41.680 | only that they have access to an unbounded external memory.
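The idea of a fixed controller paired with unbounded external memory can be illustrated with a toy Turing machine in Python. This is not the paper's construction: the rule table here merely stands in for the language model, and the binary-increment program, the `run` function and the state names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def run(rules, tape_input, state="start", blank="_", max_steps=10_000):
    """Run a finite rule table against a tape that grows on demand."""
    # The tape is a dict, so it can extend in either direction without limit.
    tape = defaultdict(lambda: blank, enumerate(tape_input))
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        write, move, state = rules[(state, tape[head])]
        tape[head] = write
        head += 1 if move == "R" else -1
    if not tape:
        return ""
    cells = [tape[i] for i in range(min(tape), max(tape) + 1)]
    return "".join(cells).strip(blank)


# Finite controller: add 1 to a binary number written on the tape.
# Each entry maps (state, symbol read) to (symbol to write, move, next state).
INCREMENT = {
    ("start", "0"): ("0", "R", "start"),  # scan right to the end of the number
    ("start", "1"): ("1", "R", "start"),
    ("start", "_"): ("_", "L", "carry"),  # passed the last digit, start adding
    ("carry", "1"): ("0", "L", "carry"),  # 1 + 1 = 0, carry continues left
    ("carry", "0"): ("1", "L", "halt"),   # absorb the carry and stop
    ("carry", "_"): ("1", "L", "halt"),   # overflow: grow the tape leftwards
}

print(run(INCREMENT, "1011"))  # 1100
print(run(INCREMENT, "111"))   # 1000
```

The rule table never changes size, yet the tape can grow as far as the input requires; that separation between a bounded controller and unbounded memory is the property the authors' claim depends on.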
00:09:45.240 | What I found fascinating was that Anthropic are so concerned by this accelerating progress
00:09:50.360 | that they don't publish capabilities research because we do not wish to advance the rate
00:09:55.740 | of AI capabilities.
00:09:56.740 | And I must say that Anthropic do know a thing or two about language models having delayed
00:10:02.400 | the public deployment of Claude which you can see on screen until it was no longer state
00:10:06.900 | of the art.
00:10:07.900 | They had this model earlier but delayed the deployment.
00:10:11.040 | Claude by the way is much better than ChatGPT at writing jokes.
00:10:15.440 | Moving on to data though.
00:10:17.040 | In my video on GPT-5 which I do recommend you check out I talk about how important data
00:10:22.580 | is to the improvement of models.
00:10:25.440 | One graph I left out of that video though suggests that there may be some limits to this straight
00:10:31.120 | line improvement in the performance of models.
00:10:33.600 | What you're seeing on screen is a paper released in ancient times which is to say
00:10:38.160 | two weeks ago on Meta's new LLaMA model.
00:10:42.100 | Essentially it shows performance improvements as the model is trained on more tokens.
00:10:46.340 | By tokens think scraped webtext.
00:10:48.640 | But notice how the gains level off after a certain point.
00:10:52.380 | So not every graph you're going to see today is exponential.
00:10:56.700 | The Y axis is different for each task.
00:10:59.580 | And some of the questions it still struggles with are interesting.
00:11:03.200 | Take SIQA which is social interaction question answering.
00:11:07.820 | It peaks out at about 50-52%.
00:11:11.240 | That's questions like these.
00:11:13.180 | Where most humans could easily understand what's going on and find the right answer.
00:11:18.740 | Models really struggle with that even when they're given trillions of tokens.
00:11:22.220 | Or what about natural questions where the model is struggling, getting only about a third of the questions right.
00:11:30.680 | So I dug deep into the literature to find exactly who proposed natural questions as
00:11:34.680 | a test and found this document.
00:11:36.920 | This is a paper published by Google in 2019 and it gives lots of examples of natural questions.
00:11:44.500 | Essentially they're human like questions where it's not always clear exactly what we're
00:11:49.420 | referring to.
00:11:50.420 | Now you could say that's on us to be clearer with our questions.
00:11:53.560 | But let's see how Bing does with some of these.
00:11:56.660 | I asked:
00:11:57.660 | "The guy who plays Mandalorian also did what drugs TV show?"
00:12:01.720 | I deliberately phrased it in a very natural, vague way.
00:12:05.660 | Interestingly it gets it wrong initially in the first sentence but then gets it right
00:12:10.260 | for the second sentence.
00:12:12.280 | I tried dozens of these questions.
00:12:13.800 | You can see another one here.
00:12:15.040 | "Author of L-O-T-R surname origin."
00:12:18.140 | That's a very naturally phrased question.
00:12:20.500 | It surmised that I meant Tolkien, the author of Lord of the Rings and I wanted the origin
00:12:25.840 | of his surname.
00:12:26.640 | And it gave it to me.
00:12:28.360 | Another example was:
00:12:29.360 | "Big Ben City first bomb landed WW2."
00:12:33.160 | It knew I meant London and while it didn't give me the first bomb that landed in London
00:12:38.180 | during World War 2, it gave me a bomb that was named Big Ben.
00:12:42.180 | So not bad.
00:12:43.180 | Overall I found it was about 50/50 just like the Meta LLaMA model.
00:12:47.540 | Maybe a little better.
00:12:48.540 | Going back to the graph we can see that data does help a lot but it isn't everything.
00:12:53.880 | However, Anthropic's theory is that compute
00:12:56.620 | can be a rough proxy for further progress.
00:13:00.620 | And this was a somewhat eye-opening passage.
00:13:03.800 | We know that the capability jump from GPT-2 to GPT-3 resulted mostly from about a 250
00:13:12.140 | times increase in compute.
00:13:14.700 | We would guess that another 50 times increase separates the original GPT-3 model and state
00:13:21.340 | of the art models in 2023.
00:13:24.500 | Think Claude or Bing.
00:13:26.600 | Over the next 5 years we might expect around a 1000 times increase in the computation used
00:13:34.600 | to train the largest models based on trends in compute cost and spending.
00:13:39.280 | If the scaling laws hold this would result in a capability jump that is significantly
00:13:45.480 | larger than the jump from GPT-2 to GPT-3 or GPT-3 to Claude.
00:13:51.600 | And it ends with:
00:13:52.600 | "At Anthropic we're deeply familiar with the capabilities of these systems.
00:13:56.580 | And a jump that is this much larger feels to many of us like it could result in human
00:14:02.260 | level performance across most tasks."
00:14:04.900 | That's AGI.
00:14:06.820 | And 5 years is not a long timeline.
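The three multipliers in that passage are easier to compare on a log scale. The short Python sketch below uses only the figures quoted from Anthropic (250, 50 and 1000 times); the variable and function names are assumptions for illustration. It compares compute only and makes no claim about how capability scales with it.

```python
import math

# Compute multipliers as quoted in the Anthropic statement.
jumps = {
    "GPT-2 to GPT-3": 250,
    "GPT-3 to 2023 state of the art": 50,
    "projected next 5 years": 1000,
}


def orders_of_magnitude(multiplier):
    """Express a multiplier as a power of ten."""
    return math.log10(multiplier)


for label, multiplier in jumps.items():
    print(f"{label}: {multiplier}x = {orders_of_magnitude(multiplier):.1f} orders of magnitude")

# If all three jumps are chained, total growth in compute since GPT-2.
total = math.prod(jumps.values())
print(f"cumulative: {total:,}x")

# A 1000x increase spread evenly over 5 years implies this yearly factor.
annual_growth = 1000 ** (1 / 5)
print(f"implied growth per year: {annual_growth:.2f}x")
```

On this scale the projected jump is about 3 orders of magnitude against roughly 2.4 for GPT-2 to GPT-3, and it works out to compute roughly quadrupling every year for five years.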
00:14:10.040 | This made me think of Sam Altman's AGI statement where he said:
00:14:14.200 | "At some point it may be important to get independent review before starting to train
00:14:19.260 | future systems and for the most advanced efforts to agree to limit the rate of growth of compute
00:14:26.560 | used for creating new models."
00:14:28.560 | Like a compute truce if you will.
00:14:31.480 | Even Sam Altman thinks we might need to slow down a bit.
00:14:34.760 | My question is though, would Microsoft or Tesla or Amazon agree with this truce and
00:14:40.740 | go along with it?
00:14:42.080 | Maybe, maybe not.
00:14:43.380 | But remember that 5 year timeline that Anthropic laid out?
00:14:46.700 | That chimes with this assessment from the alignment startup Conjecture:
00:14:51.020 | "AGI is happening soon.
00:14:53.100 | Significant probability of it happening in less than 5 years."
00:14:56.540 | And it gives plenty of examples, many of which I have already covered.
00:15:00.760 | Others of course give much more distant timelines and as we've seen AGI is not a well defined
00:15:06.100 | concept.
00:15:07.100 | In fact it's so not well defined that some people actually argue that it's already
00:15:11.600 | here.
00:15:12.600 | This article for example says "2022 was the year AGI arrived (just don't call it that)."
00:15:18.360 | This graph originally from Wait But Why
00:15:20.900 | is quite funny but it points to how short a gap there might be between being better than
00:15:26.520 | the average human and being better than Einstein.
00:15:29.900 | I don't necessarily agree with this but it does remind me of another graph I saw recently.
00:15:35.220 | It was this one on the number of academic papers being published on machine learning
00:15:40.100 | and AI in a paper about exponential knowledge growth.
00:15:44.080 | The link to this paper like all the others is in the description.
00:15:47.440 | And it does point to how hard it will be for me and others just to keep up with the latest
00:15:53.960 | papers on AI advancements.
00:15:56.500 | At this point you may have noticed that I haven't given a definitive answer to my
00:16:00.980 | original question which was to find the task that is left before AGI.
00:16:06.180 | I do think there will be tasks such as physically plumbing a house that even an AGI, a generally
00:16:12.200 | intelligent entity, couldn't immediately accomplish simply because it doesn't have
00:16:16.640 | the tools.
00:16:17.640 | It might be smarter than a human but can't use a hammer.
00:16:20.980 | But my other theory to end on is that before AGI there will be a deeper, more complex
00:16:26.480 | and more subjective debate.
00:16:28.840 | Take the benchmarks on reading comprehension.
00:16:31.840 | This graph shows how improvement is being made.
00:16:34.620 | But I have aced most reading comprehension tests such as the GRE so why is the highest
00:16:40.980 | human rater labelled at 80%?
00:16:44.340 | Could it be that progress stalls when we get to the outer edge of ability?
00:16:50.780 | When test examples of sufficient quality get so rare in the dataset that language models
00:16:56.460 | simply cannot perform well on them?
00:16:58.560 | Take this difficult LSAT example.
00:17:00.860 | I won't read it out because by definition it's quite long and convoluted.
00:17:06.340 | And yes, Bing fails it.
00:17:08.980 | Is this the near term future?
00:17:11.100 | Where only obscure feats of logic, deeply subjective analyses of difficult texts and
00:17:17.360 | niche areas of mathematics and science remain out of reach?
00:17:21.020 | Where essentially most people perceive AGI to have already occurred but for a few outlying
00:17:26.440 | tests?
00:17:27.440 | Indeed, is the ultimate CAPTCHA test the ability to deliver a laugh out loud joke or
00:17:33.940 | deeply understand the plight of Oliver Twist?
00:17:37.380 | Anyway thank you for watching to the end of the video.
00:17:40.680 | I'm going to leave you with some bleeding edge text to image generations from Midjourney
00:17:45.280 | version 5.
00:17:46.420 | Whatever happens next with large language models, this is the news story of the century
00:17:51.260 | in my opinion and I do look forward to covering it.
