'Sparks of AGI' - Bombshell GPT-4 Paper: Fully Read w/ 15 Revelations
Chapters
0:00 Intro
1:03 Emergent Capability
2:01 Text to Image
2:31 Human Level
3:19 Game Development
3:55 Olympiad
4:39 Fermi Questions
4:58 AI Personal Assistant
5:35 Handyman
5:53 Mental Map
6:26 Theory of Mind
7:04 Autoregressive Model
10:25 Unrestricted
10:47 Content Choices
11:17 Agency Intrinsic Motivation
12:09 Conclusion
Less than 24 hours ago a report was released that will echo around the world. It is 154 pages and I just finished reading and digesting all of them. Yes, that includes the appendices, and no, I didn't use GPT-4. It revealed, in a nutshell, that GPT-4 shows sparks of artificial general intelligence, the holy grail of AI research. And yes, I was skeptical; then I read the paper. I'm going to break down only the most important revelations one by one. But first I want to address the thought that you must be having: how could these guys have discovered so much when the model has only been out a week? Well, first, as they lay out in the introduction, they interacted with GPT-4 during its early development. These researchers from Microsoft have had the model for months, as early as October of last year or even earlier. They had the raw model, the unrestricted version,
not the final version of GPT-4 that had been fine-tuned to improve safety and reduce potential harms. So they had around six months to experiment with the unrestrained GPT-4. That's enough build-up; it's time to get to the revelations. And all of them I'm going to do in order, aside from this one,
because honestly it blew my mind. On page 45 they say that GPT-4 is able to use tools with very minimal instruction and no demonstrations, and that it makes use of them appropriately. They go on to say that this is an emergent capability and that ChatGPT could not do this. Before I get into the details, I must remind myself that one of the key moments in human evolution was when we discovered how to use tools. So the fact that GPT-4 can use them so well when ChatGPT couldn't is truly a milestone in AI and human history. I'm going to show you more examples throughout the video, but let's start with theirs. GPT-4 knows when it needs to use a calculator and can use one effectively. And in my Path to AGI video I talk about how it struggles with counting characters; here, it knows how to call a character API and work out the number of characters.
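To picture what that tool use looks like mechanically, here is a minimal sketch of the pattern: the model emits a marker for a tool call, and an outer loop executes the call and splices the result back into the text. The CALC/CHARCOUNT names and call syntax are my own invention for illustration, not anything from the paper or from a real API.

```python
import re

def run_tools(model_output: str) -> str:
    """Replace hypothetical tool-call markers in a model's output with real results."""
    def calc(match):
        # Toy calculator tool: evaluate the arithmetic expression inside CALC(...)
        return str(eval(match.group(1), {"__builtins__": {}}))
    def charcount(match):
        # Toy character-counting "API": length of the text inside CHARCOUNT(...)
        return str(len(match.group(1)))
    out = re.sub(r"CALC\(([^)]*)\)", calc, model_output)
    out = re.sub(r"CHARCOUNT\(([^)]*)\)", charcount, out)
    return out

print(run_tools("78 times 912 is CALC(78*912)."))                       # 71136 spliced in
print(run_tools('The word "emergent" has CHARCOUNT(emergent) characters.'))  # 8 spliced in
```

The interesting part, per the paper, is not the dispatch loop (trivial) but that GPT-4 learns *when* to emit the call with almost no instruction.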
Now, that might not seem impressive, but counting was one of its key weaknesses before. If that didn't impress you,
how about text to image? GPT-4 can output detailed images based on a text prompt, and these can then easily be rendered into more detailed drawings using a model like Stable Diffusion version 2.1. Notice how the model knew how to arrange the objects based on the text prompt.
Next, human-level coding. The k equals 5 bit, by the way, means that they picked the best of its five attempts. Deep in the appendices you see this: the human level, across easy, medium and hard problems. And by the way, they were a little bit generous with the humans, because they didn't include those who got none of the tasks right. They took those people out of the database and compared GPT-4 only to the humans that got at least one task right. And you thought it was just standard coding? How about 3D game development?
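As an aside, that best-of-five protocol has a standard formalisation in code-generation benchmarks: the unbiased pass@k estimator (this formula comes from the wider evaluation literature, not from this paper). A quick sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: given n sampled solutions of which c are correct,
    the probability that at least one of k drawn samples passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# If a model solves a problem in 2 of 10 samples, best-of-5 succeeds about 78% of the time.
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

So reporting k=5 is a meaningful boost over k=1: even a model that only occasionally solves a problem looks strong when you keep its best of five tries.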
When given a task to create a 3D game of some complexity, I must say, the report says that GPT-4 produces a working game in a zero-shot fashion. ChatGPT, by contrast, responds that it can't do it. And when I say a complex game: the enemy is trying to rush towards you, and you have a defender that's trying to block the enemy. It's not a simple game. As you can see from this video, they are not the only ones who have used GPT-4 to create a detailed game. And trust me, I would talk about this amazing example for longer, but I need to get on to the next topic. And that is that they tested GPT-4 on the 2022 International Mathematics Olympiad. That was not in its database, and trust me, I've studied for this kind of thing and it is not easy. It's an extremely high level of math. And as the authors say, solving this problem requires a more creative approach, as there is no clear strategy for beginning the proof.
As you might expect, GPT-4 manages to produce a correct proof. As I have demonstrated in other videos, it does get some math problems wrong. As the paper points out, that's often down to a lack of technical proficiency: making basic calculation errors. But remember, the paper proved that it could use a calculator if given access to one. Give GPT-4 tools and honestly it is going to shock the world.
Next, and this is a quick one, but I loved it: give it Fermi questions. These are the kind of questions asked in really difficult interviews, and they have no easy answer. Things like: how many golf balls could you fit in a swimming pool? Or: please estimate roughly how many Fermi questions are being asked every day. Truly complex questions, and GPT-4 can hazard great guesses.
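To show what answering a Fermi question actually involves, here is a back-of-envelope pass at the golf-ball one; every number in it is a rough assumption of mine, which is the whole point of Fermi estimation: get the order of magnitude, not the exact answer.

```python
# Rough Fermi estimate: golf balls in a swimming pool (all figures are assumptions).
POOL_VOLUME_M3 = 25 * 12.5 * 2           # 25 m x 12.5 m pool, 2 m deep
BALL_DIAMETER_M = 0.0427                 # a golf ball is about 4.27 cm across
BALL_VOLUME_M3 = (4 / 3) * 3.14159 * (BALL_DIAMETER_M / 2) ** 3
PACKING_FRACTION = 0.64                  # random close packing of spheres

balls = POOL_VOLUME_M3 * PACKING_FRACTION / BALL_VOLUME_M3
print(f"Roughly {balls:.1e} golf balls")  # on the order of ten million
```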
Next, and this one was worth waiting for: finally we get a personal assistant that actually works. I know it's called Google Assistant, but it isn't really an assistant, is it? GPT-4 can use available APIs to retrieve information about a user's calendar, coordinate with other people over email, book a dinner, and message the user with the details. This is a sample of the interactions it performed: sending an email to Luke and then receiving Luke's reply, checking the calendar, then putting the event in the calendar, then sending an email to Joe, and so on. When this becomes available in an app format, we will finally have that AI personal assistant that we have been waiting for. Moving on: did you know that GPT-4 can be your personal handyman? One of the authors of the paper had a leak in their bathroom. They went through a diagnostic process with GPT-4 and it figured out what the problem was. When the author followed GPT-4's advice, what happened? The leak was gone. The problem was solved. If you thought that was impressive, wait till you see this.
If it's allowed to ask enough questions, as you can see above, GPT-4 can build up a mental map of, say, a house that it is entering. On the left you can see a map of the true locations of each room, and on the right you can see GPT-4's mental image of them. That was revealed, by the way, by getting it to draw the map with pyplot. This ability, of course, is going to become very relevant when GPT-4 gets embodied, and I'm going to talk about that in my next video. Speaking of which, if you're learning anything from this video, please don't forget to leave a like and let me know in the comments. Next up is theory of mind, and I have done a whole video on this, so do check it out afterwards, but essentially the authors discovered the same thing that we have, which is to say that GPT-4 can build up a mental model of what other people are thinking. You can pause the video and read the scenario yourself. It essentially involves knowing what Alice must be thinking, what she must believe about a situation, even though the reality is different.
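The bookkeeping such a false-belief test demands can be sketched in a few lines; the keys-and-drawer details below are made up, standing in for the paper's Alice scenario. The rule being tested is simple to state but easy for models to fumble: an agent's belief only updates on events the agent actually observed.

```python
# Minimal bookkeeping for a Sally-Anne-style false-belief test (made-up details).
world = {"keys": "drawer"}          # ground truth
alice_belief = {"keys": "drawer"}   # Alice saw the keys put in the drawer

def move(obj, place, seen_by_alice):
    world[obj] = place              # reality always updates
    if seen_by_alice:
        alice_belief[obj] = place   # belief updates only if Alice observed it

move("keys", "backpack", seen_by_alice=False)  # moved while Alice is out of the room

print("The keys really are in the", world["keys"])        # backpack
print("Alice will look in the", alice_belief["keys"])     # drawer
```

Passing the test means answering from `alice_belief`, not from `world`, exactly the separation discussed next.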
Separating what is actually true from what a human being believes to be true. This is a key milestone on the road to possible consciousness, but if you're interested in that topic, honestly, check out my video on it. Now I know at this point you're thinking I must have covered the best bits, but no, there's more. On page 80 the authors sketch out how GPT-4 is an auto-regressive model, which means that it bases its outputs on what has already come before. That's great, but it stops it from planning ahead: it doesn't know how its output is going to end before it starts. I'm going to reveal the implications of this fascinating weakness in a couple of ways, first with their examples and then with one of my own making. In this task they try to get GPT-4 to create a poem which begins with a sentence and then ends with the same sentence in reverse order, but it's got to make sense. GPT-4 simply can't do it, because it doesn't know how its poem is going to end before it starts. Remember, it's an auto-regressive model. After repeatedly and unsuccessfully testing GPT-4's ability to do this, the authors broke it down like this: GPT-4 is amazing at incremental tasks but not as good at discontinuous tasks. Incremental tasks are those where you follow a standard procedure, building things up step by step, like composing a poem using a rhyme scheme or writing a summary of a text: start at the beginning, then the next sentence, and so on.
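Step-by-step generation of this kind can be sketched with a toy stand-in for the model, a hand-written bigram table rather than a real LLM; the point is that each next word is chosen from the prefix alone, with no view of how the output will end.

```python
# Toy autoregressive generation: the "model" is just a bigram lookup table,
# so the next word depends only on the previous word. At no point does the
# loop know the final word before generating it.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down", "down": "<end>"}

def generate(start: str, max_steps: int = 10) -> list[str]:
    tokens = [start]
    for _ in range(max_steps):
        nxt = BIGRAMS.get(tokens[-1], "<end>")  # conditioned only on the prefix
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate("the")))  # the cat sat down
```

Real models condition on the whole prefix, not one word, but the structural limitation is the same: generation runs forward only.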
But discontinuous tasks require you to know a bit about the output, the end result, before you start. They give a great example: writing a joke. You kind of need to know the punchline before you do the setup. Maybe that's why GPT-4 is so bad at joke telling: it can't think of an amazing punchline and then work backwards to create the scenario around it. I came up with a simple demonstration of this to show you guys. Try asking GPT-4 this question: how many words are in the full response to this prompt? If you think about it, it has to know the final result of its output to give a correct answer. Because it's just generating an answer word by word, token by token, it can't do this. It said that there are 43 words in the full response to this prompt, including the words in the question and the answer. Okay, that's kind of weird; I didn't want it to include the question itself, but let's see if it got it right. I said: list them out and count them. And then it went through, including the prompt, which I didn't want, but fine: how many words are in the full response, et cetera. And lo and behold, there were only 31 words in the prompt and the output. But remember, it had said that there were 43 words. It doesn't know the end result when it starts. Before you conclude that this will be a permanent block on language models like GPT-4 progressing further, ponder this. A paper came out in January showing that it was at least theoretically possible to augment large language models with external memory, and the paper both asks and answers this question: such works raise the question of whether augmenting a language model with an external feedback loop is merely useful, or fundamentally expands the range of computations that can be performed. This paper gives an affirmative answer.
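The feedback-loop idea can be sketched with a stub standing in for the language model; this shows only the shape of the outer loop, nothing like that paper's actual computational-universality argument. The stub, instruction names, and memory layout are all my own invention.

```python
# Minimal sketch of an external feedback loop: an outer program shows the model
# the current memory, the model replies with an instruction, and the loop
# applies it to memory and repeats. The "model" is a hard-coded stub, not an LLM.

def stub_model(memory: dict) -> str:
    # Stands in for a language model prompted to count up to 3.
    return "increment" if memory["count"] < 3 else "halt"

def run_with_memory(model, memory: dict) -> dict:
    while True:
        instruction = model(memory)     # the model sees the external memory
        if instruction == "halt":
            return memory
        if instruction == "increment":
            memory["count"] += 1        # the loop updates memory for next round

print(run_with_memory(stub_model, {"count": 0}))  # {'count': 3}
```

The state lives outside the model, which is exactly what lets the combined system do things the forward-only generator above cannot.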
Now, obviously, it's still a huge leap from here to there. But imagine if GPT-4, or say GPT-5, gets access to an external memory. Then, as the authors note, you could have different layers of language models: one doing the fast-thinking subroutines and another doing the slow-thinking big picture, monitoring the output of the first language model and adjusting from there. Arguably that would be the ultimate breakthrough. Possibly even a dangerous breakthrough. Speaking of dangerous: on page 84 the authors note that the unrestricted GPT-4 is incredible at propaganda and conspiracy theories. It can design entire misinformation campaigns, replete with links and images, and I worry that it's only a matter of time before someone jailbreaks this version of GPT-4 and uses it in the wild. Next, and I think this is quite a stunning admission from researchers at Microsoft: they say that some people may ask for the ability and right to decide and specify which content they want or do not want to be crawled. They're flagging this up in terms of privacy and potential lawsuits.
The context they're giving is of models like GPT-4 taking away jobs, and if they're taking away jobs from people whose content has been crawled, I wouldn't be surprised if there's some contention there. Two final points from this bombshell paper. The authors talk about equipping LLMs, large language models, with agency and intrinsic motivation, and say that this is a fascinating and important direction for future work. This is in the context of GPT-4 not being motivated by anything, just being passive. Well, I do think that that's a fascinating direction for future work, but it's also a very concerning one. Giving a language model intrinsic motivation not only raises ethical questions, like when would it then have rights, but it also raises huge safety concerns. Of course, they do admit that with this direction of work, great care would have to be taken on alignment and safety. I'm not personally too keen on this phrasing, that giving it motivation is a fascinating and important direction, as if it's definitely something we should be working on. This is especially true in the context of the final part of the paper. They admit that they don't really know what is actually happening. They know what GPT-4 is capable of, but not really why it's capable of those things. Of course they propose hypotheses, but they end with this: "Overall, elucidating the nature and mechanisms of AI systems such as GPT-4 is a formidable challenge that has suddenly become important and urgent." Translated: we need to figure out how these things work, and fast. Well, I definitely agree with that. Thank you so much for watching to the end. Let me know your thoughts in the comments, and have a wonderful day.