'Sparks of AGI' - Bombshell GPT-4 Paper: Fully Read w/ 15 Revelations
Chapters
0:00 Intro
1:03 Emergent Capability
2:01 Text to Image
2:31 Human Level
3:19 Game Development
3:55 Olympiad
4:39 Fermi Questions
4:58 AI Personal Assistant
5:35 Handyman
5:53 Mental Map
6:26 Theory of Mind
7:04 Autoregressive Model
10:25 Unrestricted
10:47 Content Choices
11:17 Agency Intrinsic Motivation
12:09 Conclusion
Less than 24 hours ago a report was released that will echo around the world. It is 154 pages and I just finished reading and digesting all of them. Yes, that includes the appendices, and no, I didn't use GPT-4. It revealed, in a nutshell, that GPT-4 shows sparks of artificial general intelligence, the holy grail of AI research. And yes, I was skeptical; then I read the paper. I'm going to break down only the most important revelations one by one. But first I want to address the thought that you must be having: how could these guys have discovered so much when the model has only been out a week? Well, first, as they lay out in the introduction, they interacted with GPT-4 during its early development. These researchers from Microsoft have had the model for months, as early as October of last year or even earlier. They had the raw model, the unrestricted version,
not the final version of GPT-4 that had been fine-tuned to improve safety and reduce potential harms. So they had around six months to experiment with the unrestrained GPT-4. That's enough build-up; it's time to get to the revelations. And all of them I'm going to do in order, aside from this one,
because honestly it blew my mind. On page 45 they say that GPT-4 is able to use tools with very minimal instruction and no demonstrations, and that it makes use of them appropriately. They go on to say that this is an emergent capability and that ChatGPT could not do this. Before I get into the details, I must remind myself that one of the key moments in human evolution was when we discovered how to use tools. So the fact that GPT-4 can use them so well when ChatGPT couldn't is truly a milestone in AI and human history. I'm going to show you more examples throughout the video, but let's start with theirs. GPT-4 knows when it needs to use a calculator and can use one effectively. And in my Path to AGI video I talk about how it struggles with counting characters; here, it knows how to call a character API and work out the number of characters.
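To picture what that tool use looks like mechanically, here is a minimal sketch of the pattern: the model emits a marker for a tool call, and an outer loop executes the call and splices the result back into the text. The CALC/CHARCOUNT names and call syntax are my own invention for illustration, not anything from the paper or from a real API.

```python
import re

def run_tools(model_output: str) -> str:
    """Replace hypothetical tool-call markers in a model's output with real results."""
    def calc(match):
        # Toy calculator tool: evaluate the arithmetic expression inside CALC(...)
        return str(eval(match.group(1), {"__builtins__": {}}))
    def charcount(match):
        # Toy character-counting "API": length of the text inside CHARCOUNT(...)
        return str(len(match.group(1)))
    out = re.sub(r"CALC\(([^)]*)\)", calc, model_output)
    out = re.sub(r"CHARCOUNT\(([^)]*)\)", charcount, out)
    return out

print(run_tools("78 times 912 is CALC(78*912)."))                       # 71136 spliced in
print(run_tools('The word "emergent" has CHARCOUNT(emergent) characters.'))  # 8 spliced in
```

The interesting part, per the paper, is not the dispatch loop (trivial) but that GPT-4 learns *when* to emit the call with almost no instruction.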
Now, that might not seem impressive, but counting was one of its key weaknesses before. If that didn't impress you,
how about text to image? GPT-4 can output detailed images based on a text prompt, and these can then easily be rendered into more detailed drawings using a model like Stable Diffusion version 2.1. Notice how the model knew how to arrange the objects based on the text prompt.
Next, human-level coding. The k equals 5 bit, by the way, means that they picked the best of its five attempts. Deep in the appendices you see this: the human level, across easy, medium and hard problems. And by the way, they were a little bit generous with the humans, because they didn't include those who got none of the tasks right. They took those people out of the database and compared GPT-4 only to the humans that got at least one task right. And you thought it was just standard coding? How about 3D game development?
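As an aside, that best-of-five protocol has a standard formalisation in code-generation benchmarks: the unbiased pass@k estimator (this formula comes from the wider evaluation literature, not from this paper). A quick sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: given n sampled solutions of which c are correct,
    the probability that at least one of k drawn samples passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# If a model solves a problem in 2 of 10 samples, best-of-5 succeeds about 78% of the time.
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

So reporting k=5 is a meaningful boost over k=1: even a model that only occasionally solves a problem looks strong when you keep its best of five tries.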
When given a task to create a 3D game of some complexity, I must say, the report says that GPT-4 produces a working game in a zero-shot fashion. ChatGPT, by contrast, responds that it can't do it. And when I say a complex game: the enemy is trying to rush towards you, and you have a defender that's trying to block the enemy. It's not a simple game. As you can see from this video, they are not the only ones who have used GPT-4 to create a detailed game. And trust me, I would talk about this amazing example for longer, but I need to get on to the next topic. And that is that they tested GPT-4 on the 2022 International Mathematics Olympiad. That was not in its database, and trust me, I've studied for this kind of thing and it is not easy. It's an extremely high level of math. And as the authors say, solving this problem requires a more creative approach, as there is no clear strategy for beginning the proof.
As you might expect, GPT-4 manages to produce a correct proof. As I have demonstrated in other videos, it does get some math problems wrong. As the paper points out, that's often down to a lack of technical proficiency: making basic calculation errors. But remember, the paper proved that it could use a calculator if given access to one. Give GPT-4 tools and honestly it is going to shock the world.
Next, and this is a quick one, but I loved it: give it Fermi questions. These are the kind of questions asked in really difficult interviews, and they have no easy answer. Things like: how many golf balls could you fit in a swimming pool? Or: please estimate roughly how many Fermi questions are being asked every day. Truly complex questions, and GPT-4 can hazard great guesses.
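To show what answering a Fermi question actually involves, here is a back-of-envelope pass at the golf-ball one; every number in it is a rough assumption of mine, which is the whole point of Fermi estimation: get the order of magnitude, not the exact answer.

```python
# Rough Fermi estimate: golf balls in a swimming pool (all figures are assumptions).
POOL_VOLUME_M3 = 25 * 12.5 * 2           # 25 m x 12.5 m pool, 2 m deep
BALL_DIAMETER_M = 0.0427                 # a golf ball is about 4.27 cm across
BALL_VOLUME_M3 = (4 / 3) * 3.14159 * (BALL_DIAMETER_M / 2) ** 3
PACKING_FRACTION = 0.64                  # random close packing of spheres

balls = POOL_VOLUME_M3 * PACKING_FRACTION / BALL_VOLUME_M3
print(f"Roughly {balls:.1e} golf balls")  # on the order of ten million
```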
Next, and this one was worth waiting for: finally we get a personal assistant that actually works. I know it's called Google Assistant, but it isn't really an assistant, is it? GPT-4 can use available APIs to retrieve information about a user's calendar, coordinate with other people over email, book a dinner, and message the user with the details. This is a sample of the interactions it performed: sending an email to Luke and then receiving Luke's reply, checking the calendar, then putting the event in the calendar, then sending an email to Joe, and so on. When this becomes available in an app format, we will finally have that AI personal assistant that we have been waiting for. Moving on: did you know that GPT-4 can be your personal handyman? One of the authors of the paper had a leak in their bathroom. They went through a diagnostic process with GPT-4 and it figured out what the problem was. When the author followed GPT-4's advice, what happened? The leak was gone. The problem was solved. If you thought that was impressive, wait till you see this.
If it's allowed to ask enough questions, as you can see above, GPT-4 can build up a mental map of, say, a house that it is entering. On the left you can see a map of the true locations of each room, and on the right you can see GPT-4's mental image of them. That was revealed, by the way, by getting it to draw the map with pyplot. This ability, of course, is going to become very relevant when GPT-4 gets embodied, and I'm going to talk about that in my next video. Speaking of which, if you're learning anything from this video, please don't forget to leave a like and let me know in the comments. Next up is theory of mind, and I have done a whole video on this, so do check it out afterwards, but essentially the authors discovered the same thing that we have, which is to say that GPT-4 can build up a mental model of what other people are thinking. You can pause the video and read the scenario yourself. It essentially involves knowing what Alice must be thinking, what she must believe about a situation, even though the reality is different.
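The bookkeeping such a false-belief test demands can be sketched in a few lines; the keys-and-drawer details below are made up, standing in for the paper's Alice scenario. The rule being tested is simple to state but easy for models to fumble: an agent's belief only updates on events the agent actually observed.

```python
# Minimal bookkeeping for a Sally-Anne-style false-belief test (made-up details).
world = {"keys": "drawer"}          # ground truth
alice_belief = {"keys": "drawer"}   # Alice saw the keys put in the drawer

def move(obj, place, seen_by_alice):
    world[obj] = place              # reality always updates
    if seen_by_alice:
        alice_belief[obj] = place   # belief updates only if Alice observed it

move("keys", "backpack", seen_by_alice=False)  # moved while Alice is out of the room

print("The keys really are in the", world["keys"])        # backpack
print("Alice will look in the", alice_belief["keys"])     # drawer
```

Passing the test means answering from `alice_belief`, not from `world`, exactly the separation discussed next.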
Separating what is actually true from what a human being believes to be true. This is a key milestone on the road to possible consciousness, but if you're interested in that topic, honestly, check out my video on it. Now I know at this point you're thinking I must have covered the best bits, but no, there's more. On page 80 the authors sketch out how GPT-4 is an auto-regressive model, which means that it bases its outputs on what has already come before. That's great, but it stops it from planning ahead: it doesn't know how its output is going to end before it starts. I'm going to reveal the implications of this fascinating weakness in a couple of ways, first with their examples and then with one of my own making. In this task they try to get GPT-4 to create a poem which begins with a sentence and then ends with the same sentence in reverse order, but it's got to make sense. GPT-4 simply can't do it, because it doesn't know how its poem is going to end before it starts. Remember, it's an auto-regressive model. After repeatedly and unsuccessfully testing GPT-4's ability to do this, the authors broke it down like this: GPT-4 is amazing at incremental tasks but not as good at discontinuous tasks. Incremental tasks are those where you follow a standard procedure, building things up step by step, like composing a poem using a rhyme scheme or writing a summary of a text: start at the beginning, then the next sentence, and so on.
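Step-by-step generation of this kind can be sketched with a toy stand-in for the model, a hand-written bigram table rather than a real LLM; the point is that each next word is chosen from the prefix alone, with no view of how the output will end.

```python
# Toy autoregressive generation: the "model" is just a bigram lookup table,
# so the next word depends only on the previous word. At no point does the
# loop know the final word before generating it.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down", "down": "<end>"}

def generate(start: str, max_steps: int = 10) -> list[str]:
    tokens = [start]
    for _ in range(max_steps):
        nxt = BIGRAMS.get(tokens[-1], "<end>")  # conditioned only on the prefix
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate("the")))  # the cat sat down
```

Real models condition on the whole prefix, not one word, but the structural limitation is the same: generation runs forward only.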
But discontinuous tasks require you to know a bit about the output, the end result, before you start. They give a great example: writing a joke. You kind of need to know the punchline before you do the setup. Maybe that's why GPT-4 is so bad at joke telling: it can't think of an amazing punchline and then work backwards to create the scenario around it. I came up with a simple demonstration of this to show you guys. Try asking GPT-4 this question: how many words are in the full response to this prompt? If you think about it, it has to know the final result of its output to give a correct answer. Because it's just generating an answer word by word, token by token, it can't do this. It said that there are 43 words in the full response to this prompt, including the words in the question and the answer. Okay, that's kind of weird; I didn't want it to include the question itself, but let's see if it got it right. I said: list them out and count them. And then it went through, including the prompt, which I didn't want, but fine: how many words are in the full response, et cetera. And lo and behold, there were only 31 words in the prompt and the output. But remember, it had said that there were 43 words. It doesn't know the end result when it starts. Before you conclude that this will be a permanent block on language models like GPT-4 progressing further, ponder this. A paper came out in January showing that it was at least theoretically possible to augment large language models with external memory, and the paper both asks and answers this question: such works raise the question of whether augmenting a language model with an external feedback loop is merely useful, or fundamentally expands the range of computations that can be performed. This paper gives an affirmative answer.
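The feedback-loop idea can be sketched with a stub standing in for the language model; this shows only the shape of the outer loop, nothing like that paper's actual computational-universality argument. The stub, instruction names, and memory layout are all my own invention.

```python
# Minimal sketch of an external feedback loop: an outer program shows the model
# the current memory, the model replies with an instruction, and the loop
# applies it to memory and repeats. The "model" is a hard-coded stub, not an LLM.

def stub_model(memory: dict) -> str:
    # Stands in for a language model prompted to count up to 3.
    return "increment" if memory["count"] < 3 else "halt"

def run_with_memory(model, memory: dict) -> dict:
    while True:
        instruction = model(memory)     # the model sees the external memory
        if instruction == "halt":
            return memory
        if instruction == "increment":
            memory["count"] += 1        # the loop updates memory for next round

print(run_with_memory(stub_model, {"count": 0}))  # {'count': 3}
```

The state lives outside the model, which is exactly what lets the combined system do things the forward-only generator above cannot.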
Now, obviously, it's still a huge leap from here to there. But imagine if GPT-4, or say GPT-5, gets access to an external memory. Then, as the authors note, you could have different layers of language models: one doing the fast-thinking subroutines and another doing the slow-thinking big picture, monitoring the output of the first language model and adjusting from there. Arguably that would be the ultimate breakthrough. Possibly even a dangerous breakthrough. Speaking of dangerous: on page 84 the authors note that the unrestricted GPT-4 is incredible at propaganda and conspiracy theories. It can design entire misinformation campaigns, replete with links and images, and I worry that it's only a matter of time before someone jailbreaks this version of GPT-4 and uses it in the wild. Next, and I think this is quite a stunning admission from researchers at Microsoft: they say that some people may ask for the ability and right to decide and specify which content they want or do not want to be crawled. They're flagging this up in terms of privacy and potential lawsuits.
The context they're giving is of models like GPT-4 taking away jobs, and if they're taking away jobs from people whose content has been crawled, I wouldn't be surprised if there's some contention there. Two final points from this bombshell paper. The authors talk about equipping LLMs, large language models, with agency and intrinsic motivation, and say that this is a fascinating and important direction for future work. This is in the context of GPT-4 not being motivated by anything, just being passive. Well, I do think that that's a fascinating direction for future work, but it's also a very concerning one. Giving a language model intrinsic motivation not only raises ethical questions, like when would it then have rights, but it also raises huge safety concerns. Of course, they do admit that with this direction of work, great care would have to be taken on alignment and safety. I'm not personally too keen on this phrasing, that giving it motivation is a fascinating and important direction, as if it's definitely something we should be working on. This is especially true in the context of the final part of the paper. They admit that they don't really know what is actually happening. They know what GPT-4 is capable of, but not really why it's capable of those things. Of course they propose hypotheses, but they end with this: "Overall, elucidating the nature and mechanisms of AI systems such as GPT-4 is a formidable challenge that has suddenly become important and urgent." Translated: we need to figure out how these things work, and fast. Well, I definitely agree with that. Thank you so much for watching to the end. Let me know your thoughts in the comments, and have a wonderful day.