'Sparks of AGI' - Bombshell GPT-4 Paper: Fully Read w/ 15 Revelations


Chapters

0:00 Intro
1:03 Emergent Capability
2:01 Text to Image
2:31 Human Level
3:19 Game Development
3:55 Olympiad
4:39 Fermi Questions
4:58 AI Personal Assistant
5:35 Handyman
5:53 Mental Map
6:26 Theory of Mind
7:04 Autoregressive Model
10:25 Unrestricted
10:47 Content Choices
11:17 Agency Intrinsic Motivation
12:09 Conclusion

Transcript

Less than 24 hours ago a report was released that will echo around the world. It is 154 pages and I just finished reading and digesting all of them. Yes that includes the appendices and no I didn't use GPT-4. It revealed in a nutshell that GPT-4 shows sparks of artificial general intelligence, the holy grail of AI research.

And yes I was skeptical, then I read the paper. I'm going to break down only the most important revelations one by one. But first I want to address the thought that you must be having. How could these guys have discovered so much when the model has only been out a week?

Well first, as they lay out in the introduction, they have interacted with GPT-4 during its early development. These researchers from Microsoft have had the model for months, since as early as October of last year, possibly earlier. They had the raw model, the unrestricted version, not the final version of GPT-4 that had been fine-tuned to improve safety and reduce the risk of harmful outputs.

So they had around six months to experiment with the unrestrained GPT-4. That's enough build-up, it's time to get to the revelations. And I'm going to go through all of them in order, aside from this one, because honestly it blew my mind.

On page 45 they say GPT-4 is able to use tools with very minimal instruction and no demonstrations and they make use of them appropriately. They go on to say that this is an emergent capability and ChatGPT could not do this. Before I get into the details I must remind myself that one of the key moments in human evolution was when we discovered how to use tools.

So the fact that GPT-4 can use them so well and ChatGPT couldn't is truly a milestone in AI and human history. I'm going to show you more examples throughout the video but let's start with their examples. It knows when it needs to use a calculator and can use it effectively.

In my Path to AGI video I talk about how it struggles with counting characters. Here, by contrast, it knows how to call a character API and work out the number of characters.
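To make that pattern concrete, here is a minimal sketch of that kind of tool loop. The tool names, the call format and the model wrapper are my own stand-ins, not the paper's actual harness; the point is just the shape of it: the model emits a call, the harness runs it, and the result goes back into the prompt.

    import re

    # Toy tools standing in for the calculator and character API mentioned
    # above; the names and call format here are assumptions, not the paper's.
    TOOLS = {
        "CALC": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy calculator
        "CHARCOUNT": lambda text: str(len(text)),                    # character counter
    }

    def answer_with_tools(model, question: str) -> str:
        """One round of tool use: if the model replies with a tool call,
        run it and hand the result back for a final answer."""
        reply = model(f'Tools: CALC(expr), CHARCOUNT(text).\nQ: {question}')
        match = re.fullmatch(r'(\w+)\("?(.*?)"?\)', reply.strip())
        if match and match.group(1) in TOOLS:
            result = TOOLS[match.group(1)](match.group(2))
            return model(f'Q: {question}\n{reply} -> {result}\nFinal answer:')
        return reply  # the model chose not to use a tool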

Now that might not seem impressive, but it was one of its key weaknesses before. If that didn't impress you, how about text to image? GPT-4 can output detailed images based on a text prompt. These can then easily be rendered into more detailed drawings using a model like Stable Diffusion version 2.1.

Notice how the model knew how to arrange the objects based on the text prompt.
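As a rough illustration of that rendering step, here is how you might push a rasterized version of GPT-4's rough drawing through Stable Diffusion 2.1's img2img pipeline using the diffusers library. The file names and the prompt are placeholders; the paper does not publish its exact pipeline.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    # Load Stable Diffusion 2.1 for image-to-image refinement.
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")

    # "sketch.png" is a placeholder for GPT-4's rough drawing
    # (e.g. its SVG output rasterized to a bitmap).
    sketch = Image.open("sketch.png").convert("RGB")

    result = pipe(
        prompt="a detailed, realistic rendering of the sketched scene",
        image=sketch,
        strength=0.75,  # how far the model may depart from the rough sketch
    ).images[0]
    result.save("refined.png")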

Next up, coding. The k equals 5 bit, by the way, means that they picked the best of its five attempts. Deep in the appendices you see this. This is the human level.

Easy, medium and hard. And by the way they were a little bit generous with humans because they didn't include those guys who got none of the tasks right. They took those guys out of the database and compared GPT-4 only to those humans that got at least one task right.
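If you're wondering how "best of its five attempts" is usually scored, the standard metric is the unbiased pass@k estimator from the Codex paper; I'm assuming that's what the k equals 5 refers to here. A minimal version:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: the probability that at least one of k
        attempts drawn from n total attempts (c of them correct) passes."""
        if n - c < k:
            return 1.0  # too few failures to fill k draws, so a success is guaranteed
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=5, c=1, k=5))  # 1.0: with k = n, one correct attempt is enough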

And you thought it was just standard coding? How about 3D game development? When given the task of creating a 3D game, and a game of some complexity I must say, the report says that GPT-4 produces a working game in zero-shot fashion. ChatGPT, by contrast, responds that it can't do it.

When I say a complex game, I mean one where the enemy tries to rush towards you and you have a defender trying to block it. It's not a simple game. As you can see from this video, they are not the only ones who have used GPT-4 to create a detailed game.

And trust me, I would talk about this amazing achievement for longer, but I need to get on to the next topic. And that is that they tested GPT-4 on the 2022 International Mathematics Olympiad. That was not in its training data, and trust me, I've studied for this kind of thing and it is not easy.

It's an extremely high level of math. And as the authors say solving this problem requires a more creative approach as there is no clear strategy for beginning the proof. As you might expect GPT-4 manages to produce a correct proof. As I have demonstrated in other videos it does get some math problems wrong.

As the paper points out, that's often down to a lack of technical proficiency: it makes basic calculation errors. But remember, the paper proved that it could use a calculator if given access to one. Give GPT-4 tools and honestly it is going to shock the world. Next, and this is a quick one, but I loved it.

Give it Fermi questions. These are the kind of questions asked in really difficult interviews and they have no easy answer. Things like how many golf balls could you fit in a swimming pool? Or please estimate roughly how many Fermi questions are being asked every day. Truly complex questions and GPT-4 can hazard great guesses.
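Here's a back-of-the-envelope version of the golf-ball question, just to show the style of reasoning these questions reward. Every figure below is a rough assumption of mine, not a measured value.

    # Fermi estimate: golf balls in an Olympic swimming pool.
    pool_volume_cm3 = (50 * 25 * 2) * 1e6        # 50 m x 25 m x 2 m deep, in cm^3

    ball_diameter_cm = 4.3
    ball_volume_cm3 = (4 / 3) * 3.14159 * (ball_diameter_cm / 2) ** 3  # ~41.6 cm^3

    packing_fraction = 0.64                      # random close packing of spheres

    balls = pool_volume_cm3 * packing_fraction / ball_volume_cm3
    print(f"~{balls:.0e} golf balls")            # on the order of tens of millions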

Next and this one was worth waiting for. Finally we get a personal assistant that actually works. I know it's called Google Assistant but it isn't really an assistant is it? GPT-4 can use available APIs to retrieve information about a user's calendar, coordinate with other people over email, book a dinner and message the user with the details.

This is a sample of the interactions it performed sending an email to Luke and then receiving Luke's reply, checking the calendar then putting the event in the calendar then sending an email to Joe etc etc. When this becomes available in an app format we will finally have that AI personal assistant that we have been waiting for.
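The interaction pattern described above is essentially a tool loop over calendar and email APIs. Here's a hypothetical sketch of it; the tool names, the JSON call format and the model wrapper are all stand-ins of mine, not anything from the paper.

    import json

    # Hypothetical stand-in tools; none of these are real APIs.
    def check_calendar(date): return "free all evening"   # stand-in
    def add_event(date, title): return "event added"      # stand-in
    def send_email(to, body): return f"sent to {to}"      # stand-in

    TOOLS = {"check_calendar": check_calendar,
             "add_event": add_event,
             "send_email": send_email}

    def run_assistant(model, task: str, max_steps: int = 10) -> str:
        """The model emits JSON tool calls until it replies in plain text,
        which we treat as its final message to the user."""
        context = task
        for _ in range(max_steps):
            reply = model(context)  # e.g. '{"tool": "send_email", "args": {...}}'
            try:
                call = json.loads(reply)
            except json.JSONDecodeError:
                return reply  # plain text: the final message
            result = TOOLS[call["tool"]](**call["args"])
            context += f"\n{reply}\n-> {result}"  # feed the tool result back in
        return "step limit reached"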

Moving on did you know that GPT-4 can be your personal handyman? One of the authors of the paper had a leak in their bathroom. They went through a diagnostic process with GPT-4 and it figured out what the problem was. When the author followed GPT-4's advice what happened? The leak was gone.

The problem was solved. If you thought that was impressive, wait till you see this. If it's allowed to ask enough questions, as you can see above, GPT-4 can build up a mental map of, say, a house that it is entering. On the left you can see a map of the true locations of each room and on the right you can see GPT-4's mental image of them.

That was revealed, by the way, by having it draw a pyplot plot. This ability is of course going to become very relevant when GPT-4 gets embodied, and I'm going to talk about that in my next video. Speaking of which, if you're learning anything from this video please don't forget to leave a like and let me know in the comments.
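Back to the map for a second: the side-by-side comparison could be reproduced with a pyplot sketch like the one below. The room names and coordinates here are made up, standing in for the paper's figure.

    import matplotlib.pyplot as plt

    # Made-up coordinates standing in for the paper's figure: true room
    # locations on the left, GPT-4's inferred locations on the right.
    true_rooms = {"Kitchen": (0, 0), "Lounge": (2, 0), "Bedroom": (2, 2)}
    gpt4_rooms = {"Kitchen": (0, 0), "Lounge": (2, 0.5), "Bedroom": (1.5, 2)}

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    for ax, rooms, title in [(axes[0], true_rooms, "True map"),
                             (axes[1], gpt4_rooms, "GPT-4's mental map")]:
        for name, (x, y) in rooms.items():
            ax.scatter(x, y)
            ax.annotate(name, (x, y))
        ax.set_title(title)
    plt.show()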

Next up is theory of mind and I have done a whole video on this so do check it out afterwards but essentially the authors discovered the same thing that we have which is to say that GPT-4 can build up a mental model of what other people are thinking. You can pause the video and read the scenario yourself.

It essentially involves knowing what Alice must be thinking, what she must believe about a situation even though the reality is different. Separating what is actually true from what a human being believes to be true. This is a key milestone on the road to possible consciousness, but if you're interested in that topic, honestly, check out my video on it.

Now I know at this point you're thinking I must have covered the best bits but no there's more. On page 80 the authors sketch out how GPT-4 is an auto-regressive model which means that it bases its outputs on what has already come before. That's great but it stops it from planning ahead.

It doesn't know how its output is going to end before it starts and I'm going to reveal the implications of this fascinating weakness in a couple of ways. First with their examples and then with one of my own making. In this task they try to get GPT-4 to create a poem which begins with a sentence and then ends with the same sentence in reverse order but it's got to make sense.

GPT-4 simply can't do it because it doesn't know how its poem is going to end before it starts. Remember, it's an auto-regressive model. After repeatedly and unsuccessfully testing GPT-4's ability to do this, the authors broke it down like this: GPT-4 is amazing at incremental tasks but not as good at discontinuous tasks.

Incremental tasks are those where you follow a standard procedure building things up step by step like composing a poem using a rhyme scheme or writing a summary of a text. Start at the beginning and then next sentence etc. But discontinuous tasks require you to know a bit about the output the end result before you start.

They give a great example of writing a joke. You kind of need to know the punch line before you do the setup. Maybe that's why GPT-4 is so bad at joke telling. It can't think of an amazing punch line and then work backwards to create the scenario around it.

I came up with a simple demonstration of this to show you guys. Try asking GPT-4 this question: how many words are in the full response to this prompt? If you think about it, it has to know the final result of its output to give a correct answer. But it's just generating an answer word by word, token by token.

It can't do this. It said that there are 43 words in the full response to this prompt including the words in the question and the answer. Okay that's kind of weird. I didn't want to include the question itself but let's see if it got it right. I said list them out and count them and then it went through including the prompt which I didn't want but fine.

How many words are in the full response etc etc. And lo and behold there were only 31 words in the prompt and the output. But remember it had said that there were 43 words. It doesn't know the end result when it starts. Before you conclude that this will be a permanent block on language models like GPT-4 progressing further ponder this.
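The failure makes more sense once you picture the decoding loop. Here's a schematic sketch, with the model call as a stand-in: each token depends only on the tokens before it, so the model has committed to "43" long before it knows where its answer will stop.

    def generate(model, prompt_tokens: list[int],
                 max_new_tokens: int = 50, eos_id: int = 0) -> list[int]:
        """Schematic autoregressive decoding. Each new token is chosen
        from the prefix alone, so early tokens (like a claimed word count)
        are fixed before the model knows how long the output will be."""
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = model(tokens)  # depends only on what came before
            tokens.append(next_token)
            if next_token == eos_id:    # the end is only decided here
                break
        return tokens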

A paper came out in January showing that it was at least theoretically possible to augment large language models with external memory, and the paper both asks and answers this question: such works raise the question of whether augmenting a language model with an external feedback loop is merely useful or fundamentally expands the range of computations that can be performed.

The paper gives an affirmative answer. Now obviously it's still a huge leap from here to there, but imagine if GPT-4, or say GPT-5, gets access to an external memory.
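To give a flavour of the feedback-loop idea, here is a toy sketch in which the model can read and write an external key-value memory between calls. This is my own illustration of the concept, not the January paper's actual construction, which is far more formal.

    def run_with_memory(model, task: str, max_steps: int = 20) -> str:
        """Toy external feedback loop: the model is re-prompted each step
        and can persist state in a memory that outlives its context."""
        memory: dict[str, str] = {}
        observation = task
        for _ in range(max_steps):
            reply = model(f"memory={memory}\nobservation={observation}\n"
                          "Reply: WRITE key=value | READ key | DONE answer")
            cmd, _, rest = reply.partition(" ")
            if cmd == "WRITE":
                key, _, value = rest.partition("=")
                memory[key] = value
                observation = "ok"
            elif cmd == "READ":
                observation = memory.get(rest, "")
            else:  # "DONE <answer>" (or any plain reply)
                return rest
        return "step limit reached"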

Then, as the authors note, you could have different layers of language models: one doing the fast thinking subroutines and another doing the slow thinking big picture, monitoring the output of the first model and adjusting from there. Arguably that would be the ultimate breakthrough. Possibly even a dangerous breakthrough. Speaking of dangerous, on page 84 the authors note that the unrestricted GPT-4 is incredible at propaganda and conspiracy theories.

It can design entire misinformation campaigns replete with links and images and I worry that it's only a matter of time before someone jailbreaks this kind of version of GPT-4 and uses it in the wild. Next and I think this is quite a stunning admission from researchers at Microsoft. They say that some people may ask for the ability and right to decide and specify which content they want or do not want to be crawled.

They're flagging this up in terms of privacy and potential lawsuits. The context they're giving is of models like GPT-4 taking away jobs and if they're taking away jobs from people whose content has been crawled I wouldn't be surprised if there's some contention there. Two final points from this bombshell paper.

The authors talk about equipping LLMs, large language models, with agency and intrinsic motivation, and say that this is a fascinating and important direction for future work. This is in the context of GPT-4 not being motivated by anything, just being passive. Well, I do think that that's a fascinating direction for future work, but it's also a very concerning one.

Giving a language model intrinsic motivation not only raises ethical concerns and questions, like when it would then have rights, but it also raises huge safety concerns. Of course, they do admit that with this direction of work great care would have to be taken on alignment and safety. I'm not personally too keen on the phrasing that giving it motivation is a fascinating and important direction, as if it's definitely something we should be working on.

This is especially true in the context of the final part of the paper. They admit that they don't really know what is actually happening. They know what GPT-4 is capable of but not really why it's capable of those things. Of course they propose hypotheses but they end with this: Overall, elucidating the nature and mechanisms of AI systems such as GPT-4 is a formidable challenge that has suddenly become important and urgent.

Translated, we need to figure out how these things work and fast. Well I definitely agree with that. Thank you so much for watching to the end. Let me know your thoughts in the comments and have a wonderful day.