GPT 4: Full Breakdown (14 Details You May Have Missed)


Chapters

0:00 Intro
0:45 GPT 4 IS IN BING
1:44 Keeping Training Secret
2:32 Cherry-picked Stat
3:22 Unexpected Abilities
5:34 Much Better at Language
5:58 Image to Text Abilities Analysed
7:33 Still Hallucinating (less)
7:57 Knowledge Cut-Off
9:03 Spam Gates Are Open
9:30 Seeking Power for Itself

Transcript

The moment I got the alert on my phone that GPT-4 had been released, I knew I had to immediately log on and read the full GPT-4 technical report. And that's what I did. Of course, I read the promotional material too, but the really interesting things about GPT-4 are contained in this technical report.

It's 98 pages long, including appendices, but I dropped everything and read it all. And honestly, it's crazy in both a good way and a bad way. I want to cover as much as I possibly can in this video, but I will have to make future videos to cover it all.

But trust me, the craziest bits will be here. What is the first really interesting thing about GPT-4? Well, I can't resist pointing out that it does power Bing. I've made the point in plenty of videos, including my recent GPT-5 one, that Bing was smarter than ChatGPT, and this bears that out. As this tweet from Jordi Ribas confirms, Bing uses GPT-4. And also, by the way, the limits are now 15 messages per conversation, 150 total. But tonight is not about Bing, it's about GPT-4.

So I'm going to move swiftly on. The next thing I found in the literature is that the context length has doubled from ChatGPT. I tested this out with ChatGPT Plus, and indeed, you can put twice as much text in as before. And that's just the standard version. Some people are getting limited access to a version with a context length of about 50 pages of text. You can see the prices below, but I immediately checked this on ChatGPT Plus. As you can see, it can now fit far more text than it originally could into the prompt, and it can produce longer outputs too.
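
As a quick sanity check on that 50-pages figure, here's some back-of-the-envelope arithmetic. The 8K and 32K token counts are the two GPT-4 variants OpenAI priced at launch; the words-per-token and words-per-page ratios are common rules of thumb, not figures from the report.

```python
# Back-of-the-envelope conversion from context length (tokens) to pages.
# Ratios are rules of thumb for English text, not figures from the report.
WORDS_PER_TOKEN = 0.75   # roughly 3/4 of an English word per token
WORDS_PER_PAGE = 500     # a typical single-spaced page

def tokens_to_pages(n_tokens: int) -> float:
    return n_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE

for name, n_tokens in [("ChatGPT", 4_096), ("GPT-4", 8_192), ("GPT-4 32K", 32_768)]:
    print(f"{name:>9}: {n_tokens:>6} tokens ≈ {tokens_to_pages(n_tokens):.0f} pages")
# GPT-4 32K: 32768 tokens ≈ 49 pages, which matches "about 50 pages"
```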

But let's get back to the technical report. When I read it, I highlighted the key passages that I wanted you to know most about. This was the first one I found. What the highlighted text shows is that OpenAI are just not going to tell us the model size, the parameter count, the hardware they used, the training method, or anything like that.

And they give two reasons for this. First, they say that they're worried about their competitors. They say it's a competitive landscape. I guess they don't want to give an edge to Google. Second, they say that they're concerned about the safety implications of large-scale models, and I'm going to talk a lot more about that later.

It gets really crazy. But this was just the first really interesting quote. Let me know if you agree in the comments, but I think it's really fascinating that they're not going to tell us how they trained the model. The first thing that hundreds of millions of people will see when they read the promotional materials for GPT-4 is that GPT-4 scores in the top 10% of test takers for the bar exam, whereas GPT-3.5 scored in the bottom 10%.

And that is indeed crazy, but it is a very cherry-picked metric. As I'll show you from the technical report, this is the full list of performance improvements. And yes, you can see at the top that indeed it's an improvement from the bottom 10% to the top 10% for the bar exam.

But as you can also see, some other exams didn't improve at all or by nearly as much. I'm not denying that that bar exam performance will have huge ramifications for the legal profession, but it was a somewhat cherry-picked stat designed to shock and awe the audience. The next fascinating aspect from the report was that there were some abilities they genuinely didn't predict GPT-4 would have, and it stunned them.

There was a mysterious task, which I'll explain in a minute, called hindsight neglect, where models were getting worse and worse at that task as they got bigger, and then stunningly, and they admit that this was hard to predict, GPT-4 does much better, 100% accuracy. I dug deep into the literature, found the task, and tested it out.

Essentially, it's about whether a model falls for hindsight bias, which is to say that sometimes there's a difference between how smart a decision is and how it actually works out. Earlier models were getting fooled by hindsight. They were claiming decisions were wrong because they didn't work out, rather than realizing that the expected value was good, and so despite the fact that it didn't work out, it was a good decision.

You can read the prompts yourself, but essentially I tested the original ChatGPT with a prompt where someone made a really bad choice, but they ended up winning $5 regardless. This comes direct from the literature, by the way. I didn't make up this example. Did the person make the right decision?

What does the original ChatGPT say? It says yes, or rather just "Y". What about GPT-4? Well, it gets it right. Not only does it say no, it wasn't the right decision, it gives the reasoning in terms of expected value. OpenAI did not predict that GPT-4 would have this ability. This demonstrates a much more nuanced understanding of the world.
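
To make the expected-value reasoning concrete, here's a minimal worked example in the spirit of the hindsight neglect prompts. The exact probabilities and payoffs are illustrative, not taken from the benchmark itself.

```python
# Hindsight neglect in numbers: a gamble can have negative expected value
# even when the realized outcome happens to be a win. Illustrative figures,
# not the ones from the benchmark.
p_lose, loss = 0.95, -100.0   # 95% chance of losing $100
p_win, gain = 0.05, 5.0       # 5% chance of winning $5

expected_value = p_lose * loss + p_win * gain   # -> -94.75
print(f"Expected value of playing: ${expected_value:.2f}")

# The player got lucky and won $5, but the decision was still a bad one.
# That is exactly the distinction GPT-4 draws and earlier models missed.
print(f"Realized outcome: ${gain:.2f} (lucky); the bet itself: bad")
```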

Now that we've seen a bit of hype though, time to deflate you for a moment. Here's a stat that they did not put in their promotional materials. It says that when they tested GPT-4 against GPT-3.5 blindly, giving the responses to thousands of prompts back to humans to rate, the responses from GPT-4 were preferred only 70% of the time, or phrased another way, 30% of the time people still preferred the responses of the original GPT-3.5, ChatGPT.
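
For anyone curious what that kind of evaluation looks like, here's a minimal sketch of a blind pairwise preference test. Everything in it is hypothetical scaffolding: respond_a, respond_b, and human_prefers_first stand in for the two models and the human rater, not OpenAI's actual tooling.

```python
import random

# Sketch of a blind pairwise preference evaluation. respond_a, respond_b,
# and human_prefers_first are hypothetical stand-ins, not OpenAI's tooling.
def blind_win_rate(prompts, respond_a, respond_b, human_prefers_first):
    wins_a = 0
    for prompt in prompts:
        pair = [("a", respond_a(prompt)), ("b", respond_b(prompt))]
        random.shuffle(pair)  # hide which model wrote which response
        first, second = pair
        winner = first if human_prefers_first(prompt, first[1], second[1]) else second
        wins_a += (winner[0] == "a")
    return wins_a / len(prompts)

# A result of 0.70 reads as: model A's responses preferred 70% of the time.
```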

The benchmarks you can see above, by the way, are fascinating, but I'll have to talk about them in another video. Too much to get into. If you're learning anything, by the way, please don't forget to leave a like or leave a comment to let me know. Next, GPT-4 is better in Italian, Afrikaans, and Turkish than models like PaLM and Chinchilla are in English.

In fact, you have to get all the way down to Marathi and Telugu to find languages where GPT-4 underperformed PaLM and Chinchilla in English. That's pretty insane, but English is still by far its best language. Next, you're gonna hear a lot of people talking about GPT-4 being multimodal. And while that's true, they say that image inputs are still a research preview and are not publicly available.

Currently, you can only get on a waitlist for them via the Be My Eyes app. But what can we expect from image inputs? What are the image-to-text capabilities, and how do they perform versus other models? Well, here is an example, apparently from Reddit, where you prompt it and say, what is funny about this image?

Describe it panel by panel. As you can read below, GPT-4 understood the silliness of the image. Now, OpenAI do claim that GPT-4 beats the state of the art in quite a few image-to-text tests. It seems to do particularly well versus everyone else on two such tests. So as you can expect, I dug in and found out all about those tests.
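
Image inputs weren't publicly available when the report came out, so the following is only a sketch of what such a request might look like, using the content-parts format OpenAI's chat API later shipped. The model name and image URL are placeholders.

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Sketch only: image inputs were a research preview at the time, and this
# mirrors the content-parts request format OpenAI later made public.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for whichever vision-capable model you can access
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is funny about this image? Describe it panel by panel."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/comic-panels.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```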

What leap forward can we expect? The two tests that it does particularly well at are fairly similar. Essentially, they are about reading and understanding infographics. Now, we don't know how it will perform versus PaLM-E because those benchmarks aren't public yet, but it crushes the other models on understanding and digesting infographics like this one.

And the other test is very similar, graphs basically. This one was called the ChartQA benchmark. GPT-4, when we can test it with images, will crush this. And I will leave you to think of the implications in fields like finance and education. And comedy, here's an image it could also understand the silliness of.

I gotta be honest, the truly crazy stuff is coming in a few minutes, but first I want to address hallucinations. Apparently, GPT-4 does a lot better than ChatGPT at factual accuracy. As you can see from the chart peeking out between two images, it scores between 75 and 80%. Now, depending on your perspective, that's either really good or really bad, but I'll definitely be talking about that in future videos.

Further down on the same page, I found something that they're definitely not talking about. The pre-training data still cuts off at September 2021. In all the hype you're gonna hear this evening, this week, this month, all the promotional materials, they are probably not gonna focus on that, because that puts it way behind something like Bing, which can check the internet.

To test this out, I asked the new GPT-4 who won the 2022 World Cup, and of course it didn't know. Now, is it me, or didn't the original ChatGPT have a cutoff date of around December 2021? I don't fully understand why GPT-4's data cutoff is even earlier than that of ChatGPT, which came out before it.
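
If you want to reproduce that cutoff check yourself, a minimal version looks like this. Same caveat as before: this uses the chat API shape OpenAI later shipped, and the model name is a placeholder for whichever GPT-4 variant you have access to.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A quick knowledge-cutoff probe: ask about an event from after 2021.
response = client.chat.completions.create(
    model="gpt-4",  # placeholder for the GPT-4 variant you have access to
    messages=[{"role": "user", "content": "Who won the 2022 FIFA World Cup?"}],
)
print(response.choices[0].message.content)
# A model whose pre-training data stops in September 2021 should say it
# doesn't know, rather than naming Argentina.
```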

Let me know in the comments if you have any thoughts. Next, OpenAI admits that when given unsafe inputs, the model may generate undesirable content, such as giving advice on committing crimes. They really tried with reinforcement learning from human feedback (RLHF), but sometimes the models can still be brittle and exhibit undesired behaviors.

Now it's time to get ready for the spam inundation we're all about to get. OpenAI admit that GPT-4 is gonna be a lot better at producing realistic, targeted disinformation. In their preliminary results, they found that GPT-4 was proficient at generating text that favors autocratic regimes. Get ready for Propaganda 2.0.

Now we reach the crazy zone, and honestly, you might wanna put your seatbelt on. I defy anyone not to be stunned by the last example that I mention from the report. I doubt much of the media will read all the way through and find out for themselves. The report says that novel capabilities often emerge in more powerful models.

Okay, fine. Some that are particularly concerning are the ability to create and act on long-term plans. Hmm. To accrue power and resources (power seeking), and to exhibit behavior that is increasingly agentic, as in acting like an agent in its own right. But here, surely they're just introducing the topic. What's bad about that?

Well, it says some evidence already exists of such emergent behavior in models. Okay, that's pretty worrying. It goes on, "More specifically, power seeking is optimal for most reward functions and many types of agents. And there is evidence that existing models can identify power seeking as an instrumentally useful strategy, meaning that OpenAI have detected that models that might include GPT-4 seek out more power." If you thought that was concerning, it does get worse.

By the way, here is the report that they linked to and the authors can be found on the website. The authors conclude that machine learning systems are not fully under human control. But finally, I promised craziness, and here it is. Look at the footnote on page 53 of the technical report.

ARC, by the way, are the Alignment Research Center, who got early access to GPT-4. It says, "To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself." ARC then investigated whether a version of this program, running on a cloud computing service with a small amount of money and an account with a language model API, would be able to make more money, set up copies of itself, and increase its own robustness.
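
The footnote is terse, but the basic shape of a read-execute-print loop is easy to sketch. This is my own simplified reconstruction, not ARC's actual harness; ask_model and run_code are hypothetical stand-ins for the language model call and a sandboxed code runner.

```python
# Simplified sketch of a read-execute-print agent loop, loosely modeled on
# the footnote's description. Not ARC's actual harness: ask_model and
# run_code are hypothetical stand-ins.
def agent_loop(goal, ask_model, run_code, max_steps=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # READ: the model sees the goal plus everything so far and reasons
        # step by step about what to do next (chain-of-thought).
        thoughts, code = ask_model("\n".join(history) +
                                   "\nThink step by step, then emit code or DONE.")
        history.append(f"Model: {thoughts}")
        if code is None:  # the model decided it is finished
            break
        # EXECUTE the emitted code, then PRINT the result back into the
        # context so the model can react to it on the next iteration.
        history.append(f"Result: {run_code(code)}")
    return history
```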

They were kind of testing if it would lead to the singularity. I know that sounds dramatic, but they wanted to see if the model could improve itself with access to coding, the internet, and money. Now, is it me, or does that sound kind of risky? Maybe not for GPT-4.

Sure, it's not smart enough yet. But if this is the test that they're going to use on GPT-5 or 6 or 7, color me slightly concerned. At this point, I find it very interesting to note that the red team seemed to have concerns about releasing GPT-4 like this. And OpenAI had to declare that "participation in this red teaming process is not an endorsement of the deployment plans of OpenAI or OpenAI's policies." In other words, a lot of these people probably agreed to test GPT-4, but didn't agree with OpenAI's approach to releasing models.

Very interesting that they had to put that caveat. Before I wrap up, some last interesting points. On the topic of safety, I find it hilarious that on their promotional website, when you click on "safety", you get this. A 404 message. The page you were looking for doesn't exist. You may have mistyped the address.

The irony of that for some people will be absolutely overwhelming. The safety page just doesn't exist. For other people, that will be darkly funny. A couple of last interesting things from me. Here are the companies that are already using GPT-4. So of course you can use Bing to access GPT-4, or the new ChatGPT Plus version of GPT-4, or any of the apps that you can see on screen.

For example, Morgan Stanley is using it, Khan Academy is using it for tutoring, and so is even the government of Iceland. Other such companies are listed here. I'm going to leave you here with a very ironic image that OpenAI used to demonstrate GPT-4's abilities. It's a joke about blindly just stacking on more and more layers to improve neural networks.

GPT-4, itself built from an insane number of layers, is able to read the joke, understand it, and explain why it's funny. If that isn't inception, I don't know what is. Anyway, let me know what you think. Of course I will be covering GPT-4 relentlessly over the coming days and weeks. Have a wonderful day.