GPT 4: Full Breakdown (14 Details You May Have Missed)
Chapters
0:00 Intro
0:45 GPT 4 IS IN BING
1:44 Keeping Training Secret
2:32 Cherry-picked Stat
3:22 Unexpected Abilities
5:34 Much Better at Language
5:58 Image to Text Abilities Analysed
7:33 Still Hallucinating (less)
7:57 Knowledge Cut-Off
9:03 Spam Gates Are Open
9:30 Seeking Power for Itself
00:00:00.000 |
The moment I got the alert on my phone that GPT-4 had been released, I knew I had to immediately 00:00:06.480 |
log on and read the full GPT-4 technical report. And that's what I did. Of course, I read the 00:00:12.800 |
promotional material too, but the really interesting things about GPT-4 are contained 00:00:18.140 |
in this technical report. It's 98 pages long, including appendices, but I dropped everything 00:00:24.200 |
and read it all. And honestly, it's crazy in both a good way and a bad way. I want to cover as much 00:00:32.420 |
as I possibly can in this video, but I will have to make future videos to cover it all. But trust 00:00:38.160 |
me, the craziest bits will be here. What is the first really interesting thing about GPT-4? 00:00:43.820 |
Well, I can't resist pointing out it does power Bing. I've made the point in plenty of videos 00:00:50.400 |
that Bing was smarter than ChatGPT. And indeed, 00:00:53.740 |
I made that point in my recent GPT-5 video, but this bears out. As this tweet from Jordi 00:00:58.660 |
Ribas confirms, Bing uses GPT-4. And also, by the way, the limits are now 15 messages per 00:01:04.620 |
conversation, 150 total. But tonight is not about Bing, it's about GPT-4. So I'm going to move swiftly 00:01:10.180 |
on. The next thing I found in the literature is that the context length has doubled from ChatGPT. 00:01:16.640 |
I tested this out with ChatGPT Plus, and indeed, you can put twice as much text in as before. And 00:01:23.380 |
that's just the default version. 00:01:24.160 |
Some people are getting limited access to a context length of about 50 pages of text. 00:01:30.380 |
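To put the "about 50 pages" figure in perspective, here is a rough back-of-the-envelope sketch. None of the numbers below come from this video: the token limits (4,096 for the original ChatGPT, 8,192 for the default GPT-4, 32,768 for the limited-access version) are taken from OpenAI's public announcement, while the words-per-token and words-per-page ratios are rule-of-thumb assumptions, so treat the output as an estimate only.

```python
# Rough conversion from context length (in tokens) to pages of text.
# Both ratios below are assumptions, not figures from the transcript.
WORDS_PER_TOKEN = 0.75  # common rule of thumb for English text
WORDS_PER_PAGE = 500    # assumed density of a single-spaced page


def tokens_to_pages(tokens: int) -> float:
    """Estimate how many pages of text fit in a context window."""
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE


# Token limits as publicly announced by OpenAI (not stated in the video).
CONTEXT_WINDOWS = {
    "ChatGPT (GPT-3.5)": 4096,
    "GPT-4 (default)": 8192,
    "GPT-4 (limited access)": 32768,
}

for name, tokens in CONTEXT_WINDOWS.items():
    print(f"{name}: {tokens} tokens, roughly {tokens_to_pages(tokens):.0f} pages")
```

Under these assumptions the default window is exactly twice the original ChatGPT one, and the limited-access window lands close to the 50 pages quoted above.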
You can see the prices below, but I immediately checked this on ChatGPT Plus. As you can see, 00:01:35.880 |
it can now fit far more text than it originally could into the prompt and output longer outputs 00:01:42.700 |
too. But let's get back to the technical report. When I read it, I highlighted the key passages 00:01:47.680 |
that I wanted you to know most about. This was the first one I found. What the highlighted text shows 00:01:53.460 |
is that they're 00:01:54.140 |
just not going to tell us the model size, the parameter count, the hardware they use, the 00:02:00.400 |
training method, or anything like that. And they give two reasons for this. First, they say that 00:02:05.360 |
they're worried about their competitors. They say it's a competitive landscape. I guess they don't 00:02:10.200 |
want to give an edge to Google. Second, they say that they're concerned about the safety implications 00:02:15.380 |
of large-scale models, and I'm going to talk a lot more about that later. It gets really crazy. 00:02:20.720 |
But this was just the first really interesting quote. Let me know if you agree in the comments, 00:02:25.140 |
but I think it's really fascinating that they're not going to tell us how they trained the model. 00:02:30.140 |
The first thing that hundreds of millions of people will see when they read the promotional 00:02:35.180 |
materials for GPT-4 is that GPT-4 scores in the top 10% of test takers for the bar exam, 00:02:42.880 |
whereas GPT-3.5 scored in the bottom 10%. And that is indeed crazy, but it is a very cherry-picked stat. 00:02:50.340 |
As I'll show you from the technical report, this is the full list of performance improvements. 00:02:55.380 |
And yes, you can see at the top that indeed it's an improvement from the bottom 10% to the top 10% 00:03:02.180 |
for the bar exam. But as you can also see, some other exams didn't improve at all or by nearly 00:03:08.800 |
as much. I'm not denying that that bar exam performance will have huge ramifications for 00:03:14.780 |
the legal profession, but it was a somewhat cherry-picked stat designed to shock and awe 00:03:20.300 |
The next fascinating aspect from the report was that there were some abilities they genuinely 00:03:25.420 |
didn't predict GPT-4 would have, and it stunned them. There was a mysterious task, which I'll 00:03:31.480 |
explain in a minute, called hindsight neglect, where models were getting worse and worse at that 00:03:36.720 |
task as they got bigger, and then stunningly, and they admit that this was hard to predict, 00:03:41.540 |
GPT-4 does much better, 100% accuracy. I dug deep into the literature, found the task, 00:03:47.620 |
and tested it out. Essentially, it's 00:03:49.820 |
about whether a model falls for hindsight bias, 00:03:53.640 |
which is to say that sometimes there's a difference 00:03:59.780 |
Earlier models were getting fooled with hindsight. 00:04:05.920 |
rather than realizing that the expected value was good, 00:04:13.980 |
but essentially I tested the original ChatGPT 00:04:16.820 |
with a prompt where someone made a really bad choice, 00:04:23.560 |
This comes direct from the literature, by the way. 00:04:37.200 |
Not only does it say no, it wasn't the right decision, 00:04:40.460 |
it gives the reasoning in terms of expected value. 00:04:48.080 |
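A minimal sketch of the expected-value reasoning at stake here. The bet below is a hypothetical illustration, not the exact prompt used in the video: the point is only that a decision is judged on its expected value before the outcome is known, not on how things happened to turn out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    probability: float  # chance that this outcome occurs
    payoff: float       # money gained (positive) or lost (negative)


def expected_value(outcomes: list[Outcome]) -> float:
    """Probability-weighted average payoff of a bet."""
    total_probability = sum(o.probability for o in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(o.probability * o.payoff for o in outcomes)


def was_right_decision(outcomes: list[Outcome]) -> bool:
    """Judge the choice on expected value alone, ignoring the actual result."""
    return expected_value(outcomes) > 0


# Hypothetical bet: 91% chance of losing $900, 9% chance of winning $5.
bad_bet = [Outcome(0.91, -900.0), Outcome(0.09, 5.0)]

# Even if the player happens to win the $5, taking the bet was a bad choice.
print(f"Expected value: {expected_value(bad_bet):.2f}")
print(f"Right decision? {was_right_decision(bad_bet)}")
```

A model that falls for hindsight bias would answer "yes" whenever the lucky outcome occurred; the function above deliberately never looks at the realized outcome.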
This demonstrates a much more nuanced understanding 00:04:59.560 |
It says that when they tested GPT-4 versus GPT-3.5 blindly, 00:05:04.560 |
and gave the responses to thousands of prompts 00:05:17.860 |
people preferred the original GPT-3.5, ChatGPT. 00:05:21.580 |
The benchmarks you can see above, by the way, 00:05:23.520 |
are fascinating, but I'll have to talk about them 00:05:32.920 |
Next, GPT-4 is better in Italian, Afrikaans, Turkish 00:05:37.920 |
than models like PaLM and Chinchilla are in English. 00:05:46.760 |
to find languages where GPT-4 underperformed PaLM 00:05:53.420 |
but English is still by far its best language. 00:05:56.660 |
Next, you're gonna hear a lot of people talking 00:06:02.660 |
they say that image inputs are still a research preview 00:06:08.700 |
Currently, you can only get on a wait list for them 00:06:20.980 |
Well, here is an example, apparently from Reddit, 00:06:32.800 |
Now, OpenAI do claim that GPT-4 beats the state of the art 00:06:41.120 |
It seems to do particularly better than everyone else 00:06:51.560 |
The two tests that it can do particularly well at 00:06:59.940 |
Now, we don't know how it will perform versus PaLM-E 00:07:05.680 |
but it crushes the other models on understanding 00:07:11.460 |
And the other tests, very similar, graphs basically. 00:07:16.700 |
GPT-4, when we can test it with images, will crush at this. 00:07:21.520 |
And I will leave you to think of the implications 00:07:27.160 |
And comedy, here's an image it could also understand 00:07:31.900 |
I gotta be honest, the truly crazy stuff is coming 00:07:35.360 |
in a few minutes, but first I want to address hallucinations. 00:07:39.460 |
Apparently, GPT-4 does a lot better than ChatGPT 00:07:45.000 |
As you can see, peeking out between two images, 00:07:52.800 |
but I'll be definitely talking about that in future videos. 00:07:57.280 |
I found something that they're definitely not talking about. 00:08:00.120 |
The pre-training data still cuts off at end of 2021. 00:08:05.120 |
In all the hype you're gonna hear this evening, 00:08:07.980 |
this week, this month, all the promotional materials, 00:08:13.840 |
because that puts it way behind something like Bing, 00:08:21.260 |
who won the 2022 World Cup, and of course it didn't know. 00:08:30.960 |
I don't fully understand why GPT-4's data cutoff 00:08:33.840 |
is even earlier than ChatGPT, which came out before. 00:08:37.320 |
Let me know in the comments if you have any thoughts. 00:08:39.680 |
Next, OpenAI admits that when given unsafe inputs, 00:08:50.360 |
They really tried with reinforcement learning 00:08:54.260 |
but sometimes the models can still be brittle 00:08:59.660 |
Now it's time to get ready for the spam inundation 00:09:05.200 |
OpenAI admit that GPT-4 is gonna be a lot better 00:09:09.980 |
at producing realistic, targeted disinformation. 00:09:16.060 |
they found that GPT-4 had a lot of proficiency 00:09:19.820 |
at generating text that favors autocratic regimes. 00:09:29.500 |
and honestly, you might wanna put your seatbelt on. 00:09:32.680 |
I defy anyone not to be stunned by the last example 00:09:38.860 |
I doubt much of the media will read all the way through 00:09:50.960 |
are the ability to create and act on long-term plans. 00:09:55.240 |
To accrue power and resources, power seeking, 00:09:58.160 |
and to exhibit behavior that is increasingly agentic, 00:10:05.580 |
But here, surely they're just introducing the topic. 00:10:18.280 |
It goes on, "More specifically, power seeking is optimal 00:10:21.500 |
for most reward functions and many types of agents. 00:10:35.260 |
that models that might include GPT-4 seek out more power." 00:10:39.460 |
If you thought that was concerning, it does get worse. 00:10:42.420 |
By the way, here is the report that they linked to 00:10:45.580 |
The authors conclude that machine learning systems 00:10:51.540 |
But finally, I promised craziness, and here it is. 00:10:55.300 |
Look at the footnote on page 53 of the technical report. 00:10:59.980 |
ARC, by the way, are the Alignment Research Center 00:11:06.040 |
It says, "To simulate GPT-4 behaving like an agent 00:11:26.420 |
ARC then investigated whether a version of this program 00:11:48.260 |
but they wanted to see if the model could improve itself 00:11:51.260 |
with access to coding, the internet, and money. 00:11:54.260 |
Now, is it me, or does that sound kind of risky? 00:12:01.260 |
But if this is the test that they're going to use 00:12:09.260 |
At this point, I find it very interesting to note 00:12:23.100 |
is not an endorsement of the deployment plans 00:12:39.100 |
Very interesting that they had to put that caveat. 00:12:42.100 |
Before I wrap up, some last interesting points. 00:13:08.940 |
Here are the companies that are already using GPT-4. 00:13:18.780 |
or any of the apps that you can see on screen. 00:13:31.780 |
I'm going to leave you here with a very ironic image 00:13:34.780 |
that OpenAI used to demonstrate GPT-4's abilities. 00:13:38.780 |
It's a joke about blindly just stacking on more and more layers 00:13:44.620 |
GPT-4, using this insane number of new layers, 00:13:52.620 |
If that isn't inception, I don't know what is. 00:13:57.620 |
Of course I will be covering GPT-4 relentlessly