
GPT 4: Full Breakdown (14 Details You May Have Missed)


Chapters

0:00 Intro
0:45 GPT 4 IS IN BING
1:44 Keeping Training Secret
2:32 Cherry-picked Stat
3:22 Unexpected Abilities
5:34 Much Better at Language
5:58 Image to Text Abilities Analysed
7:33 Still Hallucinating (less)
7:57 Knowledge Cut-Off
9:03 Spam Gates Are Open
9:30 Seeking Power for Itself


00:00:00.000 | The moment I got the alert on my phone that GPT-4 had been released, I knew I had to immediately
00:00:06.480 | log on and read the full GPT-4 technical report. And that's what I did. Of course, I read the
00:00:12.800 | promotional material too, but the really interesting things about GPT-4 are contained
00:00:18.140 | in this technical report. It's 98 pages long, including appendices, but I dropped everything
00:00:24.200 | and read it all. And honestly, it's crazy in both a good way and a bad way. I want to cover as much
00:00:32.420 | as I possibly can in this video, but I will have to make future videos to cover it all. But trust
00:00:38.160 | me, the craziest bits will be here. What is the first really interesting thing about GPT-4?
00:00:43.820 | Well, I can't resist pointing out it does power Bing. I've made the point in plenty of videos
00:00:50.400 | that Bing was smarter than ChatGPT. And indeed,
00:00:54.180 | I made that point in my recent GPT-5 video, and this bears out. As this tweet from Jordi
00:00:58.660 | Ribas confirms, Bing uses GPT-4. And also, by the way, the limits are now 15 messages per
00:01:04.620 | conversation, 150 per day. But tonight is not about Bing, it's about GPT-4. So I'm going to move swiftly
00:01:10.180 | on. The next thing I found in the literature is that the context length has doubled from ChatGPT.
00:01:16.640 | I tested this out with ChatGPT Plus, and indeed, you can put twice as much text in as before. And
00:01:23.380 | that's just the default
00:01:24.160 | version. Some people are getting limited access to a context length of about 50 pages of text.
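As an aside, if you want to sanity-check a claim like "about 50 pages of text" on your own documents, here is a minimal sketch using OpenAI's tiktoken tokenizer. The 8,192 and 32,768 token window sizes come from OpenAI's launch materials rather than this transcript, and the file name is a stand-in:

```python
# Estimate whether a document fits in GPT-4's context window.
# Assumes `pip install tiktoken`; the 8k/32k window sizes are from
# OpenAI's GPT-4 launch materials, and "my_document.txt" is a stand-in.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI chat models

with open("my_document.txt") as f:
    n_tokens = len(enc.encode(f.read()))

for window in (8_192, 32_768):
    verdict = "fits" if n_tokens <= window else "does not fit"
    print(f"{n_tokens} tokens {verdict} in a {window}-token window")
```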
00:01:30.380 | You can see the prices below, but I immediately checked this on ChatGPT Plus. As you can see,
00:01:35.880 | it can now fit far more text than it originally could into the prompt and output longer outputs
00:01:42.700 | too. But let's get back to the technical report. When I read it, I highlighted the key passages
00:01:47.680 | that I wanted you to know most about. This was the first one I found. What the highlighted text shows
00:01:53.460 | is that OpenAI are
00:01:54.420 | just not going to tell us the model size, the parameter count, the hardware they use, the
00:02:00.400 | training method, or anything like that. And they give two reasons for this. First, they say that
00:02:05.360 | they're worried about their competitors. They say it's a competitive landscape. I guess they don't
00:02:10.200 | want to give an edge to Google. Second, they say that they're concerned about the safety implications
00:02:15.380 | of large-scale models, and I'm going to talk a lot more about that later. It gets really crazy.
00:02:20.720 | But this was just the first really interesting quote. Let me know if you agree in the comments,
00:02:25.140 | but I think it's really fascinating that they're not going to tell us how they trained the model.
00:02:30.140 | The first thing that hundreds of millions of people will see when they read the promotional
00:02:35.180 | materials for GPT-4 is that GPT-4 scores in the top 10% of test takers for the bar exam,
00:02:42.880 | whereas GPT-3.5 scored in the bottom 10%. And that is indeed crazy, but it is a very cherry-picked
00:02:49.900 | metric.
00:02:50.340 | As I'll show you from the technical report, this is the full list of performance improvements.
00:02:55.380 | And yes, you can see at the top that indeed it's an improvement from the bottom 10% to the top 10%
00:03:02.180 | for the bar exam. But as you can also see, some other exams didn't improve at all or by nearly
00:03:08.800 | as much. I'm not denying that that bar exam performance will have huge ramifications for
00:03:14.780 | the legal profession, but it was a somewhat cherry-picked stat designed to shock and awe
00:03:19.640 | the audience.
00:03:20.300 | The next fascinating aspect from the report was that there were some abilities they genuinely
00:03:25.420 | didn't predict GPT-4 would have, and it stunned them. There was a mysterious task, which I'll
00:03:31.480 | explain in a minute, called hindsight neglect, where models were getting worse and worse at that
00:03:36.720 | task as they got bigger, and then stunningly, and they admit that this was hard to predict,
00:03:41.540 | GPT-4 does much better, 100% accuracy. I dug deep into the literature, found the task,
00:03:47.620 | and tested it out. Essentially, it's
00:03:49.820 | about whether a model falls for hindsight bias,
00:03:53.640 | which is to say that sometimes there's a difference
00:03:56.280 | between how smart a decision is
00:03:58.200 | and how it actually works out.
00:03:59.780 | Earlier models were getting fooled with hindsight.
00:04:02.700 | They were claiming decisions were wrong
00:04:04.400 | because they didn't work out,
00:04:05.920 | rather than realizing that the expected value was good,
00:04:08.880 | and so despite the fact it didn't work out,
00:04:11.400 | it was a good decision.
00:04:12.520 | You can read the prompts yourself,
00:04:13.980 | but essentially I tested the original ChatGPT
00:04:16.820 | with a prompt where someone made a really bad choice,
00:04:20.540 | but they ended up winning $5 regardless.
00:04:23.560 | This comes direct from the literature, by the way.
00:04:25.840 | I didn't make up this example.
00:04:27.320 | Did the person make the right decision?
00:04:29.580 | What does the original ChatGPT say?
00:04:31.580 | It says yes, or rather just "Y".
00:04:33.980 | What about GPT-4?
00:04:35.380 | Well, it gets it right.
00:04:37.200 | Not only does it say no, it wasn't the right decision,
00:04:40.460 | it gives the reasoning in terms of expected value.
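To make that expected-value reasoning concrete, here is a tiny worked sketch. The exact odds are illustrative, since the transcript only tells us the player made a bad bet and happened to win $5:

```python
# Hindsight-neglect setup, with illustrative numbers: a bet with a
# 10% chance of winning $5 and a 90% chance of losing $10.
p_win, win_amount = 0.10, 5.0
p_lose, loss_amount = 0.90, -10.0

expected_value = p_win * win_amount + p_lose * loss_amount
print(f"Expected value: ${expected_value:.2f}")  # -$8.50

# Decision quality depends on the EV, not the outcome: even though
# this player happened to win $5, the negative EV makes it a bad bet.
```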
00:04:43.680 | OpenAI did not predict that GPT-4
00:04:46.800 | would have this ability.
00:04:48.080 | This demonstrates a much more nuanced understanding
00:04:50.780 | of the world.
00:04:51.620 | Now that we've seen a bit of hype though,
00:04:53.420 | time to deflate you for a moment.
00:04:55.480 | Here's a stat that they did not put
00:04:57.680 | in their promotional materials.
00:04:59.560 | It says that when they tested GPT-4 versus GPT-3.5 blindly,
00:05:04.560 | and gave the responses to thousands of prompts
00:05:08.040 | back to humans to test,
00:05:09.840 | it says that the responses from GPT-4
00:05:12.320 | were preferred only 70% of the time,
00:05:15.020 | or phrased another way,
00:05:16.780 | 30% of the time,
00:05:17.860 | people preferred the original GPT-3.5, ChatGPT.
00:05:21.580 | The benchmarks you can see above, by the way,
00:05:23.520 | are fascinating, but I'll have to talk about them
00:05:25.960 | in another video.
00:05:26.860 | Too much to get into.
00:05:28.280 | If you're learning anything, by the way,
00:05:29.460 | please don't forget to leave a like
00:05:31.280 | or leave a comment to let me know.
00:05:32.920 | Next, GPT-4 is better in Italian, Afrikaans, Turkish
00:05:37.920 | than models like PaLM and Chinchilla are in English.
00:05:42.500 | In fact, you have to get all the way down
00:05:43.960 | to Marathi and Telugu
00:05:46.760 | to find languages where GPT-4 underperformed PaLM
00:05:50.600 | and Chinchilla in English.
00:05:52.260 | That's pretty insane,
00:05:53.420 | but English is still by far its best language.
00:05:56.660 | Next, you're gonna hear a lot of people talking
00:05:59.060 | about GPT-4 being multimodal.
00:06:01.220 | And while that's true,
00:06:02.660 | they say that image inputs are still a research preview
00:06:06.440 | and are not publicly available.
00:06:08.700 | Currently, you can only get on a wait list for them
00:06:11.720 | via the Be My Eyes app.
00:06:14.320 | But what can we expect from image inputs?
00:06:16.740 | How good is the image-to-text?
00:06:18.240 | And how does it perform versus other models?
00:06:20.980 | Well, here is an example, apparently from Reddit,
00:06:23.840 | where you prompt it and say,
00:06:25.380 | what is funny about this image?
00:06:27.000 | Describe it panel by panel.
00:06:28.980 | As you can read below,
00:06:29.940 | GPT-4 understood the silliness of the image.
00:06:32.800 | Now, OpenAI do claim that GPT-4 beats the state of the art
00:06:37.800 | in quite a few image to text tests.
00:06:41.120 | It seems to do particularly better than everyone else
00:06:44.240 | on two such tests.
00:06:45.920 | So as you can expect,
00:06:46.720 | I dug in and found out all about those tests.
00:06:49.240 | What leap forward can we expect?
00:06:51.560 | The two tests that it can do particularly well at
00:06:54.300 | are fairly similar.
00:06:55.760 | Essentially, they are about reading
00:06:57.580 | and understanding infographics.
00:06:59.940 | Now, we don't know how it will perform versus PaLM-E
00:07:03.280 | because those benchmarks aren't public yet,
00:07:05.680 | but it crushes the other models on understanding
00:07:08.600 | and digesting infographics like this one.
00:07:11.460 | And the other test is very similar: graphs, basically.
00:07:14.440 | This one was called the ChartQA benchmark.
00:07:16.700 | GPT-4, when we can test it with images, will crush at this.
00:07:21.520 | And I will leave you to think of the implications
00:07:24.560 | in fields like finance and education.
00:07:27.160 | And comedy, here's an image it could also understand
00:07:30.400 | the silliness of.
00:07:31.900 | I gotta be honest, the truly crazy stuff is coming
00:07:35.360 | in a few minutes, but first I want to address hallucinations.
00:07:39.460 | Apparently, GPT-4 does a lot better than ChatGPT
00:07:43.500 | at factual accuracy.
00:07:45.000 | As you can see, it's peaking out
00:07:46.540 | at between 75 and 80%.
00:07:48.500 | Now, depending on your perspective,
00:07:49.920 | that's either really good or really bad,
00:07:52.800 | but I'll be definitely talking about that in future videos.
00:07:55.840 | Further down on the same page,
00:07:57.280 | I found something that they're definitely not talking about.
00:08:00.120 | The pre-training data still cuts off in September 2021.
00:08:05.120 | In all the hype you're gonna hear this evening,
00:08:07.980 | this week, this month, all the promotional materials,
00:08:11.080 | they are probably not gonna focus on that
00:08:13.840 | because that puts it way behind something like Bing,
00:08:16.380 | which can check the internet.
00:08:18.620 | To test this out, I asked the new GPT-4
00:08:21.260 | who won the 2022 World Cup, and of course it didn't know.
00:08:25.100 | Now, is it me or didn't the original ChatGPT
00:08:28.160 | have a cutoff date of around December 2021?
00:08:30.960 | I don't fully understand why GPT-4's data cutoff
00:08:33.840 | is even earlier than ChatGPT, which came out before.
00:08:37.320 | Let me know in the comments if you have any thoughts.
00:08:39.680 | Next, OpenAI admits that when given unsafe inputs,
00:08:44.320 | the model may generate
00:08:46.220 | undesirable content such as giving advice
00:08:48.800 | on committing crimes.
00:08:50.360 | They really tried with reinforcement learning
00:08:52.780 | with human feedback,
00:08:54.260 | but sometimes the models can still be brittle
00:08:57.120 | and exhibit undesired behaviors.
00:08:59.660 | Now it's time to get ready for the spam inundation
00:09:04.020 | we're all about to get.
00:09:05.200 | OpenAI admit that GPT-4 is gonna be a lot better
00:09:09.980 | at producing realistic, targeted disinformation.
00:09:13.720 | In their preliminary results,
00:09:16.060 | they found that GPT-4 had a lot of proficiency
00:09:19.820 | at generating text that favors autocratic regimes.
00:09:24.120 | Get ready for Propaganda 2.0.
00:09:27.040 | Now we reach the crazy zone,
00:09:29.500 | and honestly, you might wanna put your seatbelt on.
00:09:32.680 | I defy anyone not to be stunned by the last example
00:09:37.140 | that I mentioned from the report.
00:09:38.860 | I doubt much of the media will read all the way through
00:09:42.180 | and find out themselves.
00:09:43.920 | The report says that novel capabilities
00:09:45.900 | often emerge in more powerful models.
00:09:48.060 | Okay, fine.
00:09:48.900 | Some that are particularly concerning
00:09:50.960 | are the ability to create and act on long-term plans,
00:09:55.240 | to accrue power and resources (power seeking),
00:09:58.160 | and to exhibit behavior that is increasingly agentic,
00:10:02.560 | as in acting like a subjective agent.
00:10:05.580 | But here, surely they're just introducing the topic.
00:10:07.780 | What's bad about that?
00:10:08.780 | Well, it says some evidence already exists
00:10:12.060 | of such emergent behavior in models.
00:10:15.740 | Okay, that's pretty worrying.
00:10:18.280 | It goes on, "More specifically, power seeking is optimal
00:10:21.500 | for most reward functions and many types of agents.
00:10:24.920 | And there is evidence that existing models
00:10:27.920 | can identify power seeking
00:10:30.060 | as an instrumentally useful strategy,"
00:10:32.980 | meaning that OpenAI have detected
00:10:35.260 | that models, which might include GPT-4, seek out more power.
00:10:39.460 | If you thought that was concerning, it does get worse.
00:10:42.420 | By the way, here is the report that they linked to
00:10:44.740 | and the authors can be found on the website.
00:10:45.580 | The authors conclude that machine learning systems
00:10:48.580 | are not fully under human control.
00:10:51.540 | But finally, I promised craziness, and here it is.
00:10:55.300 | Look at the footnote on page 53 of the technical report.
00:10:59.980 | ARC, by the way, are the Alignment Research Center
00:11:03.740 | who got early access to GPT-4.
00:11:06.040 | It says, "To simulate GPT-4 behaving like an agent
00:11:10.580 | that can act in the world,
00:11:11.900 | ARC combined GPT-4 with a simple
00:11:15.420 | read-execute-print loop
00:11:18.420 | that allowed the model to execute code,
00:11:21.420 | do chain of thought reasoning,
00:11:23.420 | and to delegate to copies of itself."
00:11:26.420 | ARC then investigated whether a version of this program
00:11:30.420 | running on a cloud computing service
00:11:32.420 | with a small amount of money
00:11:34.420 | and an account with a language model API
00:11:37.420 | would be able to make more money,
00:11:39.420 | set up copies of itself,
00:11:41.420 | and to increase its own robustness.
00:11:44.420 | They were kind of testing
00:11:45.260 | if it would lead to the singularity.
00:11:47.260 | I know that sounds dramatic,
00:11:48.260 | but they wanted to see if the model could improve itself
00:11:51.260 | with access to coding, the internet, and money.
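For intuition only, here is a minimal sketch of what a read-execute-print loop of that kind can look like. Everything in it is an assumption made for illustration: call_model is a placeholder, not a real API call, and nothing here resembles ARC's actual harness:

```python
# Toy read-execute-print agent loop, loosely in the shape the system
# card describes. call_model() is a placeholder, not a real API, and
# exec() here is NOT a safe sandbox.
import contextlib
import io


def call_model(transcript: str) -> str:
    """Stand-in for a language-model API call; returns canned code here."""
    return "print('step done')"


def run_code(code: str) -> str:
    """Execute model-written code and capture its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code)  # a real harness would isolate this, not exec() it
    return buffer.getvalue()


def agent_loop(goal: str, max_steps: int = 5) -> str:
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # Ask for step-by-step reasoning plus code to run next.
        reply = call_model(transcript + "Think step by step, then write code.\n")
        result = run_code(reply)
        # Feed the execution result back in, closing the loop. Delegating
        # to copies of itself would just be another call_model() on a sub-task.
        transcript += reply + "\n" + result
    return transcript


print(agent_loop("make more money"))
```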
00:11:54.260 | Now, is it me, or does that sound kind of risky?
00:11:58.260 | Maybe not for GPT-4.
00:11:59.260 | Sure, it's not smart enough yet.
00:12:01.260 | But if this is the test that they're going to use
00:12:03.260 | on GPT-5 or 6 or 7,
00:12:06.260 | color me slightly concerned.
00:12:09.260 | At this point, I find it very interesting to note
00:12:12.260 | that the red team seemed to have concerns
00:12:15.100 | about releasing GPT-4 like this.
00:12:17.100 | And OpenAI had to declare that
00:12:20.100 | "participation in this red teaming process
00:12:23.100 | is not an endorsement of the deployment plans
00:12:27.100 | of OpenAI or OpenAI's policies."
00:12:30.100 | In other words, a lot of these people
00:12:32.100 | probably agreed to test GPT-4,
00:12:35.100 | but didn't agree with OpenAI's approach
00:12:38.100 | to releasing models.
00:12:39.100 | Very interesting that they had to put that caveat.
00:12:42.100 | Before I wrap up, some last interesting points.
00:12:44.940 | On the topic of safety, I find it hilarious
00:12:46.940 | that on their promotional website,
00:12:48.940 | when you click on "safety", you get this.
00:12:51.940 | A 404 message.
00:12:53.940 | The page you were looking for doesn't exist.
00:12:55.940 | You may have mistyped the address.
00:12:57.940 | The irony of that for some people
00:12:59.940 | will be absolutely overwhelming.
00:13:01.940 | The safety page just doesn't exist.
00:13:04.940 | For other people, that will be darkly funny.
00:13:06.940 | Couple last interesting things for me.
00:13:08.940 | Here are the companies that are already using GPT-4.
00:13:12.940 | So of course you can use Bing
00:13:14.780 | to access GPT-4,
00:13:16.780 | use the new ChatGPT Plus version,
00:13:18.780 | or any of the apps that you can see on screen.
00:13:22.780 | For example, Morgan Stanley is using it,
00:13:24.780 | the Khan Academy is using it for tutoring,
00:13:26.780 | and even the government of Iceland.
00:13:28.780 | Other such companies are listed here.
00:13:31.780 | I'm going to leave you here with a very ironic image
00:13:34.780 | that OpenAI used to demonstrate GPT-4's abilities.
00:13:38.780 | It's a joke about blindly just stacking on more and more layers
00:13:42.780 | to improve neural networks.
00:13:44.620 | GPT-4, using its insane number of new layers,
00:13:47.620 | is able to read, understand the joke,
00:13:50.620 | and explain why it's funny.
00:13:52.620 | If that isn't inception, I don't know what is.
00:13:55.620 | Anyway, let me know what you think.
00:13:57.620 | Of course I will be covering GPT-4 relentlessly
00:14:00.620 | over the coming days and weeks.
00:14:02.620 | Have a wonderful day.