GPT-5 is All About Data
00:00:00.000 |
To find out what I could about GPT-5, I have read every academic paper I could find about it, 00:00:06.520 |
every leaked report, interview snippet and media article. I can summarize it like this: 00:00:12.260 |
it will come down to data: how much of it there is, how it's used and where it comes from. 00:00:18.660 |
These are the factors that will dictate whether GPT-5 gets released later this year and whether 00:00:24.900 |
it will actually approach genius level IQ. Some media reports have picked up on this potential 00:00:30.800 |
leak about GPT-5, you can read it here. I have put quite a few hours into trying to verify whether 00:00:37.140 |
this might be accurate and even though it's now being quoted by reputable sources, I still can't 00:00:42.840 |
confirm its accuracy. So for now I'll just say that the rest of the document seems accurate 00:00:47.560 |
but who knows. I am not relying on this for my research about GPT-5 but the scale, 25,000 GPUs, 00:00:54.880 |
does seem right. TechRadar here describes ChatGPT as having been trained on 10,000 NVIDIA GPUs. 00:01:02.160 |
And don't forget those were A100 GPUs. Microsoft might well now have access to the H100 GPU which 00:01:10.220 |
according to every source is a big step up from A100 GPUs on pretty much every metric. 00:01:16.540 |
And what about timelines for GPT-5? Would later this year be accurate? Well, we can infer 00:01:23.160 |
a rough timeline from Jordi Ribas, Microsoft's head of search. 00:01:24.860 |
GPT-4, or its equivalent, was completed sometime around late spring/early summer 00:01:30.740 |
of 2022. That would be just around the time that DeepMind published this which in massively 00:01:37.760 |
oversimplified terms lays out a framework for optimizing parameter size with the number 00:01:44.180 |
of training tokens aka how much info from the web it's trained on. Turns out models 00:01:49.940 |
like GPT-3 and Palm had way more parameters than needed anyway. 00:01:54.360 |
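To make that DeepMind finding concrete, here is a rough sketch of the widely cited "compute-optimal" rule of thumb of roughly 20 training tokens per parameter. The 20x ratio is a common approximation of the DeepMind result, not a figure stated in this video, so treat the numbers as illustrative.

```python
# Illustrative sketch of the Chinchilla "compute-optimal" rule of thumb:
# roughly 20 training tokens per model parameter (an approximation, not exact).

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rough compute-optimal training-token budget for a model of n_params parameters."""
    return n_params * tokens_per_param

# GPT-3: ~175 billion parameters, but trained on only ~300 billion tokens
gpt3_params = 175e9
optimal = chinchilla_optimal_tokens(gpt3_params)
print(f"Compute-optimal tokens for a GPT-3-sized model: {optimal / 1e12:.1f} trillion")
print("Tokens GPT-3 actually saw: ~0.3 trillion")
```

By this rough rule, a GPT-3-sized model "wanted" around 3.5 trillion tokens but saw a tenth of that, which is exactly the sense in which these models were wastefully big relative to their data.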
It was the data and especially high quality data that it was lacking. So all those graphs about GPT-4 00:02:01.680 |
needing 100 trillion parameters were absolutely farcical. It could even be that GPT-5 has the 00:02:08.640 |
same or fewer parameters than GPT-4. This less wrong post from July of 2022 picks up on that 00:02:17.100 |
finding and points out that it is data not size that is currently the active constraint on language 00:02:23.860 |
modeling performance. Current returns to additional data are immense and current returns to additional 00:02:30.520 |
model size are miniscule. Indeed most recent landmark models are wastefully big. If we can 00:02:36.340 |
leverage enough data there is no reason to run 500 billion parameter models much less 1 trillion 00:02:42.700 |
parameter or larger models. Remember it's data not parameter count. The link to all of these articles 00:02:48.760 |
by the way will be in the description. At this point let me quickly say that if you're learning anything don't forget to 00:02:53.360 |
leave a like or a comment. Frankly even abuse helps the algorithm so go for it. What about chat GPT? 00:02:58.760 |
Well GPT-3 along with a host of other models was trained on about 300 billion tokens. By the way 00:03:06.320 |
what defines a token shifts in the literature, but one word works out to somewhere between 1 and 1.4 tokens. Therefore 00:03:12.680 |
think of a token as roughly one word. As you can see from the graph below, PaLM was trained on about 00:03:22.860 |
780 billion tokens, and DeepMind's Chinchilla on about 1.4 trillion tokens. That particular LessWrong post was referenced here 00:03:30.540 |
in this academic paper released in October. This paper is absolutely key to this video. It's focused 00:03:38.700 |
entirely on whether we will run out of data as it pertains to machine learning and large language 00:03:44.520 |
models. One of the key takeaways of this paper is the approximation given for how much high quality data / tokens 00:03:52.360 |
might be out there. The stock of high quality language data is approximated at between 4.6 00:03:58.660 |
trillion and 17 trillion words. The next point it makes is key. We are within one order of magnitude 00:04:06.580 |
of exhausting high quality data and this will likely happen between 2023 and 2027. For those that 00:04:14.920 |
don't know being an order of magnitude bigger means being 10 times bigger than what came previously. 00:04:19.540 |
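To make "within one order of magnitude" concrete, you can compare the tokens existing models were trained on against the estimated stock, using the approximate figures quoted in this video. This is just arithmetic on those figures, nothing more.

```python
import math

# Approximate figures quoted in the video
stock_tokens = 9e12         # middle-of-the-road estimate of the high-quality data stock
chinchilla_tokens = 1.4e12  # tokens Chinchilla was trained on
gpt3_tokens = 3e11          # tokens GPT-3 was trained on

def orders_of_magnitude(used: float, stock: float) -> float:
    """How many powers of ten separate the stock from what has been used."""
    return math.log10(stock / used)

print(f"Chinchilla to full stock: {orders_of_magnitude(chinchilla_tokens, stock_tokens):.2f} orders")
print(f"GPT-3 to full stock:      {orders_of_magnitude(gpt3_tokens, stock_tokens):.2f} orders")
```

On these numbers, Chinchilla already sits less than one order of magnitude below the estimated stock, which is why the paper's "exhaustion between 2023 and 2027" claim is plausible.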
Now I want you to remember that 2023 to 2027 timeline for a moment because first I want to 00:04:25.660 |
mention why high quality data is important. Running out of that could mean running out of the rapid 00:04:31.960 |
improvements in GPT models. The paper says models trained on this kind of high quality data 00:04:38.320 |
perform better, so it is common practice to use high quality data for training language models. 00:04:44.380 |
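Labs do not publish their quality filters, so here is only a toy illustration of what "filtering for high quality data" can mean in its simplest form. Real pipelines use trained classifiers scored against curated corpora; every threshold below is an invented placeholder.

```python
# Toy illustration of a "high-quality text" filter. Real pipelines are
# unpublished and far more sophisticated; this heuristic only conveys the idea.

def looks_high_quality(text: str, min_words: int = 20, max_symbol_ratio: float = 0.1) -> bool:
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be a substantive document
    # Count characters that are neither alphanumeric, whitespace, nor common punctuation
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace() or ch in ".,;:'\"!?-()"))
    return symbols / max(len(text), 1) <= max_symbol_ratio

corpus = ["A clear explanatory paragraph " * 10, "lol @@@###", "ok"]
kept = [doc for doc in corpus if looks_high_quality(doc)]
print(len(kept))
```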
And where does that high quality data come from? Well, to be honest, not knowing that is a big part of the problem. 00:04:49.040 |
Which we will definitely come back to but here is a rough idea. We have scientific papers, books, 00:04:56.720 |
scraped content from the web, the news, code etc. Plus Wikipedia of course. The paper also mentions 00:05:05.000 |
here the middle of the road estimate of 9 trillion tokens of high quality data available. That 00:05:11.480 |
estimate will be central in defining the near-term future of artificial intelligence. One more order of magnitude 00:05:18.540 |
of data, and the performance increase it would bring, is a huge deal. That would change everything. But I must say this 00:05:25.800 |
estimate contrasts with some others such as the 3.2 trillion token estimate from that original post. 00:05:33.180 |
And the author did say that they were trying to make it an overestimate. And what about this from 00:05:39.000 |
David Chapman, a PhD in AI from MIT? He references the DeepMind study and that LessWrong post and 00:05:47.040 |
makes two important 00:05:48.040 |
and plausible observations. First, that GPT-4 or Bing may have scraped the bottom of the web-text 00:05:55.720 |
barrel and that this might be why its responses sometimes turn out like emoting teenagers. I 00:06:03.200 |
actually did a video on the crazy conversations you can have with Bing that you can check out 00:06:08.340 |
after this one. But second he suggests that there might be a reason that neither Google nor OpenAI 00:06:14.220 |
have been forthcoming about where they get their data from. Now, I'm not saying it's about 00:06:20.540 |
illegality, but it might be about avoiding controversy over attribution and compensation. 00:06:26.880 |
Take me, I have math tutorials on the web that I'm sure have been scraped and now lo and behold 00:06:33.760 |
Bing can teach math. I'm not complaining but it would be nice to at least know what has been used 00:06:38.740 |
and what hasn't. This of course mirrors the raging legal issues around AI image generation. 00:06:44.200 |
Those fights are only just beginning for text models. Wanting to know where the data came from 00:06:50.720 |
is going to become a huge issue and this article lays out just some of the surprising sources of 00:06:56.620 |
data for Google's BARD model. Check out one of them which is YouTube. Could it be that your 00:07:02.500 |
comments right now are being harvested? Quite possibly. But I want to get back to the central 00:07:08.200 |
question. What of GPT-5? Well, here on the far right is Google's PaLM, 00:07:13.880 |
which, if you remember back from the earlier paper, was trained on only around 780 billion tokens, 00:07:20.900 |
and PaLM was definitely not trained on an optimal ratio of data to parameters. GPT-5 will learn the lessons from this 00:07:27.560 |
and will probably scrape as much high quality data as it possibly can. And don't forget another year 00:07:33.580 |
has gone by since GPT-4 was handed to Microsoft, and the stock of high quality data grows each year, 00:07:43.620 |
even without further efficiencies in data use or extraction. So even if Bing did use all the 00:07:49.800 |
high quality data available I don't think it did. And even if David Chapman is right, 00:07:53.760 |
the stock of data now available is going to be greater. But if Bing was trained on a similar 00:08:00.000 |
amount of data to Palm, say 1 trillion tokens, but now GPT-5 maxes out, we could genuinely be 00:08:08.280 |
talking about an order of magnitude improvement. I'm going to briefly survey some of the 00:08:13.320 |
implications of that in a moment. But before I do I want to show you the ways that OpenAI will 00:08:18.840 |
likely be improving GPT-5 regardless of previous limitations. First, more ways might be found to 00:08:25.740 |
extract high quality data from low quality sources. No offense Facebook. Second, this paper from only 00:08:33.660 |
last week shows that gains can be made by automating chain of thought prompting into the 00:08:40.860 |
model. If you're not sure what chain of thought prompting is, you can look at the chart above. 00:08:43.020 |
It's a form of prompt engineering that I discussed in my video "8 Upgrades in GPT-4" where essentially 00:08:50.520 |
you force the model to lay out its working and thereby improve its output. Now this paper talks 00:08:56.160 |
about 2-3% gains but even those small gains when Bing is already this strong would be significant. 00:09:02.760 |
Don't forget these are separate upgrades to the data discussion. 00:09:06.480 |
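The chain-of-thought idea just described can be sketched in a few lines. Note that `call_model` below is a hypothetical placeholder standing in for whatever LLM API you actually use; only the prompt construction is the point.

```python
# Sketch of zero-shot chain-of-thought prompting: append an instruction that
# forces the model to lay out its working before answering.
# `call_model` is a hypothetical placeholder, not a real API.

def call_model(prompt: str) -> str:
    # Stand-in: a real implementation would send `prompt` to a language model.
    return f"[model response to: {prompt!r}]"

def chain_of_thought(question: str) -> str:
    prompt = (
        f"{question}\n"
        "Let's think step by step, showing all working before the final answer."
    )
    return call_model(prompt)

print(chain_of_thought("If a train travels 60 km in 45 minutes, what is its speed in km/h?"))
```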
Third, this paper from three weeks ago shows that language models can teach themselves 00:09:12.720 |
to use tools such as calculators, calendars and APIs. If there were no other improvements 00:09:20.100 |
honestly in GPT-5 other than this it would change the world. And I know for a fact that 00:09:25.680 |
people are working on integrating Wolfram Alpha into a large language model and look 00:09:31.200 |
at the number of tools that Wolfram Alpha has in science, math, money and more. These models 00:09:37.560 |
can actually teach themselves how to use tools and that chimes perfectly 00:09:42.420 |
with this paper which essentially lays out that using a Python interpreter models can actually 00:09:48.600 |
check if their code compiles and thereby teach themselves better coding. The links to all of 00:09:54.300 |
these papers will be in the description as I said. The fourth way that GPT-5 might be improved 00:09:59.400 |
even without more high quality data would be it being trained multiple times on the same data, 00:10:06.120 |
as laid out here by Professor Swabha Swayamdipta. She says that currently these models are trained 00:10:12.120 |
on the same data just once owing to performance and cost constraints but it may be possible to 00:10:18.600 |
train a model several times using the same data. Sure it might cost more but I think that for 00:10:24.420 |
Microsoft when all of search and its profits is the prize a few billion could be deemed worth it. 00:10:30.660 |
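One of the improvements mentioned above, models using a Python interpreter to check whether their own code even compiles, can be sketched with Python's built-in compile(). A real system would also run tests; this only checks syntax, which is the first and cheapest signal.

```python
# Minimal sketch of the "check whether generated code compiles" idea:
# syntax-check a candidate program without executing it.

def code_compiles(source: str) -> bool:
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"  # missing colon

print(code_compiles(good))  # True
print(code_compiles(bad))   # False
```

A model could generate many candidate solutions, discard the ones that fail this check, and train on the survivors, which is the self-teaching loop the paper describes.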
And this paper co-authored by that same professor lays out how models can generate additional data 00:10:37.020 |
sets on problems with which they struggle such as those with complex patterns and 00:10:41.820 |
humans could filter their answers for correctness. Think of this as artificial data generation and it 00:10:49.080 |
can lead to 10% or more in improvements. And if artificial data can be integrated honestly what 00:10:55.740 |
is actually going to bottleneck these GPT models? I could go on with the improvements that might be 00:11:01.440 |
made without new data. My central point is that data will be the big determinant but 00:11:07.560 |
there are other ways to improve GPT-5 if data turns out to be a bottleneck. 00:11:11.520 |
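The artificial-data-generation idea above can be sketched as a generate-then-filter loop: a model proposes new training examples, and only those whose answers pass a correctness check are kept. Everything below is a toy stand-in; `generate_example` fabricates arithmetic problems with occasional simulated mistakes rather than calling a real model.

```python
# Toy sketch of artificial data generation with a correctness filter.
# `generate_example` is a hypothetical stand-in for a model proposing
# new problems; the "filter" plays the role of human or automatic checking.
import random

def generate_example() -> dict:
    a, b = random.randint(1, 99), random.randint(1, 99)
    noise = random.choice([0, 0, 0, 1])  # simulate occasional model mistakes
    return {"question": f"What is {a} + {b}?", "candidate": a + b + noise, "truth": a + b}

def is_correct(example: dict) -> bool:
    return example["candidate"] == example["truth"]

examples = [generate_example() for _ in range(100)]
dataset = [ex for ex in examples if is_correct(ex)]
print(f"kept {len(dataset)} of {len(examples)} generated examples")
```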
But what if they can fully utilize 9 trillion tokens as the original paper surmised by the 00:11:18.300 |
end of 2023 or even the beginning of 2024? What could one more order of magnitude improvement 00:11:24.240 |
actually look like? The short answer is that no one knows. Probably not AGI but certainly a 00:11:31.260 |
revolution in the jobs market. Maybe this is why Sam Altman tweeted: "2023: $30,000 to get a simple iPhone app created; $300 for 00:11:41.220 |
a plumbing job. I wonder what those relative prices will look like in 2028." The likely coming 00:11:47.520 |
divergence between changes to cognitive work and changes to physical work could be quite dramatic. 00:11:52.800 |
That gives a sense of his timelines but my own guess is that the best human raters will be 00:11:58.980 |
beaten on at least some of the following benchmarks. Take reading comprehension where you can imagine 00:12:04.500 |
the extrapolation to GPT-5. If and when it occurs that would have huge implications for summarization 00:12:10.920 |
and creative writing. Next logic and critical reasoning. We're talking debating topics, 00:12:16.740 |
doing law work, discerning causality in complex scenarios. That would be huge in finance where 00:12:23.820 |
you have to sort the signal from the noise in large data sets. Physics and high school math 00:12:29.280 |
would be close to solved by an order of magnitude improvement. AI tutors replacing my job for 00:12:36.240 |
example could be with us by the end of next year. Don't forget the release 00:12:40.620 |
of GPT-5 in whichever month it comes will likely roughly coincide with the final refinements in 00:12:47.880 |
text to speech, image to text, text to image and text to video avatars. So don't think AI tutors 00:12:55.320 |
are as far as you might imagine. The reason why no one and certainly not me can be sure of timelines 00:13:01.800 |
for GPT-5 though is because they depend partly on internal safety research at Google and OpenAI. Take 00:13:10.320 |
Sam Altman to the New York Times: "And when we are ready, when we think we have completed our 00:13:15.720 |
alignment work and all of our safety thinking and worked with external auditors, other AGI Labs, 00:13:22.680 |
then we'll release those things." Here he's probably talking about GPT-4 but the same would 00:13:27.780 |
apply even more so to GPT-5. On the other hand the release and then unrelease of the Sydney model of 00:13:34.800 |
Bing might suggest otherwise. But at least according to him safety and alignment are 00:13:40.020 |
the goal. I'm going to end with this quote from Sam Altman again. He added the blue text last 00:13:45.720 |
minute to his public post on AGI released the other week. It says: "It's important that the 00:13:52.680 |
ratio of safety progress to capability progress increases." In other words, these models are 00:13:59.640 |
getting much more powerful much faster than safety work can keep up with. 00:14:04.680 |
But thank you for keeping up with this video. Thank you for watching to the end. Please 00:14:09.720 |
do check out my other videos on Bing chat and its use cases and either way have a wonderful day.