GPT 5 is All About Data

00:00:00.000 | To find out what I could about GPT-5, I have read every academic paper I could find about it,

00:00:06.520 | every leak report, interview snippet and media article. I can summarize it like this,

00:00:12.260 | it will come down to data, how much of it there is, how it's used and where it comes from.

00:00:18.660 | These are the factors that will dictate whether GPT-5 gets released later this year and whether

00:00:24.900 | it will actually approach genius level IQ. Some media reports have picked up on this potential

00:00:30.800 | leak about GPT-5, you can read it here. I have put quite a few hours in trying to verify whether

00:00:37.140 | this might be accurate and even though it's now being quoted by reputable sources, I still can't

00:00:42.840 | confirm its accuracy. So for now I'll just say that the rest of the document seems accurate

00:00:47.560 | but who knows. I am not relying on this for my research about GPT-5 but the scale, 25,000 GPUs,

00:00:54.880 | does seem right. TechRadar here describes ChatGPT as having been trained on 10,000 NVIDIA GPUs.

00:01:02.160 | And don't forget those were A100 GPUs. Microsoft might well now have access to the H100 GPU which

00:01:10.220 | according to every source is a big step up from A100 GPUs on pretty much every metric.

00:01:16.540 | And what about timelines for GPT-5? Would later this year be accurate? Well we can infer from

00:01:23.160 | Jordi Rybas that GPT-5 is a good idea. If you're not sure, you can look at the chart on the right

00:01:24.860 | side of the screen. GPT-4 or equivalent was completed sometime around late spring/early summer

00:01:30.740 | of 2022. That would be just around the time that DeepMind published this which in massively

00:01:37.760 | oversimplified terms lays out a framework for optimizing parameter size with the number

00:01:44.180 | of training tokens aka how much info from the web it's trained on. Turns out models

00:01:49.940 | like GPT-3 and Palm had way more parameters than needed anyway.

00:01:54.360 | It was the data and especially high quality data that it was lacking. So all those graphs about GPT-4

00:02:01.680 | needing 100 trillion parameters were absolutely farcical. It could even be that GPT-5 has the

00:02:08.640 | same or fewer parameters than GPT-4. This less wrong post from July of 2022 picks up on that

00:02:17.100 | finding and points out that it is data not size that is currently the active constraint on language

00:02:23.860 | modeling performance. Current returns to additional data are immense and current returns to additional

00:02:30.520 | model size are miniscule. Indeed most recent landmark models are wastefully big. If we can

00:02:36.340 | leverage enough data there is no reason to run 500 billion parameter models much less 1 trillion

00:02:42.700 | parameter or larger models. Remember it's data not parameter count. The link to all of these articles

00:02:48.760 | by the way will be in the description. At this point let me quickly say that if you're learning anything don't forget to

00:02:53.360 | leave a like or a comment. Frankly even abuse helps the algorithm so go for it. What about chat GPT?

00:02:58.760 | Well GPT-3 along with a host of other models was trained on about 300 billion tokens. By the way

00:03:06.320 | what defines a token shifts in the literature but it's somewhere between 1 and 1.4 words. Therefore

00:03:12.680 | think of a token as roughly one word. As you can see from the graph below Palm was trained on about

00:03:19.640 | 800 billion tokens approximately.

00:03:22.860 | DeepMind's chinchilla on about 1.4 trillion tokens. That particular less wrong post was referenced here

00:03:30.540 | in this academic paper released in October. This paper is absolutely key to this video. It's focused

00:03:38.700 | entirely on whether we will run out of data as it pertains to machine learning and large language

00:03:44.520 | models. One of the key takeaways of this paper is the approximation given for how much high quality data / tokens

00:03:52.360 | might be out there. The stock of high quality language data is approximated at between 4.6

00:03:58.660 | trillion and 17 trillion words. The next point it makes is key. We are within one order of magnitude

00:04:06.580 | of exhausting high quality data and this will likely happen between 2023 and 2027. For those that

00:04:14.920 | don't know being an order of magnitude bigger means being 10 times bigger than what came previously.

00:04:19.540 | Now I want you to remember that 2023 to 2027 timeline for a moment because first I want to

00:04:25.660 | mention why high quality data is important. Running out of that could mean running out of the rapid

00:04:31.960 | improvements in GPT models. The paper says models trained on the latter kind of high quality data

00:04:38.320 | perform better so it is common practice to use high quality data for training language models.

00:04:44.380 | And where does that high quality data come from? Well to be honest not knowing that is a big part of the data.

00:04:49.040 | Which we will definitely come back to but here is a rough idea. We have scientific papers, books,

00:04:56.720 | scraped content from the web, the news, code etc. Plus Wikipedia of course. The paper also mentions

00:05:05.000 | here the middle of the road estimate of 9 trillion tokens of high quality data available. That

00:05:11.480 | estimate will be central in defining the near-term future of artificial intelligence. One order of magnitude

00:05:18.540 | more as an increase in performance is a huge deal. That would change everything. But I must say this

00:05:25.800 | estimate contrasts with some others such as the 3.2 trillion token estimate from that original post.

00:05:33.180 | And the author did say that they were trying to make it an overestimate. And what about this from

00:05:39.000 | David Chapman a PhD in AI from MIT. He references the DeepMind study and that less wrong post and

00:05:47.040 | makes two important and very important statements. The first one is that the data is not necessarily

00:05:48.040 | and plausible observations. First that GPT-4 or Bing may have scraped the bottom of the web text

00:05:55.720 | barrel and that this might be why its responses sometimes turn out like emoting teenagers. I

00:06:03.200 | actually did a video on the crazy conversations you can have with Bing that you can check out

00:06:08.340 | after this one. But second he suggests that there might be a reason that neither Google nor OpenAI

00:06:14.220 | have been forthcoming about where they get their data from. Now I'm not saying it might be about

00:06:20.540 | illegality but it might be about avoiding controversy over attribution and compensation.

00:06:26.880 | Take me, I have math tutorials on the web that I'm sure have been scraped and now lo and behold

00:06:33.760 | Bing can teach math. I'm not complaining but it would be nice to at least know what has been used

00:06:38.740 | and what hasn't. This of course mirrors the raging legal issues around AI image generation.

00:06:44.200 | Fights that are only just beginning for these web techs. Wanting to know where the data came from

00:06:50.720 | is going to become a huge issue and this article lays out just some of the surprising sources of

00:06:56.620 | data for Google's BARD model. Check out one of them which is YouTube. Could it be that your

00:07:02.500 | comments right now are being harvested? Quite possibly. But I want to get back to the central

00:07:08.200 | question. What of GPT-5? Well here on the far right is Google Parallel.

00:07:13.880 | Which if you remember back from the earlier paper was powered by only 800 billion tokens

00:07:20.900 | and Palm was definitely not optimized for parameters. GPT-5 will learn the lessons from this

00:07:27.560 | and will probably scrape as much high quality data as it possibly can. And don't forget another year

00:07:33.580 | has gone by since GPT-4 was handed to Microsoft and the stock of high quality data grows by around

00:07:41.360 | 10% annually anyway.

00:07:43.620 | Even without further efficiencies in data use or extraction. So even if Bing did use all the

00:07:49.800 | high quality data available I don't think it did. And even if David Chapman is right,

00:07:53.760 | the stock of data now available is going to be greater. But if Bing was trained on a similar

00:08:00.000 | amount of data to Palm, say 1 trillion tokens, but now GPT-5 maxes out, we could genuinely be

00:08:08.280 | talking about an order of magnitude improvement. I'm going to briefly survey some of the

00:08:13.320 | implications of that in a moment. But before I do I want to show you the ways that OpenAI will

00:08:18.840 | likely be improving GPT-5 regardless of previous limitations. First, more ways might be found to

00:08:25.740 | extract high quality data from low quality sources. No offense Facebook. Second, this paper from only

00:08:33.660 | last week shows that gains can be made by automating chain of thought prompting into the

00:08:40.860 | model. If you're not sure what chain of thought prompting is, you can look at the chart above.

00:08:43.020 | It's a form of prompt engineering that I discussed in my video "8 Upgrades in GPT-4" where essentially

00:08:50.520 | you force the model to lay out its working and thereby improve its output. Now this paper talks

00:08:56.160 | about 2-3% gains but even those small gains when Bing is already this strong would be significant.

00:09:02.760 | Don't forget these are separate upgrades to the data discussion.

00:09:06.480 | Third, this paper from three weeks ago shows that language models can teach themselves

00:09:12.720 | to use tools such as calculators, calendars and APIs. If there were no other improvements

00:09:20.100 | honestly in GPT-5 other than this it would change the world. And I know for a fact that

00:09:25.680 | people are working on integrating Wolfram Alpha into a large language model and look

00:09:31.200 | at the number of tools that Wolfram Alpha has in science, math, money and more. These models

00:09:37.560 | can actually teach themselves how to use tools and that chimes perfectly

00:09:42.420 | with this paper which essentially lays out that using a Python interpreter models can actually

00:09:48.600 | check if their code compiles and thereby teach themselves better coding. The links to all of

00:09:54.300 | these papers will be in the description as I said. The fourth way that GPT-5 might be improved

00:09:59.400 | even without more high quality data would be it being trained multiple times on the same data,

00:10:06.120 | as laid out here by Professor Swayam Dipta. He says that currently these models are trained

00:10:12.120 | on the same data just once owing to performance and cost constraints but it may be possible to

00:10:18.600 | train a model several times using the same data. Sure it might cost more but I think that for

00:10:24.420 | Microsoft when all of search and its profits is the prize a few billion could be deemed worth it.

00:10:30.660 | And this paper co-authored by that same professor lays out how models can generate additional data

00:10:37.020 | sets on problems with which they struggle such as those with complex patterns and

00:10:41.820 | humans could filter their answers for correctness. Think of this as artificial data generation and it

00:10:49.080 | can lead to 10% or more in improvements. And if artificial data can be integrated honestly what

00:10:55.740 | is actually going to bottleneck these GPT models? I could go on with the improvements that might be

00:11:01.440 | made without new data. My central point is that data will be the big determinant but

00:11:07.560 | there are other ways to improve GPT-5 if data turns out to be a bottleneck.

00:11:11.520 | But what if they can fully utilize 9 trillion tokens as the original paper surmised by the

00:11:18.300 | end of 2024 or even the beginning of 2024? What could one more order of magnitude improvement

00:11:24.240 | actually look like? The short answer is that no one knows. Probably not AGI but certainly a

00:11:31.260 | revolution in the jobs market. Maybe this is why Sam Altman tweeted 2023 $30,000 to get a simple iPhone app created $300 for

00:11:41.220 | a plumbing job. I wonder what those relative prices will look like in 2028. The likely coming

00:11:47.520 | divergence between changes to cognitive work and changes to physical work could be quite dramatic.

00:11:52.800 | That gives a sense of his timelines but my own guess is that the best human raters will be

00:11:58.980 | beaten on at least some of the following benchmarks. Take reading comprehension where you can imagine

00:12:04.500 | the extrapolation to GPT-5. If and when it occurs that would have huge implications for summarization

00:12:10.920 | and creative writing. Next logic and critical reasoning. We're talking debating topics,

00:12:16.740 | doing law work, discerning causality in complex scenarios. That would be huge in finance where

00:12:23.820 | you have to sort the signal from the noise in large data sets. Physics and high school math

00:12:29.280 | would be close to solved by an order of magnitude improvement. AI tutors replacing my job for

00:12:36.240 | example could be with us by the end of next year. Don't forget the release

00:12:40.620 | of GPT-5 in whichever month it comes will likely roughly coincide with the final refinements in

00:12:47.880 | text to speech, image to text, text to image and text to video avatars. So don't think AI tutors

00:12:55.320 | are as far as you might imagine. The reason why no one and certainly not me can be sure of timelines

00:13:01.800 | for GPT-5 though is because they depend partly on internal safety research at Google and OpenAI. Take

00:13:09.480 | this quote from

00:13:10.320 | Sam Altman to the New York Times: "And when we are ready, when we think we have completed our

00:13:15.720 | alignment work and all of our safety thinking and worked with external auditors, other AGI Labs,

00:13:22.680 | then we'll release those things." Here he's probably talking about GPT-4 but the same would

00:13:27.780 | apply even more so to GPT-5. On the other hand the release and then unrelease of the Sydney model of

00:13:34.800 | Bing might suggest otherwise. But at least according to him safety and alignment are

00:13:40.020 | the goal. I'm going to end with this quote from Sam Altman again. He added the blue text last

00:13:45.720 | minute to his public post on AGI released the other week. It says: "It's important that the

00:13:52.680 | ratio of safety progress to capability progress increases." In other words these models are

00:13:59.640 | getting much more powerful much faster than they can keep up with.

00:14:04.680 | But thank you for keeping up with this video. Thank you for watching to the end. Please

00:14:09.720 | do check out my other videos on Bing chat and its use cases and either way have a wonderful day.