
GPT 5 is All About Data


Transcript

To find out what I could about GPT-5, I have read every academic paper I could find about it, along with every leak report, interview snippet and media article. I can summarize it like this: it will come down to data, how much of it there is, how it's used and where it comes from.

These are the factors that will dictate whether GPT-5 gets released later this year and whether it will actually approach genius-level IQ. Some media reports have picked up on this potential leak about GPT-5; you can read it here. I have put quite a few hours into trying to verify whether it might be accurate, and even though it's now being quoted by reputable sources, I still can't confirm its accuracy.

So for now I'll just say that the rest of the document seems accurate, but who knows. I am not relying on it for my research about GPT-5, but the scale, 25,000 GPUs, does seem right. TechRadar here describes ChatGPT as having been trained on 10,000 NVIDIA GPUs, and don't forget those were A100 GPUs.

Microsoft might well now have access to the H100 GPU, which, according to every source, is a big step up from the A100 on pretty much every metric. And what about timelines for GPT-5? Would later this year be accurate? Well, we can infer a rough timeline from Jordi Ribas of Microsoft.

If you're not sure, you can look at the chart on the right side of the screen: GPT-4, or an equivalent model, was completed sometime around late spring or early summer of 2022. That would be just around the time that DeepMind published this paper, which, in massively oversimplified terms, lays out a framework for balancing parameter count against the number of training tokens, in other words how much text from the web the model is trained on.

It turns out models like GPT-3 and PaLM had way more parameters than they needed anyway; it was data, and especially high quality data, that they were lacking. So all those graphs about GPT-4 needing 100 trillion parameters were absolutely farcical. It could even be that GPT-5 has the same number of parameters as GPT-4, or fewer.
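To make that finding concrete, here is a minimal back-of-the-envelope sketch. It assumes the commonly quoted approximation of roughly 20 training tokens per parameter for compute-optimal training; that ratio and the rounded parameter counts are illustrative assumptions rather than figures taken from the video, while the token counts are the ones discussed here.

```python
# Back-of-the-envelope sketch of the Chinchilla-style rule of thumb: a
# compute-optimal model wants on the order of ~20 training tokens per
# parameter. The ratio and the rounded model sizes are illustrative
# assumptions, not official figures.

TOKENS_PER_PARAM = 20  # commonly quoted approximation of the DeepMind result

def optimal_tokens(n_params: float) -> float:
    """Roughly compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params

models = [
    ("GPT-3", 175e9, 300e9),        # ~175B parameters, ~300B training tokens
    ("PaLM", 540e9, 780e9),         # ~540B parameters, ~780B training tokens
    ("Chinchilla", 70e9, 1.4e12),   # ~70B parameters, ~1.4T training tokens
]

for name, params, tokens_used in models:
    print(f"{name}: trained on {tokens_used / 1e12:.2f}T tokens, "
          f"Chinchilla-optimal would be ~{optimal_tokens(params) / 1e12:.2f}T")
```

Under that heuristic, GPT-3 and PaLM were trained on a small fraction of the tokens their parameter counts could have absorbed, while Chinchilla sits roughly at the optimum, which is exactly the point being made here.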

This LessWrong post from July of 2022 picks up on that finding and points out that it is data, not size, that is currently the active constraint on language-modeling performance: current returns to additional data are immense, current returns to additional model size are minuscule, and indeed most recent landmark models are wastefully big.

If we can leverage enough data, there is no reason to run 500-billion-parameter models, much less 1-trillion-parameter or larger models. Remember, it's data, not parameter count. The links to all of these articles, by the way, will be in the description. At this point let me quickly say that if you're learning anything, don't forget to leave a like or a comment.

Frankly, even abuse helps the algorithm, so go for it. What about ChatGPT? Well, GPT-3, along with a host of other models, was trained on about 300 billion tokens. By the way, what exactly counts as a token shifts in the literature, but a word works out at somewhere between 1 and 1.4 tokens, so think of a token as roughly one word.

As you can see from the graph below, PaLM was trained on approximately 800 billion tokens and DeepMind's Chinchilla on about 1.4 trillion tokens. That particular LessWrong post was referenced here, in this academic paper released in October. This paper is absolutely key to this video: it's focused entirely on whether we will run out of data as it pertains to machine learning and large language models.

One of the key takeaways of this paper is its approximation of how much high quality data, in tokens, might be out there. The stock of high quality language data is estimated at between 4.6 trillion and 17 trillion words. The next point it makes is key: we are within one order of magnitude of exhausting high quality data, and this will likely happen between 2023 and 2027.

For those who don't know, being an order of magnitude bigger means being 10 times bigger than what came before. Now, I want you to remember that 2023-to-2027 timeline for a moment, because first I want to mention why high quality data is important: running out of it could mean running out of the rapid improvements in GPT models.
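As a rough sanity check on that "within one order of magnitude" claim, here is a minimal sketch comparing the quoted stock of high quality words with the largest training run mentioned above. The 4.6 to 17 trillion word stock and the ~1.4 trillion token run are the figures from this video; the ~1.3 tokens-per-word conversion is an assumption for illustration.

```python
# Rough sanity check of the "within one order of magnitude" claim above.
# The 4.6T-17T word stock and the ~1.4T token run (Chinchilla) are quoted in
# this video; the tokens-per-word factor is an illustrative assumption.

TOKENS_PER_WORD = 1.3

low_stock = 4.6e12 * TOKENS_PER_WORD    # lower bound of the stock, in tokens
high_stock = 17e12 * TOKENS_PER_WORD    # upper bound of the stock, in tokens
largest_training_run = 1.4e12           # Chinchilla's token count

print(f"stock is roughly {low_stock / largest_training_run:.0f}x to "
      f"{high_stock / largest_training_run:.0f}x the biggest run so far")
# -> roughly 4x to 16x, i.e. within one order of magnitude (10x) of exhaustion
```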

The paper says models trained on high quality data perform better, so it is common practice to use high quality data when training language models. And where does that high quality data come from? Well, to be honest, not fully knowing the answer is a big part of the story.

We will definitely come back to that, but here is a rough idea: we have scientific papers, books, content scraped from the web, the news, code and so on, plus Wikipedia of course. The paper also mentions a middle-of-the-road estimate of 9 trillion tokens of high quality data being available.

That estimate will be central in defining the near-term future of artificial intelligence. One more order of magnitude of data, and the jump in performance that could come with it, would be a huge deal; it would change everything. But I must say this estimate contrasts with some others, such as the 3.2 trillion token estimate from that original LessWrong post.

And the author of that post did say they were trying to make it an overestimate. And what about this from David Chapman, who has a PhD in AI from MIT? He references the DeepMind study and that LessWrong post and makes two important and plausible observations.

First, that GPT-4, or Bing, may have scraped the bottom of the web-text barrel, and that this might be why its responses sometimes come out like those of an emoting teenager. I actually did a video on the crazy conversations you can have with Bing, which you can check out after this one.

Second, he suggests that there might be a reason that neither Google nor OpenAI have been forthcoming about where they get their data from. I'm not saying it's about illegality; it might be about avoiding controversy over attribution and compensation. Take me: I have math tutorials on the web that I'm sure have been scraped, and now, lo and behold, Bing can teach math.

I'm not complaining, but it would be nice to at least know what has been used and what hasn't. This of course mirrors the raging legal issues around AI image generation, fights that are only just beginning for text models. Wanting to know where the data came from is going to become a huge issue, and this article lays out just some of the surprising sources of data for Google's Bard model.

Check out one of them: YouTube. Could it be that your comments right now are being harvested? Quite possibly. But I want to get back to the central question: what of GPT-5? Well, here on the far right is Google's PaLM, which, if you remember from earlier, was trained on only around 800 billion tokens, and PaLM was definitely not optimized in terms of its parameter count.

GPT-5 will learn the lessons from this and will probably be fed as much high quality data as can possibly be scraped. And don't forget, another year has gone by since GPT-4 was handed to Microsoft, and the stock of high quality data grows by around 10% annually anyway, even without further efficiencies in data use or extraction.
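To see what that roughly 10% annual growth means in practice, here is a minimal sketch compounding the mid-range 9 trillion token estimate forward over the 2023-to-2027 window mentioned earlier. The starting stock and the growth rate are the figures quoted in this video; the projection itself is just illustration.

```python
# Compounding the video's figures forward: a mid-range stock of 9 trillion
# tokens of high quality data, growing by ~10% per year. Both numbers are
# the estimates quoted above; the projection is only an illustration.

stock = 9e12        # tokens of high quality data, mid-range estimate
growth_rate = 0.10  # ~10% annual growth in the stock

for year in range(2023, 2028):
    print(f"{year}: ~{stock / 1e12:.1f}T tokens of high quality data")
    stock *= 1 + growth_rate
```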

So even if Bing did use all the high quality data available at the time, and I don't think it did, and even if David Chapman is right, the stock of data now available is going to be greater. And if Bing was trained on a similar amount of data to PaLM, say around 1 trillion tokens, while GPT-5 maxes out the available stock, we could genuinely be talking about an order of magnitude of improvement.

I'm going to briefly survey some of the implications of that in a moment. But before I do, I want to show you the ways that OpenAI will likely be improving GPT-5 regardless of those data limitations. First, more ways might be found to extract high quality data from low quality sources.

No offense, Facebook. Second, this paper from only last week shows that gains can be made by building chain-of-thought prompting into the model. If you're not sure what chain-of-thought prompting is, you can look at the chart above: it's a form of prompt engineering, which I discussed in my video "8 Upgrades in GPT-4", where essentially you force the model to lay out its working and thereby improve its output.
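For anyone who wants to see it written down, here is a minimal sketch of chain-of-thought prompting as just described: the same question asked plainly and asked with an instruction to lay out the working first. The question and the exact wording are made up for illustration and are not taken from the paper being discussed.

```python
# A minimal illustration of chain-of-thought prompting as described above:
# the same question, once asked plainly and once with an instruction to show
# the working first. The question and phrasing are made up for illustration.

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

plain_prompt = f"Q: {question}\nA:"

chain_of_thought_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step, laying out the working before giving "
    "the final answer."
)

# Either string would be sent to a language model; the chain-of-thought
# version tends to elicit intermediate reasoning and a more reliable answer.
print(plain_prompt)
print()
print(chain_of_thought_prompt)
```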

Now, this paper talks about 2-3% gains, but even those small gains, when Bing is already this strong, would be significant. And don't forget, these are upgrades separate from the data discussion. Third, this paper from three weeks ago shows that language models can teach themselves to use tools such as calculators, calendars and APIs.

Honestly, if there were no improvements in GPT-5 other than this, it would change the world. And I know for a fact that people are working on integrating Wolfram Alpha into a large language model; look at the number of tools that Wolfram Alpha has in science, math, money and more.
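To give a flavour of what tool use can look like, here is a minimal sketch in which the model emits a bracketed calculator call and a wrapper executes it and splices the result back into the text. The markup format and the helper functions are illustrative assumptions, not the actual interface of the paper being discussed.

```python
# Minimal sketch of the tool-use idea: the model emits something like
# [Calculator(23 * 47)] and a wrapper executes the call and splices the
# result back into the text. The markup and helpers are illustrative only.

import re

def run_calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression (illustrative, not hardened)."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        return "[invalid expression]"
    return str(eval(expression))

def resolve_tool_calls(model_output: str) -> str:
    """Replace [Calculator(...)] markers in model output with real results."""
    return re.sub(
        r"\[Calculator\((.+?)\)\]",
        lambda match: run_calculator(match.group(1)),
        model_output,
    )

print(resolve_tool_calls("The answer is [Calculator(23 * 47)]."))
# -> "The answer is 1081."
```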

These models can actually teach themselves how to use tools, and that chimes perfectly with this paper, which essentially lays out how, using a Python interpreter, models can check whether their code compiles and thereby teach themselves better coding. The links to all of these papers will be in the description, as I said.
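Here is a minimal sketch of that self-checking idea: candidate code is passed through Python's built-in compile() before being accepted, so syntactically broken generations can be rejected automatically. The candidate snippets are made up for illustration, and a real system would also run the code against tests, not just check syntax.

```python
# Minimal sketch of the self-checking idea above: candidate code is compiled
# before being accepted, so syntactically broken generations are rejected.
# The candidate snippets are made up; a real system would also run tests.

candidates = [
    "def add(a, b):\n    return a + b",   # valid definition
    "def add(a, b) return a + b",         # missing colon: syntax error
]

for snippet in candidates:
    try:
        compile(snippet, "<generated>", "exec")
        print("accepted:", snippet.splitlines()[0], "...")
    except SyntaxError as err:
        print("rejected:", snippet, "|", err.msg)
```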

The fourth way that GPT-5 might be improved, even without more high quality data, would be training it multiple times on the same data, as laid out here by Professor Swayamdipta, who says that currently these models are trained on the same data just once, owing to performance and cost constraints, but that it may be possible to train a model several times on the same data.

Sure, it might cost more, but I think that for Microsoft, when all of search and its profits is the prize, a few billion could be deemed worth it. And this paper, co-authored by that same professor, lays out how models can generate additional data sets for the problems they struggle with, such as those with complex patterns, with humans filtering their answers for correctness.
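Here is a minimal sketch of that generate-then-filter loop, with a toy arithmetic task standing in for the hard problems and a simple checker standing in for the human filter; both stand-ins, and the generate_candidate() helper, are assumptions for illustration rather than the paper's actual setup.

```python
# Minimal sketch of the generate-then-filter loop described above. A toy
# arithmetic task stands in for the problems the model struggles with, and
# a simple checker stands in for the human filter; generate_candidate() is
# a placeholder for a real model call.

import random

def generate_candidate(a: int, b: int) -> int:
    """Placeholder for a model proposing an answer; occasionally wrong."""
    return a + b + random.choice([0, 0, 0, 1])

def is_correct(a: int, b: int, answer: int) -> bool:
    """Stands in for the human (or automatic) correctness filter."""
    return answer == a + b

synthetic_dataset = []
for _ in range(100):
    a, b = random.randint(1, 99), random.randint(1, 99)
    answer = generate_candidate(a, b)
    if is_correct(a, b, answer):
        synthetic_dataset.append((f"{a}+{b}", answer))

print(f"kept {len(synthetic_dataset)} verified synthetic examples out of 100")
```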

Think of this as artificial data generation, and it can lead to improvements of 10% or more. And if artificial data can be integrated, honestly, what is actually going to bottleneck these GPT models? I could go on with the improvements that might be made without new data, but my central point is that data will be the big determinant, and there are other ways to improve GPT-5 if data turns out to be a bottleneck.

But what if they can fully utilize the 9 trillion tokens that the original paper surmised, by the end of 2024 or even the beginning of 2024? What could one more order of magnitude of improvement actually look like? The short answer is that no one knows. Probably not AGI, but certainly a revolution in the jobs market.

Maybe this is why Sam Altman tweeted that in 2023 it costs $30,000 to get a simple iPhone app created and $300 for a plumbing job, and that he wonders what those relative prices will look like in 2028; the likely coming divergence between changes to cognitive work and changes to physical work could be quite dramatic. That gives a sense of his timelines, but my own guess is that the best human raters will be beaten on at least some of the following benchmarks.

Take reading comprehension, where you can imagine the extrapolation to GPT-5; if and when it occurs, that would have huge implications for summarization and creative writing. Next, logic and critical reasoning: we're talking debating topics, doing legal work, discerning causality in complex scenarios. That would be huge in finance, where you have to sort the signal from the noise in large data sets.

Physics and high school math would be close to solved by an order of magnitude of improvement; AI tutors replacing my job, for example, could be with us by the end of next year. And don't forget, the release of GPT-5, in whichever month it comes, will likely roughly coincide with the final refinements in text-to-speech, image-to-text, text-to-image and text-to-video avatars.

So don't think AI tutors are as far away as you might imagine. The reason why no one, and certainly not me, can be sure of timelines for GPT-5, though, is that they depend partly on internal safety research at Google and OpenAI. Take this quote from Sam Altman to the New York Times: "And when we are ready, when we think we have completed our alignment work and all of our safety thinking and worked with external auditors, other AGI Labs, then we'll release those things." Here he's probably talking about GPT-4, but the same would apply even more so to GPT-5.

On the other hand, the release and then un-release of Bing's Sydney model might suggest otherwise. But at least according to him, safety and alignment are the goal. I'm going to end with this quote from Sam Altman again; he added the blue text at the last minute to his public post on AGI, released the other week.

It says: "It's important that the ratio of safety progress to capability progress increases." In other words, these models are getting much more powerful, much faster than the safety work can keep up with. But thank you for keeping up with this video, and thank you for watching to the end. Please do check out my other videos on Bing Chat and its use cases, and either way, have a wonderful day.