
Leak: ‘GPT-5 exhibits diminishing returns’, Sam Altman: ‘lol’


Chapters

0:00 Introduction
0:39 Bear Case, The Information Leak
4:01 Bull Case, Sam Altman
6:20 FrontierMath
11:29 o1 Paradigm
13:11 Text to Video Greatness and Universal-2


00:00:00.000 | it would be easy to write a title based on yesterday's OpenAI leak that language model
00:00:06.960 | progress is hitting a wall. And it would be just as easy to write an all-caps hype headline
00:00:13.200 | if I only listened to the very latest cherry-picked quotes from, say, Sam Altman,
00:00:19.280 | CEO of OpenAI. But I want to try to convey the nuance behind the headlines and will be diving
00:00:27.040 | into some brand new papers to do so. The ground truth lies somewhere between both extremes and
00:00:34.320 | even OpenAI don't know exactly where. Here's how the video would look if I only wanted to focus
00:00:41.600 | on the bear case for LLMs or AI more generally. An undisclosed source at OpenAI leaked information
00:00:49.440 | apparently to The Information. The article was published late yesterday and here it is.
00:00:55.680 | While the number of people using ChatGPT has soared, the rate of improvement, it says,
00:01:01.120 | for the basic building blocks, the language models underpinning them, appears to be slowing down.
00:01:06.240 | The article isn't talking about a particular product or a way of generating outputs from a
00:01:11.520 | model. It's talking about the underlying model. The current core model of OpenAI is GPT-4o,
00:01:18.560 | and so the next model would naturally be called, say, GPT-5. The suggestion from this article is
00:01:23.920 | that the more favoured name is Orion, but still we're talking about that core underlying pre-trained
00:01:29.280 | model. At least at an early stage, that new Orion model was looking pretty decent. Remember,
00:01:34.480 | this comes from a person who heard comments directly from Sam Altman at OpenAI, so presumably
00:01:40.640 | an OpenAI staff member. And that person said, though OpenAI had only completed 20% of the
00:01:46.880 | training process for this new model Orion, it was already on par with GPT-4 in terms of
00:01:52.720 | intelligence and abilities to fulfil tasks and answer questions. That sounds decent, right,
00:01:57.760 | but when the training was finished, according to this person, the final increase in quality
00:02:03.360 | was far smaller compared with the jump between GPT-3 and GPT-4. So it's apparently better than
00:02:09.440 | prior models, its performance exceeds those prior models, but not by as much as previous leaps.
00:02:15.200 | And the article gives a bit more detail here. Some researchers at the company believe Orion
00:02:20.880 | isn't reliably better than its predecessor in handling certain tasks, like coding. It might
00:02:26.000 | be better at language tasks, but remember, the trade-off for a bigger model is that it would
00:02:31.360 | generally be slower and more expensive. And why, according to the article, might progress have
00:02:36.880 | slowed down? Well, roughly speaking, you can think of GPT-4 as having trained on most of the
00:02:43.600 | accessible web. They kind of ignored copyright to basically train on anything they could grab hold
00:02:49.200 | of. At that point, it becomes quite hard to scale up another 10 times, another order of magnitude,
00:02:54.480 | because where are you going to get that extra data? You can tell though that the article is
00:02:58.560 | somewhat guessing because they also put forward the hypothesis that it could be that it's just
00:03:02.880 | getting too expensive to train models. They selectively quote Noam Brown, who said that
00:03:08.160 | it's soon going to cost hundreds of billions of dollars to train the next generation. And at some
00:03:12.720 | point, the scaling paradigm breaks down. How do I know that's a selective quote? Well, because Noam
00:03:18.000 | Brown just today said so, saying he was selectively quoted and that he thinks that there won't be a
00:03:24.720 | slowdown in AI progress any time soon. The theme though of the article is clear and they quote one
00:03:30.880 | OpenAI investor saying this, "We're increasing the number of GPUs used to train AI, but we're
00:03:36.880 | not getting the intelligence improvements out of it at all." And another analyst was quoted at the
00:03:42.000 | end saying, "You could argue that for now we are seeing a plateau in the performance of LLMs."
00:03:48.080 | As the author of SimpleBench, I could bring in tons of examples of the latest models making
00:03:53.440 | silly mistakes that a human might not make. Then throw in a few more quotes making Sam Altman or
00:03:59.120 | OpenAI look bad and call it a day. Alternatively, I could make a video focused only on the most hype
00:04:05.840 | worthy of clips. For example, here's four times in the last few days, Sam Altman has given us
00:04:11.920 | quotes that make it seem like we're on the verge of a giant leap forward. First, he says,
00:04:17.600 | "We know what to do to reach AGI." Second, and by the way, these are in
00:04:43.920 | approximately ascending order of optimism, he says scaling is going to continue yes for a long time.
00:04:50.800 | Then he mysteriously alludes to a, quote, "breathtaking research result that he can't
00:05:10.400 | talk about." Fourth, and this has to be the most extreme example, he hinted at
00:05:23.920 | solving all of physics using AI. To his credit at least, he could see the grandiosity in some
00:05:50.080 | of his claims. The typedfemale Twitter account jokingly quoted Sam Altman, "We are a few thousand
00:05:56.000 | days away from building God. We will build suns on earth, unify physics, and resurrect the worthy
00:06:02.560 | dead." Interview host, "Sounds like this will be really impactful for startups." Sam Altman,
00:06:07.680 | "Definitely." So hopefully you can see that just by selecting which news stories to quote
00:06:13.760 | and which clips to use, you can present entirely different stories. What then is the truth and how
00:06:20.480 | will we know? Well, a quick hint before I give some evidence to that effect is that even OpenAI
00:06:26.640 | don't know. That's according to a key researcher whom I'll quote in a moment.
00:06:30.720 | This paper, FrontierMath, isn't just interesting because it gives the results of whether current AI
00:06:36.240 | models can compete at the very frontier of mathematics. Answer, no they can't. But it's
00:06:41.600 | also interesting because it shows us what needs to happen before they can. They came up with around
00:06:47.120 | a hundred questions developed in collaboration with 60 mathematicians from leading institutions,
00:06:52.640 | professors, International Math Olympiad question writers, and Fields Medalists. Think of that like
00:06:58.000 | the Nobel Prize for mathematics. They go on to say that these problems typically demand hours or even days
00:07:05.280 | for specialist mathematicians to solve. And Terence Tao, widely regarded as one of the smartest human
00:07:11.520 | beings alive, said these are extremely challenging. Even he couldn't solve most of them.
00:07:17.360 | How about the latest language models? Well, they can solve between 1 and 2% of them. That though
00:07:32.640 | isn't too disgraceful given that these are unpublished problems, novel ones not found
00:07:39.120 | in the training data. This benchmark though should serve as somewhat of a canary in the coal mine
00:07:45.120 | because before any model can, quote, "solve all of physics", you'd have thought it could get at least
00:07:50.640 | 50 to 90% in this benchmark. Why not 100%? Because they estimate an error rate in the benchmark
00:07:57.680 | itself of around 10%, meaning even a flawless model would top out near 90%. A quick sidebar: other benchmarks are known to have an error rate around
00:08:04.480 | 10% too, for example the MMLU. So depending on your perspective this could either be a sobering wake
00:08:11.360 | up call about the remaining deficiencies of models or actually startlingly impressive. Is it the long
00:08:18.000 | context window of Gemini 1.5 Pro that enables it to get 2%? Or is o1-preview being underestimated
00:08:24.720 | when it says it gets 1%? On page 9 of the paper they admit that those results were from a single
00:08:30.560 | evaluation, and when they tested models across repeated trials, o1-preview demonstrated the
00:08:36.000 | strongest performance. Of course you probably don't need reminding that this is just o1-preview;
00:08:41.040 | the full o1 comes out maybe in the next two weeks. If I were only trying to present the pessimistic
00:08:47.040 | case however I could focus on this quote. One mathematician interviewed about the difficulty
00:08:52.160 | of the FrontierMath problems said this: "Benchmark problems aren't quite the same as coming up with
00:08:58.000 | original proofs. So much of mathematics takes years to develop and research and that's really
00:09:02.960 | hard to encapsulate in a benchmark." On the other hand if you wanted to get hyped you could hear
00:09:07.760 | this quote from the co-founder of Anthropic "You're saying these things are dumb? People
00:09:12.720 | are making the math test equivalent of a basketball eval designed by NBA all-stars because the things
00:09:19.680 | have gotten so good at basketball that no other tests stand up for more than six months before
00:09:25.200 | they're obliterated by AI models." It should of course be interesting to see if my own benchmark
00:09:30.320 | SimpleBench is obliterated in the next six months. We tested the new small Claude 3.5 Haiku from
00:09:36.720 | Anthropic and it displaced GPT-4o Mini to reach 13th at 15.6% in SimpleBench. SimpleBench is all
00:09:44.800 | about testing common reasoning and the human baseline is in the mid 80s while frontier models
00:09:50.240 | get around the low 40s. At this point I would definitely forgive you for being torn between
00:09:54.960 | optimism and pessimism but where do I come down? Well the key to further progress could come from
00:10:01.120 | a different axis entirely: data efficiency. After all, to solve FrontierMath problems you either
00:10:07.440 | need to be a genius or you need to have access to relevant training data that is almost non-existent.
00:10:13.760 | There are apparently only a dozen papers with the relevant material, said Terence Tao. Now of course
00:10:19.840 | maths isn't everything and many of you will rightly argue that general intelligence, AGI,
00:10:25.280 | might arrive well before a model can crush FrontierMath. But still, the challenges of solving
00:10:31.120 | FrontierMath are roughly analogous to those in other domains. So I would ask, will companies like
00:10:37.040 | OpenAI get access to those few dozen papers that contain the relevant reasoning steps? And even if
00:10:43.280 | they do, can the models themselves pick out the signal from the noise? Pick out those reasoning
00:10:49.200 | steps contained within those dozens of papers from the tens of trillions of words that they're also
00:10:55.360 | trained on? The o1 family of models from OpenAI suggests that is at least possible. If you're new
00:11:02.000 | to the channel and you have no idea what I'm talking about when I talk about the o1 family
00:11:06.080 | of models, do check out my video on the topic. The very brief TL;DR is that the test-time compute
00:11:12.240 | paradigm suggests that models might be able to extract at inference time, when they're producing
00:11:17.760 | outputs, just one output among tens of thousands that contains the necessary reasoning steps.
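
As a rough illustration of that best-of-n idea, here is a minimal sketch: sample many candidates at inference time, then keep the one a scorer ranks highest. The `generate` and `score` functions below are hypothetical stand-ins for a model's sampler and a learned verifier; nothing here reflects OpenAI's actual o1 implementation.

```python
import random

# Hypothetical stand-ins: in a real system, generate() would sample one
# chain-of-thought from a language model, and score() would be a learned
# verifier or reward model. Neither reflects OpenAI's actual o1 internals.

def generate(prompt: str) -> str:
    """Sample one candidate answer (simulated here with a random quality tag)."""
    return f"candidate-{random.random():.6f}"

def score(candidate: str) -> float:
    """Rate a candidate; higher means more likely to contain correct reasoning."""
    return float(candidate.rsplit("-", 1)[1])

def best_of_n(prompt: str, n: int = 10_000) -> str:
    """Draw n candidates at inference time and keep the highest-scoring one."""
    return max((generate(prompt) for _ in range(n)), key=score)

print(best_of_n("Prove the identity...", n=1_000))
```
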
00:11:24.080 | If that's correct, expect rapid progress on the FrontierMath benchmark. Of course those reasoning
00:11:29.680 | steps do have to be found somewhere in the training data that the weights of a model derive from. But
00:11:35.280 | as long as they are progress can continue and at this point I'll bring us back to the article that
00:11:40.480 | kicked everything off. Even if OpenAI can only improve the quality of the underlying model, the
00:11:45.680 | GPT-5 or Orion, at a slower rate, it will still result in much better reasoning outputs. In short, because
00:11:52.480 | if we're asking a model to output 10,000 different answers there's a greater chance that at least one
00:11:58.800 | of those answers is correct. And again if the quality of the underlying model is even just
00:12:04.800 | incrementally better it has a significantly better chance of discerning that correct answer from the
00:12:11.920 | noise. Hopefully I've conveyed that the ground truth reality is a lot more nuanced and complex
00:12:18.560 | than the positive Panglossian perspective or the everything's hit a wall perspective.
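
To put rough numbers on that argument: if each of n independent samples is correct with probability p, the chance that at least one is correct is 1 - (1 - p)^n. A minimal sketch, with p values that are purely illustrative assumptions rather than measured success rates:

```python
# pass@n: the probability that at least one of n independent samples is
# correct, given a per-sample success rate p. Illustrative numbers only.

def pass_at_n(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for p in (0.0001, 0.0002, 0.0005):
    print(f"p = {p:.4f}  ->  pass@10,000 = {pass_at_n(p, 10_000):.3f}")

# p = 0.0001  ->  pass@10,000 = 0.632
# p = 0.0002  ->  pass@10,000 = 0.865
# p = 0.0005  ->  pass@10,000 = 0.993
```

Note how merely doubling the per-sample success rate lifts pass@10,000 from roughly 63% to 87%, which is the sense in which even an incrementally better base model pays off disproportionately at inference time.
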
00:12:24.720 | Here at last then is the quote that I've been teasing for quite a while. One of the stars behind
00:12:30.000 | the training of the o1 family of models said he can see progress continuing at least for one or
00:12:34.640 | two more years but they simply don't know how long that will last. "I think it's still unclear
00:12:40.800 | um I think that basically a lot of the assumptions about why we would be hitting a wall needs to be
00:12:48.080 | re-evaluated completely given that this there's this new paradigm and so we're still trying to
00:12:52.400 | figure that out. I suspect that a lot of other people are going to be trying to figure that out
00:12:56.000 | and the answer is like right now we don't know. Looking at the limitations of pre-training and
00:13:01.200 | saying that's going to be a blocker on continued progress I think that that is no longer a blocker."
00:13:06.000 | If even OpenAI don't know how much further scaling can go, how can we know? I'm going to end though
00:13:12.080 | on a lighter note because not everything in AI is about text-based benchmarks and reasoning.
00:13:17.920 | According to the well-known co-founder and CEO of Runway, OpenAI is planning to finally release
00:13:24.880 | Sora in around two weeks. Sora is of course that incredible video generation model first described
00:13:30.960 | back in February. So even if you're one of those, unlike me, who believes that reasoning progress
00:13:36.480 | will grind to a halt, that doesn't mean progress in other modalities will too. What would explain
00:13:41.760 | the discrepancy? Well, we just have so much more data, from videos, say YouTube, and from images, to
00:13:48.080 | train models on in those domains than we do for text. Indeed, in any domain where there's an abundance
00:13:54.000 | of data, expect progress to continue rapidly. Take for example speech-to-text: you guys have
00:13:59.600 | heard me before on the channel talking about channel sponsor AssemblyAI and their Universal-1 model.
00:14:05.840 | Well, in a surprise even to me, they have now come out with Universal-2. I'll link the research
00:14:11.920 | page below because there is an absolute ton I could go through. It probably goes without saying
00:14:16.720 | that the word error rates are much lower for Universal-2 compared to all these other models.
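
For context on what that metric measures, here is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. The example sentences are made up, and this is of course not AssemblyAI's implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ~= 0.167
```
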
00:14:22.640 | That's why as I've mentioned many times before I actually reached out to AssemblyAI to be a channel
00:14:28.240 | sponsor. And speaking of audio don't forget to check out my podcasts and exclusive videos
00:14:33.120 | on my Patreon which is called AI Insiders. What am I up to now? Almost 40 videos and podcasts.
00:14:39.440 | If you leave this video neither overly hyped nor overly skeptical I've done my job and I get that
00:14:44.800 | that leaves us in somewhat of a weird place. So here is an AI generated video that captures just
00:14:50.880 | a bit of that weirdness. "Can a robot write a symphony? Can a robot turn a canvas into a
00:14:56.480 | beautiful masterpiece? Can you?"
00:15:02.080 | [Music]
00:15:14.560 | [Music]
00:15:26.880 | Well done to Dari3D who put that video together and a big thank you to all of you for watching
00:15:40.560 | to the end. Thank you so much and have a wonderful day.