Leak: ‘GPT-5 exhibits diminishing returns’, Sam Altman: ‘lol’
Chapters
0:00 Introduction
0:39 Bear Case, The Information Leak
4:10 Bull Case, Sam Altman
6:20 FrontierMath
11:29 o1 Paradigm
13:11 Text-to-Video Greatness and Universal-2
It would be easy to write a title based on yesterday's OpenAI leak claiming that language model progress is hitting a wall. And it would be just as easy to write an all-caps hype headline if I only listened to the very latest cherry-picked quotes from, say, Sam Altman, CEO of OpenAI. But I want to try to convey the nuance behind the headlines, and I will be diving into some brand-new papers to do so. The ground truth lies somewhere between both extremes, and even OpenAI don't know exactly where.

Here's how the video would look if I only wanted to focus on the bear case for LLMs, or AI more generally. An undisclosed source at OpenAI leaked information, apparently to The Information. The article was published late yesterday, and here it is.
While the number of people using ChatGPT has soared, the rate of improvement, it says, for the basic building blocks, the language models underpinning them, appears to be slowing down. The article isn't talking about a particular product or a way of generating outputs from a model; it's talking about the underlying model. The current core model at OpenAI is GPT-4o, and so the next model would naturally be called, say, GPT-5. The suggestion from this article is that the more favoured name is Orion, but still, we're talking about that core underlying pre-trained model.

At least at an early stage, that new Orion model was looking pretty decent. Remember, this comes from a person who heard comments directly from Sam Altman at OpenAI, so presumably an OpenAI staff member. That person said that though OpenAI had only completed 20% of the training process for this new model, Orion, it was already on par with GPT-4 in terms of intelligence and its ability to fulfil tasks and answer questions. That sounds decent, right? But when the training was finished, according to this person, the final increase in quality was far smaller compared with the jump between GPT-3 and GPT-4. So it's apparently better than prior models, its performance exceeds theirs, but not by as much as previous leaps.
And the article gives a bit more detail here. Some researchers at the company believe Orion isn't reliably better than its predecessor at handling certain tasks, like coding. It might be better at language tasks, but remember, the trade-off for a bigger model is that it will generally be slower and more expensive.

And why, according to the article, might progress have slowed down? Well, roughly speaking, you can think of GPT-4 as having trained on most of the accessible web. They more or less ignored copyright to train on basically anything they could grab hold of. At that point, it becomes quite hard to scale up another ten times, another order of magnitude, because where are you going to get that extra data? You can tell, though, that the article is somewhat guessing, because they also put forward the hypothesis that it could simply be getting too expensive to train models. They selectively quote Noam Brown, who said that it's soon going to cost hundreds of billions of dollars to train the next generation, and at some point the scaling paradigm breaks down. How do I know that's a selective quote? Because Noam Brown said so just today, saying he was quoted selectively and that he thinks there won't be a slowdown in AI progress any time soon.

The theme of the article, though, is clear, and they quote one OpenAI investor saying this: "We're increasing the number of GPUs used to train AI, but we're not getting the intelligence improvements out of it at all." And another analyst was quoted at the end saying, "You could argue that for now we are seeing a plateau in the performance of LLMs."
As the author of SimpleBench, I could bring in tons of examples of the latest models making silly mistakes that a human might not make, then throw in a few more quotes making Sam Altman or OpenAI look bad, and call it a day.

Alternatively, I could make a video focused only on the most hype-worthy of clips. For example, here are four times in the last few days that Sam Altman has given us quotes making it seem like we're on the verge of a giant leap forward. First, he says, "We know what to do to reach AGI." Second, and by the way, these are in approximately ascending order of optimism, he says scaling is going to continue, yes, for a long time. Third, he mysteriously alludes to a, quote, "breathtaking" research result that he can't talk about. Fourth, and this has to be the most extreme example, he hinted at solving all of physics using AI. To his credit, at least he could see the grandiosity in some of his claims. The typedfemale Twitter account jokingly quoted Sam Altman: "We are a few thousand days away from building God. We will build suns on Earth, unify physics, and resurrect the worthy dead." Interview host: "Sounds like this will be really impactful for startups." Sam Altman: "Definitely."

So hopefully you can see that, just by selecting which news stories to quote and which clips to use, you can present entirely different stories. What, then, is the truth, and how will we know? Well, a quick hint, before I give some evidence to that effect, is that even OpenAI don't know. That's according to a key researcher whom I'll quote in a moment.
This paper, FrontierMath, isn't just interesting because it gives the results of whether current AI models can compete at the very frontier of mathematics (answer: no, they can't). It's also interesting because it shows us what needs to happen before they can. The authors came up with around a hundred questions, developed in collaboration with 60 mathematicians from leading institutions: professors, International Math Olympiad question writers, and Fields Medalists. Think of the Fields Medal as the mathematics equivalent of the Nobel Prize. They go on to say that these problems typically demand hours or even days for specialist mathematicians to solve. And Terence Tao, widely regarded as one of the smartest human beings alive, said these are extremely challenging; even he couldn't solve most of them.
How about the latest language models? Well, they can solve between 1 and 2% of them. That, though, isn't too disgraceful, given that these are unpublished, novel problems not found in the training data. This benchmark should serve as somewhat of a canary in the coal mine, because before any model can, quote, "solve all of physics", you'd have thought it could score at least 50 to 90% on this benchmark. Why not 100%? Because the authors estimate an error rate in the benchmark itself of around 10%. Quick sidebar: other benchmarks are known to have an error rate of around 10% too, for example the MMLU.
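As a rough illustration of that ceiling (my own sketch, not something from the paper), here's why a roughly 10% label error rate caps the score even a perfect solver can register, under the pessimistic assumption that a correct answer to a mislabeled question is always graded as wrong:

```python
def max_measurable_score(true_accuracy: float, label_error_rate: float) -> float:
    """Expected measured score on a benchmark with imperfect reference answers.

    Pessimistic assumption: a correct solution to a mislabeled question
    is always graded as incorrect.
    """
    return true_accuracy * (1.0 - label_error_rate)

# A hypothetical perfect solver on a benchmark with ~10% bad labels:
print(max_measurable_score(1.0, 0.10))  # 0.9, so ~90% is the effective ceiling
```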
So depending on your perspective, this could either be a sobering wake-up call about the remaining deficiencies of models, or actually startlingly impressive. Is it the long context window of Gemini 1.5 Pro that enables it to get 2%? Or is o1-preview being underestimated when it scores 1%? On page 9 of the paper, the authors admit that those results were from a single evaluation, and that when they tested models across repeated trials, o1-preview demonstrated the strongest performance. Of course, you probably don't need reminding that this is just o1-preview; the full o1 comes out maybe in the next two weeks. If I were only trying to present the pessimistic case, however, I could focus on this quote.
One mathematician interviewed about the difficulty of the FrontierMath problems said this: "Benchmark problems aren't quite the same as coming up with original proofs. So much of mathematics takes years to develop and research, and that's really hard to encapsulate in a benchmark." On the other hand, if you wanted to get hyped, you could listen to this quote from the co-founder of Anthropic: "You're saying these things are dumb? People are making the math-test equivalent of a basketball eval designed by NBA All-Stars, because the things have gotten so good at basketball that no other tests stand up for more than six months before they're obliterated by AI models." It should, of course, be interesting to see whether my own benchmark, SimpleBench, is obliterated in the next six months. We tested the new, small Claude 3.5 Haiku from Anthropic, and it displaced GPT-4o Mini to reach 13th at 15.6% on SimpleBench. SimpleBench is all about testing common-sense reasoning, and the human baseline is in the mid-80s, while frontier models score around the low 40s.
At this point, I would definitely forgive you for being torn between optimism and pessimism, but where do I come down? Well, the key to further progress could come from a different axis entirely: data efficiency. After all, to solve FrontierMath problems, you either need to be a genius or you need access to relevant training data that is almost non-existent; there are apparently only around a dozen papers with the relevant material, said Terence Tao. Now of course, maths isn't everything, and many of you will rightly argue that general intelligence, AGI, might arrive well before a model can crush FrontierMath. But still, the challenges of solving FrontierMath are roughly analogous to those of solving other domains. So I would ask: will companies like OpenAI get access to those dozen or so papers that contain the relevant reasoning steps? And even if they do, can the models themselves pick out the signal from the noise? Can they pick out the reasoning steps contained within those dozen papers from the tens of trillions of words they're also trained on? The o1 family of models from OpenAI suggests that it's at least possible.
If you're new to the channel and have no idea what I'm talking about when I mention the o1 family of models, do check out my video on the topic. The very brief TL;DR is that the test-time compute paradigm suggests models might be able to extract at inference time, when they're producing outputs, just the one output among tens of thousands that contains the necessary reasoning steps. If that's correct, expect rapid progress on the FrontierMath benchmark. Of course, those reasoning steps do have to be found somewhere in the training data from which the weights of a model derive. But as long as they are, progress can continue, and at this point I'll bring us back to the article that kicked everything off. Even if OpenAI can only improve the quality of the underlying model, the GPT-5 or Orion, at a slower rate, it will still result in much better reasoning output. In short, that's because if we're asking a model to output 10,000 different answers, there's a greater chance that at least one of those answers is correct. And again, if the quality of the underlying model is even just incrementally better, it has a significantly better chance of discerning that correct answer from the noise.
Hopefully I've conveyed that the ground-truth reality is a lot more nuanced and complex than either the rosy, Panglossian perspective or the "everything's hit a wall" perspective.
Here, at last, is the quote that I've been teasing for quite a while. One of the stars behind the training of the o1 family of models said he can see progress continuing for at least one or two more years, but that they simply don't know how long it will last: "I think it's still unclear. I think that basically a lot of the assumptions about why we would be hitting a wall need to be re-evaluated completely, given that there's this new paradigm, and so we're still trying to figure that out. I suspect that a lot of other people are going to be trying to figure that out, and the answer is, right now, we don't know. Looking at the limitations of pre-training and saying that's going to be a blocker on continued progress: I think that is no longer a blocker."
If even OpenAI don't know how much further scaling can go, how can we know? I'm going to end, though, on a lighter note, because not everything in AI is about text-based benchmarks and reasoning. According to the well-known co-founder and CEO of Runway, OpenAI is planning to finally release Sora in around two weeks. Sora is, of course, that incredible video generation model first unveiled back in February. So even if you're one of those, unlike me, who believes that reasoning progress will grind to a halt, that doesn't mean progress in other modalities will too. What would explain the discrepancy? Well, we just have so much more data, from videos, say YouTube, and from images, to train models on in those domains than we do for text. Indeed, in any domain where there's an abundance of data, expect progress to continue rapidly.
Take, for example, speech-to-text: you've heard me talk before on the channel about channel sponsor AssemblyAI and their Universal-1 model. Well, in a surprise even to me, they have now come out with Universal-2. I'll link the research page below, because there is an absolute ton I could go through. It probably goes without saying that the word error rates are much lower for Universal-2 compared with all these other models.
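For context, word error rate (WER) is the standard metric for speech-to-text: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. Here's a minimal sketch of that computation (my own illustration, not AssemblyAI's code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```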
That's why, as I've mentioned many times before, I actually reached out to AssemblyAI to be a channel sponsor. And speaking of audio, don't forget to check out my podcasts and exclusive videos on my Patreon, which is called AI Insiders. What am I up to now? Almost 40 videos and podcasts.

If you leave this video neither overly hyped nor overly skeptical, I've done my job, and I get that that leaves us in somewhat of a weird place. So here is an AI-generated video that captures just a bit of that weirdness. "Can a robot write a symphony? Can a robot turn a canvas into a..." Well done to Dari3D, who put that video together, and a big thank you to all of you for watching to the end. Thank you so much, and have a wonderful day.