It would be easy to write a title based on yesterday's OpenAI leak claiming that language model progress is hitting a wall. And it would be just as easy to write an all-caps hype headline if I only listened to the very latest cherry-picked quotes from, say, Sam Altman, CEO of OpenAI.
But I want to try to convey the nuance behind the headlines and will be diving into some brand new papers to do so. The ground truth lies somewhere between both extremes and even OpenAI don't know exactly where. Here's how the video would look if I only wanted to focus on the bear case for LLMs or AI more generally.
An undisclosed source at OpenAI apparently leaked information to The Information. The article was published late yesterday, and here it is. While the number of people using ChatGPT has soared, the rate of improvement, it says, for the basic building blocks, the language models underpinning it, appears to be slowing down.
The article isn't talking about a particular product or a way of generating outputs from a model; it's talking about the underlying model. OpenAI's current core model is GPT-4o, and so the next model would naturally be called, say, GPT-5. The suggestion from this article is that the more favoured name is Orion, but either way we're talking about that core underlying pre-trained model.
At least at an early stage, that new Orion model was looking pretty decent. Remember, this comes from a person who heard comments directly from Sam Altman at OpenAI, so presumably an OpenAI staff member. And that person said, though OpenAI had only completed 20% of the training process for this new model Orion, it was already on par with GPT-4 in terms of intelligence and abilities to fulfil tasks and answer questions.
That sounds decent, right? But when the training was finished, according to this person, the final increase in quality was far smaller compared with the jump between GPT-3 and GPT-4. So it apparently exceeds prior models, but not by as much as previous leaps.
And the article gives a bit more detail here. Some researchers at the company believe Orion isn't reliably better than its predecessor in handling certain tasks, like coding. It might be better at language tasks, but remember, the trade-off for a bigger model is that it would generally be slower and more expensive.
And why, according to the article, might progress have slowed down? Well, roughly speaking, you can think of GPT-4 as having trained on most of the accessible web. They kind of ignored copyright to basically train on anything they could grab hold of. At that point, it becomes quite hard to scale up another 10 times, another order of magnitude, because where are you going to get that extra data?
You can tell though that the article is somewhat guessing because they also put forward the hypothesis that it could be that it's just getting too expensive to train models. They selectively quote Noam Brown, who said that it's soon going to cost hundreds of billions of dollars to train the next generation.
And at some point, the scaling paradigm breaks down. How do I know that's a selective quote? Well, because Noam Brown just today said so, saying he was selectively quoted and that he thinks there won't be a slowdown in AI progress any time soon. The theme of the article, though, is clear, and they quote one OpenAI investor saying this: "We're increasing the number of GPUs used to train AI, but we're not getting the intelligence improvements out of it at all." And another analyst was quoted at the end saying, "You could argue that for now we are seeing a plateau in the performance of LLMs." As the author of SimpleBench, I could bring in tons of examples of the latest models making silly mistakes that a human might not make.
Then throw in a few more quotes making Sam Altman or OpenAI look bad and call it a day. Alternatively, I could make a video focused only on the most hype-worthy of clips. For example, here are four times in the last few days that Sam Altman has given us quotes that make it seem like we're on the verge of a giant leap forward.
First, he says, "We know what to do to reach AGI." Second, and by the way, these are in approximately ascending order of optimism, he says scaling is going to continue, yes, for a long time. Then he mysteriously alludes to a, quote, "breathtaking" research result that he can't talk about. Fourth, and this has to be the most extreme example, he hinted at solving all of physics using AI.
To his credit, at least, he could see the grandiosity in some of his claims. The typedfemale Twitter account jokingly paraphrased Sam Altman: "We are a few thousand days away from building God. We will build suns on earth, unify physics, and resurrect the worthy dead." Interview host: "Sounds like this will be really impactful for startups." Sam Altman: "Definitely." So hopefully you can see that just by selecting which news stories to quote and which clips to use, you can present entirely different stories.
What then is the truth, and how will we know? Well, a quick hint, before I give some evidence to that effect, is that even OpenAI don't know. That's according to a key researcher whom I'll quote in a moment. This paper, FrontierMath, isn't just interesting because it gives the results of whether current AI models can compete at the very frontier of mathematics.
Answer: no, they can't. But it's also interesting because it shows us what needs to happen before they can. They came up with around a hundred questions developed in collaboration with 60 mathematicians from leading institutions: professors, International Math Olympiad question writers, and Fields Medalists. Think of that as the Nobel Prize of mathematics.
They go on that these problems typically demand hours or even days for specialist mathematicians to solve. And Terence Tao, widely regarded as one of the smartest human beings alive, said these are extremely challenging. Even he couldn't solve most of them. How about the latest language models? Well, they can solve between 1 and 2% of them.
That, though, isn't too disgraceful, given that these are unpublished problems, novel ones not found in the training data. This benchmark should serve as somewhat of a canary in the coal mine, because before any model can, quote, "solve all of physics", you'd have thought it could get at least 50 to 90% on this benchmark.
Why not 100%? Because they estimate an error rate in the benchmark itself of around 10%. A quick sidebar: other benchmarks, the MMLU for example, are known to have error rates of around 10%. So depending on your perspective, this could either be a sobering wake-up call about the remaining deficiencies of models, or actually startlingly impressive.
Is it the long context window of Gemini 1.5 Pro that enables it to get 2%? Or is o1-preview being underestimated when it says it gets 1%? On page 9 of the paper, they admit that those results were from a single evaluation, and when they tested models across repeated trials, o1-preview demonstrated the strongest performance.
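To get a feel for why a single evaluation is so noisy at these scores, here's a minimal simulation sketch in Python. The 2% solve rate and 100-problem benchmark size are my illustrative assumptions, not the paper's methodology; the point is just that one run over a tiny, very hard benchmark can easily land anywhere from 0% to double the true rate.

```python
# Minimal sketch: why one evaluation run is unreliable when a model's true
# per-problem solve rate is ~2% on a ~100-problem benchmark (assumed numbers).
import random

random.seed(0)

TRUE_ACCURACY = 0.02   # assumed per-problem solve probability
NUM_PROBLEMS = 100     # assumed benchmark size

def run_eval() -> float:
    """Score one full pass over the benchmark."""
    solved = sum(random.random() < TRUE_ACCURACY for _ in range(NUM_PROBLEMS))
    return solved / NUM_PROBLEMS

single_run = run_eval()
repeats = [run_eval() for _ in range(50)]

print(f"single-run estimate: {single_run:.0%}")                    # could easily be 0%, 1% or 4%
print(f"average of 50 runs:  {sum(repeats) / len(repeats):.1%}")   # converges towards ~2%
print(f"spread across runs:  {min(repeats):.0%} to {max(repeats):.0%}")
```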
Of course, you probably don't need reminding that this is just o1-preview; the full o1 comes out maybe in the next two weeks. If I were only trying to present the pessimistic case, however, I could focus on this quote. One mathematician interviewed about the difficulty of the FrontierMath problems said this: "Benchmark problems aren't quite the same as coming up with original proofs.
So much of mathematics takes years to develop and research, and that's really hard to encapsulate in a benchmark." On the other hand, if you wanted to get hyped, you could hear this quote from the co-founder of Anthropic: "You're saying these things are dumb? People are making the math-test equivalent of a basketball eval designed by NBA All-Stars, because the things have gotten so good at basketball that no other tests stand up for more than six months before they're obliterated by AI models." It should, of course, be interesting to see if my own benchmark, SimpleBench, is obliterated in the next six months.
We tested the new, small Claude 3.5 Haiku from Anthropic, and it displaced GPT-4o mini to reach 13th at 15.6% on SimpleBench. SimpleBench is all about testing common-sense reasoning; the human baseline is in the mid-80s, while frontier models get around the low 40s. At this point I would definitely forgive you for being torn between optimism and pessimism, but where do I come down?
Well, the key to further progress could come from a different axis entirely: data efficiency. After all, to solve FrontierMath problems you either need to be a genius or you need access to relevant training data that is almost non-existent. There are apparently only around a dozen papers with the relevant results, said Terence Tao.
Now, of course, maths isn't everything, and many of you will rightly argue that general intelligence, AGI, might arrive well before a model can crush FrontierMath. But still, the challenges of solving FrontierMath are roughly analogous to the challenges in other domains. So I would ask: will companies like OpenAI get access to those few dozen papers that contain the relevant reasoning steps?
And even if they do, can the models themselves pick out the signal from the noise, picking out those reasoning steps contained within those dozens of papers from the tens of trillions of words they're also trained on? The o1 family of models from OpenAI suggests that is at least possible.
If you're new to the channel and have no idea what I'm talking about when I mention the o1 family of models, do check out my video on the topic. The very brief TL;DR is that the test-time compute paradigm suggests models might be able to surface, at inference time, when they're generating outputs, just one output among tens of thousands that contains the necessary reasoning steps.
If that's correct, expect rapid progress on the FrontierMath benchmark. Of course, those reasoning steps do have to be found somewhere in the training data that the model's weights derive from. But as long as they are, progress can continue, and at this point I'll bring us back to the article that kicked everything off.
Even if OpenAI can only improve the quality of the underlying model, the GPT-5 or Orion, at a slower rate, it will still result in much better reasoning outputs. In short: if we're asking a model to output 10,000 different answers, there's a greater chance that at least one of those answers is correct.
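To make that concrete, here's a back-of-the-envelope sketch. The per-sample solve rates are made-up illustrative numbers, not anything OpenAI has published, and it deliberately ignores the separate challenge of picking the correct answer out of the candidates.

```python
# Back-of-the-envelope: chance that at least one of n independently sampled
# answers is correct, for an assumed per-sample solve rate p (illustrative numbers).
def p_at_least_one_correct(p_single: float, n_samples: int) -> float:
    return 1 - (1 - p_single) ** n_samples

for p in (0.0001, 0.0002):        # assumed per-sample solve rates; doubling p mimics a modestly better base model
    for n in (1, 100, 10_000):    # number of sampled answers
        print(f"p={p:.4f}, n={n:>6}: {p_at_least_one_correct(p, n):.2%}")

# Note: this says nothing about *finding* the correct sample among the 10,000,
# which is why the quality of the underlying model still matters.
```

Even under these toy numbers, doubling the per-sample quality takes the chance of at least one correct answer among 10,000 samples from roughly 63% to roughly 86%, which is the intuition behind the claim.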
And again, if the quality of the underlying model is even just incrementally better, it has a significantly better chance of discerning that correct answer from the noise. Hopefully I've conveyed that the ground-truth reality is a lot more nuanced and complex than either the positive, Panglossian perspective or the everything's-hit-a-wall perspective.
Here, at last, then, is the quote that I've been teasing for quite a while. One of the stars behind the training of the o1 family of models said he can see progress continuing for at least one or two more years, but they simply don't know how long that will last.
"I think it's still unclear um I think that basically a lot of the assumptions about why we would be hitting a wall needs to be re-evaluated completely given that this there's this new paradigm and so we're still trying to figure that out. I suspect that a lot of other people are going to be trying to figure that out and the answer is like right now we don't know.
Looking at the limitations of pre-training and saying that's going to be a blocker on continued progress I think that that is no longer a blocker." If even OpenAI don't know how much further scaling can go, how can we know? I'm going to end though on a lighter note because not everything in AI is about text-based benchmarks and reasoning.
According to the well-known co-founder and CEO of Runway, OpenAI is planning to finally release Sora in around two weeks. Sora is, of course, that incredible video-generation model first described back in February. So even if, unlike me, you're one of those who believes that reasoning progress will grind to a halt, that doesn't mean progress in other modalities will too.
What would explain the discrepancy? Well, we just have so much more data from videos, say YouTube, and from images to train models on in those domains than we do for text. Indeed, in any domain where there's an abundance of data, expect progress to continue rapidly. Take, for example, speech-to-text: you've heard me talk before on the channel about channel sponsor AssemblyAI and their Universal-1 model.
Well, in a surprise even to me, they have now come out with Universal-2. I'll link the research page below because there is an absolute ton I could go through. It probably goes without saying that the word error rates are much lower for Universal-2 compared to all these other models.
That's why as I've mentioned many times before I actually reached out to AssemblyAI to be a channel sponsor. And speaking of audio don't forget to check out my podcasts and exclusive videos on my Patreon which is called AI Insiders. What am I up to now? Almost 40 videos and podcasts.
If you leave this video neither overly hyped nor overly skeptical I've done my job and I get that that leaves us in somewhat of a weird place. So here is an AI generated video that captures just a bit of that weirdness. "Can a robot write a symphony? Can a robot turn a canvas into a beautiful masterpiece?
Can you?" Well done to Dari3D who put that video together and a big thank you to all of you for watching to the end. Thank you so much and have a wonderful day.