
How Far Can We Scale AI? Gen 3, Claude 3.5 Sonnet and AI Hype


Transcript

Artificial worlds generated by AI video models have never been more tangible and accessible, and they look set to transform how millions and then billions of people consume content. And artificial intelligence, in the form of the new, free Claude 3.5 Sonnet, is more capable than it has ever been. But I will draw on interviews from the last few days to show that there are more questions than ever, not just about the merits of continued scaling of language models but about whether we can rely on the words of those who lead these giant AI orgs.

But first, AI video generation, which is truly on fire at the moment. These outputs are from Runway Gen 3, available to many now and, apparently, to everyone in the coming days. The audio, by the way, is also AI-generated, this time from Udio. And as you watch these videos, remember that the AI models generating them were likely trained on far less than 1% of the video data that's available.

Unlike high-quality text data, video data isn't even close to being used up. Expect generations to get far more realistic, and not in too long either. And by the way, if you're bored while waiting on the Gen 3 waitlist, do play about with the Luma Dream Machine. I've got to admit it is pretty fun to generate two images, or submit two real ones, and have the model interpolate between them.

Now, those of you in China have actually already been able to play with a model of similar capabilities called Kling. But we are all waiting on the release of Sora, the most promising video generation model of them all, from OpenAI. Here are a couple of comparisons between Runway Gen 3 and Sora.

The prompts used in both cases are identical, and there's one example that particularly caught my eye. As many of us may have realized by now, simply training models on more data doesn't necessarily mean they pick up accurate world models. Now, I strongly suspect that Sora was trained on way more data with way more compute.

With Sora's generation at the bottom, you can see that the dust emerges from behind the car. This neatly demonstrates the benefits of scale, but it still leaves open the question of whether scale will solve everything. Now, yes, it would be simple to extrapolate a straight line upwards and say that with enough scale we get a perfect world simulation, but I just don't think it will be like that.

And there are already more than tentative hints that scale won't solve everything. More on that in just a moment, but there is one more modality I am sure we were all looking forward to, which is going to be delayed. That's the real-time advanced voice mode from OpenAI. It was the star of the GPT-4o demo and was promised in the coming weeks.

Alas, though, it has now been delayed to the fall, or the autumn, and they say that's in part because they want to improve the model's ability to detect and refuse certain content. I also suspect, though, that, just like the dodgy physics with video generation and the hallucinations with language generation, they realized it occasionally goes off the rails.

Now, I personally find this funny, but you let me know whether this would be acceptable to release. "Refreshing coolness in the air that just makes you want to smile and take a deep breath of that crisp invigorating breeze. The sun's shining but it's got this lovely gentle warmth that's just perfect for a light jacket." So either way, we're definitely going to have epic entertainment, but the question is: what's next?

Particularly when it comes to the underlying intelligence of models, is it a case of shooting past human level or of diminishing returns? Well, here's some anecdotal evidence from the recent release of Claude 3.5 Sonnet from Anthropic. It's free, fast, and, in certain domains, more capable than comparable language models.

This table, I would say, shows you a comparison on things like basic mathematical ability and general knowledge against models like GPT-4o and Gemini 1.5 Pro from Google. I would caution that many of these benchmarks have significant flaws, so I wouldn't pay too much attention to decimal-point differences.

The most interesting comparison, I would argue, is between Claude 3.5 Sonnet and Claude 3 Sonnet. There is some evidence that Claude 3.5 Sonnet was trained with about four times as much compute as Claude 3 Sonnet, and you can see the difference that makes. It's definitely a boost across the board, but it would be hard to argue that it's four times better. In the visual domain, it is noticeably better than its predecessor and than many other models, and I got early access, so I tested it a fair bit.

These kinds of benchmarks test reading charts and diagrams and answering basic questions about them, but the real question is how much extra compute, and therefore money, these companies can continue to invest if the returns are still incremental. In other words, how much more will you, and more importantly businesses, continue to pay for these incremental benefits?

After all, in no domain are these models reaching a hundred percent, so let me try to illustrate that with an example. As we follow this example, ask yourself whether you would pay four times as much for a five percent hallucination rate versus an eight percent hallucination rate if, in both cases, you have to check the answer anyway.
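To make that trade-off concrete before we get to the example, here's a toy back-of-envelope calculation. Every number in it is an assumption of mine, purely for illustration: the point is simply that when a human has to verify every answer anyway, a lower hallucination rate buys less than a fourfold price difference might suggest.

```ts
// Toy calculation with made-up numbers; only the shape of the argument matters.
const answersPerMonth = 1000;        // assumed workload
const checkMinutesPerAnswer = 2;     // you verify every answer either way
const reworkMinutesPerError = 10;    // extra time spent when the model got it wrong

// Total human minutes per month for a given hallucination rate
function monthlyMinutes(errorRate: number): number {
  return answersPerMonth * (checkMinutesPerAnswer + errorRate * reworkMinutesPerError);
}

const cheaperModel = monthlyMinutes(0.08); // 8% hallucination rate
const pricierModel = monthlyMinutes(0.05); // 5% hallucination rate, but four times the price

console.log(cheaperModel, pricierModel);
// 2800 vs 2500 minutes: roughly an 11% reduction in human time,
// which the fourfold price difference would have to justify on its own.
```

Change the assumed numbers and the conclusion can flip, which is exactly why the answer depends on your workload and on how costly an unchecked error would be.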

Let me demonstrate with the brilliant new feature you can use with Claude 3.5 Sonnet from Anthropic. It's called Artifacts. Think of it like an interactive project that you can work on alongside the language model. I dumped a multi-hundred-page document on the model and asked the following: find three questions on functions from this document and turn them into clickable flashcards in an artifact, with full answers and explanations revealed interactively.

It did it, and that is amazing, but there's one slight problem. Question one is perfect: a real question from the document, displayed perfectly and interactively, with the correct answer and explanation. Same thing for question two, but then we get to question three, where it copied the question incorrectly.

Worse than that, it rejigged and changed the answer options. Also, is there a real difference between q squared and negative q squared, when it claimed that negative q squared is the answer? Now, you might find this example trivial, but I think it's revealing. Don't get me wrong: this feature is immensely useful, and it wouldn't take me long to simply tweak that third question. And by the way, finding those three examples strewn across a multi-hundred-page document is impressive.
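For those wondering what a flashcard artifact like this actually involves, here's a minimal sketch written as a small React component in TypeScript. To be clear, this is my own illustrative reconstruction, not the code Claude generated: the placeholder question, options, and explanation are mine, and I'm simply assuming a React-style artifact.

```tsx
// Illustrative sketch only: a minimal clickable-flashcard component of the kind
// an artifact might contain. The card below is a made-up placeholder, not one
// of the questions Claude extracted in the video.
import React, { useState } from "react";

type Flashcard = {
  question: string;
  options: string[];
  answerIndex: number; // index into options
  explanation: string;
};

const cards: Flashcard[] = [
  {
    question: "If f(x) = 2x + 3, what is f(4)?", // placeholder question
    options: ["8", "11", "14"],
    answerIndex: 1,
    explanation: "Substitute x = 4: 2(4) + 3 = 11.",
  },
  // ...the other extracted questions would follow the same shape
];

export default function Flashcards() {
  // Track, per card, whether the answer and explanation are revealed
  const [revealed, setRevealed] = useState<boolean[]>(cards.map(() => false));

  const toggle = (i: number) =>
    setRevealed((prev) => prev.map((r, j) => (j === i ? !r : r)));

  return (
    <div>
      {cards.map((card, i) => (
        <div
          key={i}
          onClick={() => toggle(i)}
          style={{ border: "1px solid #ccc", padding: 12, marginBottom: 8, cursor: "pointer" }}
        >
          <p>{card.question}</p>
          <ol>
            {card.options.map((opt, j) => (
              <li key={j}>{opt}</li>
            ))}
          </ol>
          {revealed[i] && (
            <p>
              Answer: {card.options[card.answerIndex]}. {card.explanation}
            </p>
          )}
        </div>
      ))}
    </div>
  );
}
```

The structure itself is simple; the hard part, and where Claude slipped, is faithfully copying the questions and answer options out of the source document.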

Even though it would save me some time, I would still have to diligently check every character of Claude's answer, and at the moment, as I discussed in more detail in my previous video, there is no indication that scale will solve this issue. Now, if you think I'm just quibbling and that benchmarks show the real progress, well, here is the reasoning lead at Google DeepMind, working on their Gemini series of models.

Someone pointed out a classic reasoning error made by Claude 3.5 Sonnet, and Denny Zhou said this: "Love seeing tweets like this rather than those on LLMs with PhD/superhuman intelligence or fancy results on leaked benchmarks." I'm definitely not the only one skeptical of benchmark results, and an even more revealing response to Claude 3.5's basic errors came from OpenAI's Noam Brown.

I think it's more revealing because it shows that those AI labs, Anthropic and OpenAI, had their hopes slightly dashed over the reasoning gains they expected from multimodal training. Noam Brown said frontier models like GPT-4o and now Claude 3.5 Sonnet may be at the level of a "smart high schooler", echoing the words of Mira Murati, CTO of OpenAI, in some respects, but they still struggle on basic tasks like tic-tac-toe.

And here's the key quote: "There was hope that native multimodal training would help with this kind of reasoning, but that hasn't been the case." That last sentence is somewhat devastating to the naive scaling hypothesis. In other words, there was hope that native multimodal training on things like video from YouTube would teach models a world model and so help with this kind of reasoning, but that hasn't been the case.

Now, of course, these companies are working on far more than just naive scaling, as we'll hear in a moment from Bill Gates, but it's not as if you can look at benchmark results on a chart and just extrapolate forwards. Here's Bill Gates promising two more turns of scaling, by which I think he means two more orders of magnitude, but notice how he seems sceptical about whether that will be enough.

"The big frontier is not so much scaling. We have probably two more turns of the crank on scaling whereby accessing video data and getting very good at synthetic data that we can scale up probably you know two more times. That's not the most interesting dimension. The most interesting dimension is what I call metacognition where understanding how to think about a problem in a broad sense and step back and say okay how important is this answer, how could I check my answer, you know what external tools would help me with this?

So we're going to get the scaling benefits, but at the same time, the various actions to change the underlying reasoning algorithm from the trivial that we have today to more human-like metacognition, that's the big frontier. It's a little hard to predict how quickly that'll happen. You know, I've seen that we will make progress on that next year, but we won't completely solve it for some time after that." And there were others who used to be incredibly bullish on scaling who now sound a little different.

Here's Microsoft AI CEO Mustafa Suleyman, perhaps drawing on lessons from the mostly defunct Inflection AI that he used to run, saying it won't be until GPT-6 that AI models will be able to follow instructions and take consistent action. "There's a lot of cherry-picked examples that are impressive, you know, on Twitter and stuff like that, but to really get it to consistently do it in novel environments is pretty hard, and I think that it's going to be not one but two orders of magnitude more computation of training the models, so not GPT-5 but more like GPT-6 scale models.

So I think we're talking about two years before we have systems that can really take action." Now, based on the evidence that I put forward in my previous video, I still think that's kind of naive, but let me know if you agree. Reasoning breakthroughs will rely on new research breakthroughs, not just more scale.

And even Sam Altman said as much about a year ago, saying the era of ever more scaling of parameter count is over. Now, as we'll hear, he has since contradicted that, saying current models are small relative to where they'll be. But at this point you might be wondering about emergent behaviors.

Don't certain capabilities just spring out when you reach a certain scale? Well, I simply can't resist a quick plug for my new Coursera series that is out this week. The second module covers emergent behaviors, and if you already have a Coursera account, do please check it out; it'd be free for you. And if you were thinking of getting one, there'll be a link in the description.

Anyway, here's that quote from Sam Altman, somewhat contradicting the comments he made a year ago. Models, he says, get predictably better with scale. "We're still just, like, so early in developing such a complex system. There's data issues, there's algorithmic issues, the models are still quite small relative to what they will be someday, and we know they get predictably better." But this was the point I was trying to make at the start of the video.

As I argued in my previous video, I think we're now at a time in AI where we really have to work hard to separate the hype from the reality. Simply trusting the words of the leaders of these AI labs is less advisable than ever, and of course it's not just Sam Altman.

Here's the commitment from Anthropic, led by Dario Amodei, from last year. They described why they don't publish their research, saying it's because "we do not wish to advance the rate of AI capabilities progress", but their CEO, just three days ago, said AI is progressing fast due in part to their own efforts.

"To try and keep pace with the rate at which the complexity of the models is increasing. I think this is one of the biggest challenges in the field. The field is moving so fast, including by our own efforts, that we want to make sure that our understanding keeps pace with our abilities, our capabilities to produce powerful models." He then went on to say that today's models are like undergraduates, which if you've interacted with these models seems pretty harsh on undergraduates.

"If we go back to the analogy of like today's models are like undergraduates, you know, let's say those models get to the point where, you know, they're kind of, you know, graduate level or strong professional level. Think of biology and drug discovery. Think of a model that is as strong as, you know, a Nobel Prize winning scientist or, you know, the head of the, you know, the head of drug discovery at a major pharmaceutical company." Now, I don't know if he's basing that on a naive trust in benchmarks or whether he is deliberately hyping.

And then later in the conversation with the guy who's in charge of the world's largest sovereign wealth fund, he described how the kind of AI that Anthropic works on could be instrumental in curing cancer. "I look at all the things that have been invented. You know, if I look back at biology, you know, CRISPR, the ability to like edit genes.

If I look at, you know, CAR-T therapies, which have cured certain kinds of cancers, there's probably dozens of discoveries like that lying around. And if we had a million copies of an AI system that are as knowledgeable and as creative about the field as all those scientists that invented those things, then I think the rate of those discoveries could really proliferate.

And, you know, some of our really, really longstanding diseases, you know, could be addressed or even cured." Now, he added some caveats, of course, but that was a claim echoed on the same day, actually, I think, by OpenAI's Sam Altman. "One of our partners, Color Health, is now using GPT-4 for cancer screening and treatment plans.

And that's great. And then maybe a future version will help discover cures for cancer." Other AI lab leaders like Mark Zuckerberg think those claims are getting out of hand. "But, you know, part of that is the open source thing too. So that way, other companies out there can create different things and people can just hack on it themselves and mess around with it.

So I guess that's a pretty deep worldview that I have. And I don't know, I find it a pretty big turnoff when people in the tech industry kind of talk about building this one true AI. It's like, it's almost as if they kind of think they're creating God or something.

And it's like, it's just, that's not what we're doing. I don't think that's how this plays out." Implicitly, he's saying that companies like OpenAI and Anthropic are getting carried away. And later in that interview, though, the CEO of Anthropic admitted that he was somewhat pulling things out of his hat when it came to biology, and actually to scaling as well.

"You know, let's say, you know, you extend people's productive ability to work by 10 years, right? That could be, you know, one sixth of the whole economy." "Do you think that's a realistic target?" "I mean, again, like I know some biology, I know something about how the AMLs are going to happen.

I wouldn't be able to tell you exactly what would happen, but like, I can tell a story where it's possible." "So 15%, and when will we, so when could we have added the equivalent of 10 years to our life? I mean, how long, what's the timeframe?" "Again, like, you know, this involves so many unknowns, right?

If I try and give an exact number, it's just going to sound like hype. But like, a thing I could, a thing I could imagine is like, I don't know, like two to three years from now, we have AI systems that are like capable of making that kind of discovery.

Five years from now, those, those discoveries are actually being made. And five years after that, it's all gone through the regulatory apparatus and, and really has. So, you know, we're talking about more, we're talking about, you know, a little over a decade, but really I'm just pulling things out of my hat here.

Like, I don't know that much about drug discovery. I don't know that much about biology. And frankly, although I invented AI scaling, I don't know that much about that either. I can't predict it." The truth, of course, is that we simply don't know what the ramifications will be of further scaling and, of course, of new research.

Regardless, these companies are pressing ahead. Here's Dario Amodei again, this time on the cost of training frontier models. "Right now, a hundred million. There are models in training today that are more like a billion. I think if we go to 10 or a hundred billion, and I think that will happen in 2025, 2026, maybe 2027, and the algorithmic improvements continue apace and the chip improvements continue apace, then I think there is, in my mind, a good chance that by that time we'll be able to get models that are better than most humans at most things." But I want to know what you think.

Are we at the dawn of a new era in entertainment and intelligence, or has the hype gone too far? If you want to hear more of my reflections, do check out my podcasts on my Patreon, AI Insiders. You could also check out the dozens of bonus videos I've got on there and the live meetups arranged via Discord.

But regardless, I just want to thank you for getting all the way to the end and joining me in these wild times. Have a wonderful day.