How Far Can We Scale AI? Gen 3, Claude 3.5 Sonnet and AI Hype
00:00:00.000 |
Artificial worlds generated by AI video models have never been more tangible and accessible and 00:00:08.480 |
look set to transform how millions and then billions of people consume content. And artificial 00:00:14.000 |
intelligence, in the form of the new free Claude 3.5 Sonnet, is more capable than it has ever been. 00:00:20.960 |
But I will draw on interviews in the last few days to show that there are more questions than ever, 00:00:26.400 |
not just about the merits of continued scaling of language models but about whether we can rely on 00:00:32.800 |
the words of those who lead these giant AI orgs. But first AI video generation which is truly on 00:00:40.720 |
fire at the moment. These outputs are from Runway Gen 3 available to many now and to everyone 00:00:47.600 |
apparently in the coming days. The audio, by the way, is also AI generated, this time from Udio. 00:01:20.240 |
And as you watch these videos remember that the AI models that are generating them 00:01:24.720 |
are likely trained on far less than 1% of the video data that's available. 00:01:30.400 |
Unlike high quality text data, video data isn't even close to being used up. Expect generations 00:01:36.560 |
to get far more realistic, and not too long from now either. And by the way, if you're bored while 00:01:41.920 |
waiting on the Gen 3 wait list do play about with the Luma Dream Machine. I've got to admit it is 00:01:48.640 |
pretty fun to generate two images or submit two real ones and have the model interpolate between 00:01:55.040 |
them. Now those of you in China have actually already been able to play with a model of similar 00:02:00.800 |
capabilities called Kling. But we are all waiting on the release of Sora the most promising video 00:02:08.400 |
generation model of them all from OpenAI. Here are a couple of comparisons between Runway Gen 3 00:02:15.120 |
and Sora. The prompts used in both cases are identical and there's one example that particularly 00:02:21.440 |
caught my eye. As many of us may have realized by now simply training models on more data doesn't 00:02:27.120 |
necessarily mean they pick up accurate world models. Now I strongly suspect that Sora was 00:02:33.120 |
trained on way more data with way more compute. With its generation at the bottom you can see 00:02:38.960 |
that the dust emerges from behind the car. This neatly demonstrates the benefits of scale but 00:02:44.960 |
still leaves open the question about whether scale will solve all. Now yes it would be simple 00:02:51.280 |
to extrapolate a straight line upwards and say that with enough scale we get a perfect world 00:02:57.040 |
simulation but I just don't think it will be like that. And there are already more than tentative 00:03:02.000 |
hints that scale won't solve everything. More on that in just a moment but there is one more 00:03:07.440 |
modality I am sure we were all looking forward to which is going to be delayed. That's the real-time 00:03:13.840 |
advanced voice mode from OpenAI. It was the star of the GPT-4o demo and was promised in the 00:03:21.040 |
coming weeks. Alas though it has now been delayed to the fall or the autumn and they say that's in 00:03:27.840 |
part because they want to improve the model's ability to detect and refuse certain content. 00:03:32.960 |
I also suspect, though, that like the dodgy physics in video generation and the hallucinations in 00:03:38.240 |
language generation, they realized it occasionally goes off the rails. Now I personally 00:03:44.480 |
find this funny but you let me know whether this would be acceptable to release. "Refreshing 00:03:49.680 |
coolness in the air that just makes you want to smile and take a deep breath of that crisp 00:03:55.040 |
invigorating breeze. The sun's shining but it's got this lovely gentle warmth that's just perfect 00:04:02.880 |
for a light jacket." So either way we're definitely gonna have epic entertainment but the question is 00:04:08.320 |
what's next? Particularly when it comes to the underlying intelligence of models is it a case 00:04:13.360 |
of shooting past human level or diminishing returns? Well here's some anecdotal evidence with 00:04:19.520 |
the recent release of Claude 3.5 Sonnet from Anthropic. It's free and fast and in certain 00:04:26.240 |
domains more capable than comparable language models. This table I would say shows you a 00:04:31.200 |
comparison on things like basic mathematical ability and general knowledge compared to models 00:04:36.320 |
like GPT-4o and Gemini 1.5 Pro from Google. I would caution that many of these benchmarks have 00:04:42.640 |
significant flaws, so I wouldn't pay too much attention to decimal-point differences. The most 00:04:48.000 |
interesting comparison I would argue is between Claude 3.5 Sonnet and Claude 3 Sonnet. There is 00:04:53.840 |
some evidence that Claude 3.5 Sonnet was trained on about four times as much compute as Claude 3 00:04:59.520 |
Sonnet, and you can see the difference that makes. 00:05:04.640 |
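A quick aside on why roughly 4x compute buys only an incremental boost: scaling laws predict that loss falls along a flattening power law. Here is a minimal sketch using the published Chinchilla fit from Hoffmann et al. (2022); the coefficients describe that paper's models, not Claude, so the numbers are purely illustrative.

```python
# Why ~4x compute buys only an incremental gain: the Chinchilla loss fit
# L(N, D) = E + A/N^alpha + B/D^beta (Hoffmann et al., 2022). The
# coefficients below are that paper's published estimates for their own
# models; nothing here is Anthropic data, so treat it as illustrative.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Baseline: Chinchilla's own budget of 70B parameters on 1.4T tokens.
# Compute scales as C ~ 6*N*D, so 4x compute is roughly 2x N and 2x D.
for label, n, d in [("1x", 70e9, 1.4e12), ("4x", 140e9, 2.8e12)]:
    print(f"{label} compute: predicted loss = {loss(n, d):.3f}")
# Prints ~1.937 for 1x and ~1.890 for 4x: a real but visibly
# diminishing improvement, which is what the benchmark table shows.
```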
Definitely a boost across the board, but it would be hard to argue that it's four times better. In the visual domain it is noticeably better than 00:05:11.360 |
its predecessor and than many other models; I got early access, so I tested it a fair bit. 00:05:16.720 |
These kinds of benchmarks test reading charts and diagrams and answering basic questions about them, 00:05:22.240 |
but the real question is how much extra compute and therefore money can these companies continue 00:05:27.680 |
to scale up and invest if the returns are still incremental? In other words how much more will you 00:05:34.720 |
and more importantly businesses continue to pay for these incremental benefits? After all in no 00:05:40.720 |
domains are these models reaching a hundred percent and let me try to illustrate that with 00:05:45.760 |
an example. As we follow it, ask yourself whether you would pay four times as much 00:05:50.240 |
for a five percent hallucination rate versus an eight percent hallucination rate if, in both cases, 00:05:56.240 |
you have to check the answer anyway. 00:06:00.240 |
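To make that question concrete, here is a toy expected-cost model; every number in it (prices, hallucination rates, checking and redo costs) is an invented assumption for illustration.

```python
# Toy expected-cost model for the question above: if every answer must be
# checked anyway, what is a lower hallucination rate worth? All numbers
# (prices, rates, costs) are invented assumptions for illustration.

def cost_per_task(price: float, halluc_rate: float,
                  check_cost: float, redo_cost: float) -> float:
    """Query price plus a mandatory check, plus the expected cost of
    fixing the answer on the occasions it turns out to be wrong."""
    return price + check_cost + halluc_rate * redo_cost

cheap = cost_per_task(price=1.0, halluc_rate=0.08, check_cost=2.0, redo_cost=5.0)
fancy = cost_per_task(price=4.0, halluc_rate=0.05, check_cost=2.0, redo_cost=5.0)
print(f"8% model: {cheap:.2f}  vs  5% model at 4x the price: {fancy:.2f}")
# Prints 3.40 vs 6.25: because the mandatory check dominates, the pricier
# model only wins when checking or redoing costs dwarf the query price.
```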
Let me demonstrate with the brilliant new feature you can use with Claude 3.5 Sonnet from Anthropic. It's called Artifacts. Think of it like an 00:06:05.600 |
interactive project that you can work on alongside the language model. I dumped a multi-hundred page 00:06:11.520 |
document on the model and asked the following question: find three questions on functions from 00:06:16.560 |
this document and turn them into clickable flashcards in an artifact, with full answers 00:06:21.440 |
and explanations revealed interactively. 00:06:27.600 |
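As an aside, the same prompt pattern can be reproduced over the Anthropic Messages API. A minimal sketch, where the file name is hypothetical and Artifacts' interactive rendering (a claude.ai UI feature) is not reproduced:

```python
# Minimal sketch: the same flashcard request sent over the Anthropic
# Messages API. Artifacts' interactive rendering is a claude.ai UI
# feature, so this only returns text/HTML; the file name is hypothetical.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("exam_paper.txt") as f:  # hypothetical multi-hundred-page document
    document = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # the June 2024 release
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"{document}\n\nFind three questions on functions in this "
                   "document and turn them into clickable flashcards (one "
                   "self-contained HTML file) with full answers and "
                   "explanations revealed interactively.",
    }],
)
print(response.content[0].text)
```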
It did it, and that is amazing, but there's one slight problem. Question one is perfect. It's a real question from the document, displayed perfectly 00:06:33.120 |
and interactive with the correct answer and explanation. Same thing for question two but 00:06:38.400 |
then we get to question three where it copied the question incorrectly. Worse than that it rejigged 00:06:44.160 |
and changed the answer options. Also is there a real difference between q squared and negative 00:06:50.160 |
q squared when it claimed that negative q squared is the answer? Now you might find this example 00:06:55.760 |
trivial but I think it's revealing. Don't get me wrong this feature is immensely useful and it 00:07:00.640 |
wouldn't take me long to simply tweak that third question and by the way finding those three 00:07:05.520 |
examples strewn across a multi-hundred page document is impressive. Even though it would 00:07:10.640 |
save me some time I would still have to diligently check every character of Claude's answer and at 00:07:16.800 |
the moment as I discussed in more detail in my previous video there is no indication that scale 00:07:22.720 |
will solve this issue. Now if you think I'm just quibbling and benchmarks show the real progress 00:07:28.240 |
well here is the reasoning lead at Google DeepMind working on their Gemini series of models. Someone 00:07:34.640 |
pointed out a classic reasoning error made by Claude 3.5 Sonnet and Denny Zhou said this "Love 00:07:40.880 |
seeing tweets like this rather than those on LLMs with PhD/superhuman intelligence or fancy results 00:07:48.400 |
on leaked benchmarks." I'm definitely not the only one skeptical of benchmark results and an even 00:07:54.320 |
more revealing response to Claude 3.5's basic errors came from OpenAI's Noam Brown. I think 00:08:00.320 |
it's more revealing because it shows that those AI labs Anthropic and OpenAI had their hopes 00:08:05.840 |
slightly dashed based on the results they expected in reasoning from multimodal training. Noam Brown 00:08:12.080 |
said frontier models like GPT-4o and now Claude 3.5 Sonnet may be at the level of a "smart high 00:08:19.280 |
schooler" mimicking the words of Mira Murati CTO of OpenAI in some respects but they still struggle 00:08:25.520 |
on basic tasks like tic-tac-toe. And here's the key quote "There was hope that native multimodal 00:08:32.480 |
training would help with this kind of reasoning but that hasn't been the case." That last sentence 00:08:38.560 |
is somewhat devastating to the naive scaling hypothesis. "There was hope that native 00:08:44.240 |
multimodal training on things like video from YouTube would teach models a world model. It 00:08:49.600 |
would help but that hasn't been the case." Now of course these companies are working on far more 00:08:54.480 |
than just naive scaling as we'll hear in a moment from Bill Gates but it's not like you can look at 00:08:58.800 |
the benchmark results on a chart and just extrapolate forwards. Here's Bill Gates promising 00:09:04.000 |
two more turns of scaling, I think he means two more orders of magnitude, 00:09:07.840 |
but notice how he looks sceptical about how that will be enough. "The big frontier is not so much 00:09:13.840 |
scaling. We have probably two more turns of the crank on scaling whereby accessing video data and 00:09:23.440 |
getting very good at synthetic data that we can scale up probably you know two more times. That's 00:09:32.160 |
not the most interesting dimension. The most interesting dimension is what I call metacognition 00:09:37.920 |
where understanding how to think about a problem in a broad sense and step back and say okay how 00:09:45.760 |
important is this answer, how could I check my answer, you know what external tools would help 00:09:50.800 |
me with this? So we're going to get the scaling benefits but at the same time the various actions 00:10:00.400 |
to change the underlying reasoning algorithm from the trivial that we have today to more human-like 00:10:10.320 |
metacognition, that's the big frontier. It's a little hard to predict how quickly that'll happen. 00:10:18.320 |
You know I've seen that we will make progress on that next year but we won't completely solve it 00:10:24.160 |
for some time after that." 00:10:30.800 |
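The metacognition Gates describes (step back, check the answer, consider external tools) is commonly prototyped today as a simple answer-then-critique loop. A minimal sketch against the Anthropic Messages API; the prompts and the single revision pass are assumptions, not any lab's published recipe:

```python
# Minimal "answer, then self-check" loop of the kind Gates gestures at.
# The prompts and the single revision pass are illustrative assumptions,
# not a published recipe from any lab.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"

def ask(prompt: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much is the ball?")
draft = ask(question)
# Second pass: the model critiques its own draft before we accept it.
revised = ask(
    f"Question: {question}\nDraft answer: {draft}\n"
    "Check the draft step by step. If it is wrong, give a corrected answer; "
    "if it is right, restate it concisely."
)
print(revised)
```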
And there were others who used to be incredibly bullish on scaling but now sound a little different. Here's Microsoft AI CEO Mustafa Suleyman, perhaps drawing on lessons 00:10:36.720 |
from the mostly defunct Inflection AI that he used to run, saying it won't be until GPT-6 that AI 00:10:43.040 |
models will be able to follow instructions and take consistent action. "There's a lot of cherry 00:10:47.120 |
picked examples that are impressive you know on Twitter and stuff like that but to really get it 00:10:52.640 |
to consistently do it in novel environments is pretty hard and I think that it's going to be 00:10:58.640 |
not one but two orders of magnitude more computation of training the models so not 00:11:04.160 |
GPT-5 but more like GPT-6 scale models. So I think we're talking about two years before we have 00:11:10.720 |
systems that can really take action." Now, based on the evidence that I put forward in my previous 00:11:15.920 |
video, let me know if you agree: I still think that's kind of naive. Reasoning 00:11:20.880 |
breakthroughs will rely on new research breakthroughs not just more scale. And even 00:11:25.920 |
Sam Altman said as much about a year ago saying the era of ever more scaling of parameter count 00:11:32.240 |
is over. Now as we'll hear he has since contradicted that saying current models are small relative to 00:11:38.160 |
where they'll be. But at this point you might be wondering about emergent behaviors. Don't 00:11:42.480 |
certain capabilities just spring out when you reach a certain scale? Well I simply can't resist 00:11:47.520 |
a quick plug for my new Coursera series that is out this week. The second module covers emergent 00:11:53.360 |
behaviors, and if you already have a Coursera account do please check it out; it'd be free for 00:11:58.160 |
you and if you were thinking of getting one there'll be a link in the description. Anyway 00:12:02.560 |
here's that quote from Sam Altman somewhat contradicting the comments he made a year ago. 00:12:07.040 |
Models he says get predictably better with scale. "We're still just like so early in developing such 00:12:13.440 |
a complex system. There's data issues, there's algorithmic issues, the models are still quite 00:12:20.880 |
small relative to what they will be someday and we know they get predictably better." 00:12:24.080 |
But this was the point I was trying to make at the start of the video. As I argued in my previous 00:12:29.920 |
video I think we're now at a time in AI where we really have to work hard to separate the hype 00:12:36.240 |
from the reality. Simply trusting the words of the leaders of these AI labs is less advisable 00:12:42.800 |
than ever, and of course it's not just Sam Altman. Here's the commitment from Anthropic, led by Dario 00:12:48.640 |
Amodei, back last year. They described why they don't publish their research, and they said it's 00:12:52.960 |
because "we do not wish to advance the rate of AI capabilities progress" but their CEO just three 00:12:59.440 |
days ago said AI is progressing fast due in part to their own efforts. "To try and keep pace with 00:13:06.800 |
the rate at which the complexity of the models is increasing. I think this is one of the biggest 00:13:10.800 |
challenges in the field. The field is moving so fast, including by our own efforts, that we want 00:13:16.160 |
to make sure that our understanding keeps pace with our abilities, our capabilities to produce 00:13:21.840 |
powerful models." He then went on to say that today's models are like undergraduates, which 00:13:27.040 |
if you've interacted with these models seems pretty harsh on undergraduates. "If we go back 00:13:32.800 |
to the analogy of like today's models are like undergraduates, you know, let's say those models 00:13:37.680 |
get to the point where, you know, they're kind of, you know, graduate level or strong professional 00:13:42.720 |
level. Think of biology and drug discovery. Think of a model that is as strong as, you know, a Nobel 00:13:51.440 |
Prize winning scientist or, you know, the head of the, you know, the head of drug discovery at a 00:13:56.080 |
major pharmaceutical company." Now, I don't know if he's basing that on a naive trust in benchmarks 00:14:02.240 |
or whether he is deliberately hyping. And then later in the conversation with the guy who's 00:14:07.440 |
in charge of the world's largest sovereign wealth fund, he described how the kind of AI that 00:14:12.400 |
Anthropic works on could be instrumental in curing cancer. "I look at all the things that have been 00:14:17.920 |
invented. You know, if I look back at biology, you know, CRISPR, the ability to like edit genes. If 00:14:23.280 |
I look at, you know, CAR-T therapies, which have cured certain kinds of cancers, there's probably 00:14:30.800 |
dozens of discoveries like that lying around. And if we had a million copies of an AI system that 00:14:38.400 |
are as knowledgeable and as creative about the field as all those scientists that invented those 00:14:43.680 |
things, then I think the rate of those discoveries could really proliferate. And, you know, some of 00:14:49.280 |
our really, really longstanding diseases, you know, could be addressed or even cured." Now, 00:14:55.760 |
he added some caveats, of course, but that was a claim echoed on the same day, actually, I think, 00:15:01.120 |
by OpenAI's Sam Altman. "One of our partners, Color Health, is now using 00:15:05.360 |
GPT-4 for cancer screening and treatment plans. And that's great. And then maybe a future version 00:15:11.280 |
will help discover cures for cancer." Other AI lab leaders like Mark Zuckerberg think those claims 00:15:18.000 |
are getting out of hand. "But, you know, part of that is the open source thing too. So that way, 00:15:22.000 |
other companies out there can create different things and people can just hack on it themselves 00:15:25.440 |
and mess around with it. So I guess that's a pretty deep worldview that I have. And I don't 00:15:31.120 |
know, I find it a pretty big turnoff when people in the tech industry kind of talk about building 00:15:37.200 |
this one true AI. It's like, it's almost as if they kind of think they're creating God or something. 00:15:42.640 |
And it's like, it's just, that's not what we're doing. I don't think that's how this plays 00:15:47.920 |
out." Implicitly, he's saying that companies like OpenAI and Anthropic are getting carried away. 00:15:53.920 |
And later though, in that interview, the CEO of Anthropic admitted that he was somewhat 00:15:58.800 |
pulling things out of his hat when it came to biology and actually with scaling. 00:16:04.080 |
"You know, let's say, you know, you extend people's productive ability to work 00:16:08.080 |
by 10 years, right? That could be, you know, one sixth of the whole economy." 00:16:11.760 |
"Do you think that's a realistic target?" "I mean, again, like I know some biology, 00:16:17.680 |
I know something about how the LLMs are going to happen. I wouldn't be able to tell you exactly 00:16:22.240 |
what would happen, but like, I can tell a story where it's possible." 00:16:26.000 |
"So 15%, and when will we, so when could we have added the equivalent of 10 years to our life? I 00:16:32.640 |
mean, how long, what's the timeframe?" "Again, like, you know, this involves so 00:16:36.560 |
many unknowns, right? If I try and give an exact number, it's just going to sound like hype. But 00:16:41.200 |
like, a thing I could, a thing I could imagine is like, I don't know, like two to three years from 00:16:46.720 |
now, we have AI systems that are like capable of making that kind of discovery. Five years from 00:16:52.560 |
now, those, those discoveries are actually being made. And five years after that, it's all gone 00:16:57.360 |
through the regulatory apparatus and, and really has. So, you know, we're talking about more, 00:17:01.760 |
we're talking about, you know, a little over a decade, but really I'm just pulling things out 00:17:05.840 |
of my hat here. Like, I don't know that much about drug discovery. I don't know that much 00:17:09.440 |
about biology. And frankly, although I invented AI scaling, I don't know that much about that 00:17:15.040 |
either. I can't predict it." The truth, of course, is that we simply don't know what the 00:17:20.160 |
ramifications will be of further scaling and of course, of new research. Regardless, these 00:17:25.600 |
companies are pressing ahead. "Right now, a hundred million. There are models in training 00:17:30.800 |
today that are more like a billion. I think if we go to 10 or a hundred billion, and I think that 00:17:36.000 |
will happen in 2025, 2026, maybe 2027, and the algorithmic improvements continue apace and the 00:17:44.000 |
chip improvements continue apace, then I think there, there is in my mind a good chance that by 00:17:49.680 |
that time we'll be able to get models that are better than most humans at most things." But I 00:17:55.440 |
want to know what you think. Are we at the dawn of a new era in entertainment and intelligence, 00:18:01.680 |
or has the hype gone too far? If you want to hear more of my reflections, do check out my podcasts 00:18:07.280 |
on Patreon on AI Insiders. You could also check out the dozens of bonus videos I've got on there 00:18:13.360 |
and the live meetups arranged via Discord. But regardless, I just want to thank you for getting 00:18:19.040 |
all the way to the end and joining me in these wild times. Have a wonderful day.