
How Far Can We Scale AI? Gen 3, Claude 3.5 Sonnet and AI Hype



00:00:00.000 | Artificial worlds generated by AI video models have never been more tangible and accessible and
00:00:08.480 | look set to transform how millions and then billions of people consume content. And artificial
00:00:14.000 | intelligence in the form of the new free Claude 3.5 Sonnet is more capable than it has ever been.
00:00:20.960 | But I will draw on interviews in the last few days to show that there are more questions than ever,
00:00:26.400 | not just about the merits of continued scaling of language models but about whether we can rely on
00:00:32.800 | the words of those who lead these giant AI orgs. But first AI video generation which is truly on
00:00:40.720 | fire at the moment. These outputs are from Runway Gen 3 available to many now and to everyone
00:00:47.600 | apparently in the coming days. The audio by the way is also AI generated this time from Udio.
00:00:55.360 | [Music]
00:01:20.240 | And as you watch these videos remember that the AI models that are generating them
00:01:24.720 | are likely trained on far less than 1% of the video data that's available.
00:01:30.400 | Unlike high quality text data video data isn't even close to being used up. Expect generations
00:01:36.560 | to get far more realistic and not in too long either. And by the way if you're bored while
00:01:41.920 | waiting on the Gen 3 wait list do play about with the Luma Dream Machine. I've got to admit it is
00:01:48.640 | pretty fun to generate two images or submit two real ones and have the model interpolate between
00:01:55.040 | them. Now those of you in China have actually already been able to play with a model of similar
00:02:00.800 | capabilities called Kling. But we are all waiting on the release of Sora the most promising video
00:02:08.400 | generation model of them all from OpenAI. Here are a couple of comparisons between Runway Gen 3
00:02:15.120 | and Sora. The prompts used in both cases are identical and there's one example that particularly
00:02:21.440 | caught my eye. As many of us may have realized by now simply training models on more data doesn't
00:02:27.120 | necessarily mean they pick up accurate world models. Now I strongly suspect that Sora was
00:02:33.120 | trained on way more data with way more compute. With its generation at the bottom you can see
00:02:38.960 | that the dust emerges from behind the car. This neatly demonstrates the benefits of scale but
00:02:44.960 | still leaves open the question about whether scale will solve all. Now yes it would be simple
00:02:51.280 | to extrapolate a straight line upwards and say that with enough scale we get a perfect world
00:02:57.040 | simulation but I just don't think it will be like that. And there are already more than tentative
00:03:02.000 | hints that scale won't solve everything. More on that in just a moment but there is one more
00:03:07.440 | modality I am sure we were all looking forward to which is going to be delayed. That's the real-time
00:03:13.840 | advanced voice mode from OpenAI. It was the star of the demo of GPT-4o and was promised in the
00:03:21.040 | coming weeks. Alas though it has now been delayed to the fall or the autumn and they say that's in
00:03:27.840 | part because they want to improve the model's ability to detect and refuse certain content.
00:03:32.960 | I also suspect, though, that just like the dodgy physics with the video generation and hallucinations with the
00:03:38.240 | language generation, they also realized it occasionally goes off the rails. Now I personally
00:03:44.480 | find this funny but you let me know whether this would be acceptable to release. "Refreshing
00:03:49.680 | coolness in the air that just makes you want to smile and take a deep breath of that crisp
00:03:55.040 | invigorating breeze. The sun's shining but it's got this lovely gentle warmth that's just perfect
00:04:02.880 | for a light jacket." So either way we're definitely gonna have epic entertainment but the question is
00:04:08.320 | what's next? Particularly when it comes to the underlying intelligence of models is it a case
00:04:13.360 | of shooting past human level or diminishing returns? Well here's some anecdotal evidence with
00:04:19.520 | the recent release of Claude 3.5 Sonnet from Anthropic. It's free and fast and in certain
00:04:26.240 | domains more capable than comparable language models. This table I would say shows you a
00:04:31.200 | comparison on things like basic mathematical ability and general knowledge compared to models
00:04:36.320 | like GPT-4o and Gemini 1.5 Pro from Google. I would caution that many of these benchmarks have
00:04:42.640 | significant flaws so decimal point differences I wouldn't pay too much attention to. The most
00:04:48.000 | interesting comparison I would argue is between Claude 3.5 Sonnet and Claude 3 Sonnet. There is
00:04:53.840 | some evidence that Claude 3.5 Sonnet was trained on about four times as much compute as Claude 3
00:04:59.520 | Sonnet and you can see the difference that makes. Definitely a boost across the board but it would
00:05:04.640 | be hard to argue that it's four times better and in the visual domain it is noticeably better than
00:05:11.360 | its predecessor and than many other models and I got early access so I tested it a fair bit.
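
To put some rough numbers on why roughly four times the compute rarely reads as four times better, here is a minimal sketch assuming a Chinchilla-style power law in which only the reducible part of the loss shrinks with compute. The functional form is standard in the scaling-law literature, but every constant below is invented for illustration; these are not Anthropic's numbers, and the benchmarks above measure accuracy rather than loss.

```python
# Purely illustrative: a Chinchilla/Kaplan-style power law in which the reducible
# part of the loss falls as compute**(-alpha). Every constant here is invented
# for demonstration; these are not published figures for Claude 3 or Claude 3.5.

L_FLOOR, A, ALPHA = 1.7, 3.0, 0.05   # made-up irreducible loss, scale and exponent

def loss(compute: float) -> float:
    """Hypothetical training loss versus compute (arbitrary units)."""
    return L_FLOOR + A * compute ** -ALPHA

base, scaled = 1.0, 4.0              # "about four times as much compute"
gap_shrink = 1 - (scaled / base) ** -ALPHA

print(f"loss at 1x compute: {loss(base):.3f}")    # 4.700
print(f"loss at 4x compute: {loss(scaled):.3f}")  # 4.499
print(f"4x compute shrinks the reducible loss by about {gap_shrink:.1%}")  # ~6.7%
# With an exponent near 0.05, quadrupling compute trims the reducible loss by a
# few percent: a real boost across the board, but nothing like "4x better".
```

On a curve like that the model keeps improving smoothly, which is exactly why the gains show up everywhere in the table yet never look like a fourfold jump.
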
00:05:16.720 | These kinds of visual benchmarks test reading charts and diagrams and answering basic questions about them
00:05:22.240 | but the real question is how much extra compute and therefore money can these companies continue
00:05:27.680 | to scale up and invest if the returns are still incremental? In other words how much more will you
00:05:34.720 | and more importantly businesses continue to pay for these incremental benefits? After all in no
00:05:40.720 | domains are these models reaching a hundred percent and let me try to illustrate that with
00:05:45.760 | an example and as we follow this example ask yourself whether you would pay four times as much
00:05:50.240 | for a five percent hallucination rate versus an eight percent hallucination rate if in both cases
00:05:56.240 | you have to check the answer anyway.
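
Here is a rough back-of-the-envelope sketch of that trade-off. The prices, error rates and checking times are all invented for illustration; the only point is that when every answer has to be verified anyway, the fixed cost of checking dominates.

```python
# Back-of-the-envelope sketch, with entirely invented numbers, of the question:
# is a 5% hallucination rate worth paying 4x more for than an 8% rate, if you
# still have to check every answer either way?

CHECK_MIN = 2.0   # assumed minutes to verify any answer (paid regardless of model)
FIX_MIN = 10.0    # assumed minutes to redo an answer the model got wrong

def expected_minutes(error_rate: float) -> float:
    """Expected human time per answer: always check, occasionally also fix."""
    return CHECK_MIN + error_rate * FIX_MIN

cheap = expected_minutes(0.08)    # cheaper model, 8% hallucination rate
pricey = expected_minutes(0.05)   # four-times-the-price model, 5% rate

print(f"cheaper model: {cheap:.2f} min per answer")   # 2.80
print(f"pricier model: {pricey:.2f} min per answer")  # 2.50
print(f"saved: {cheap - pricey:.2f} min ({(cheap - pricey) / cheap:.0%})")  # 0.30 min, 11%
# Under these made-up assumptions, quadrupling the price buys back about eleven
# percent of your time, because the checking step never goes away.
```

00:05:56.240 | Let me demonstrate with the brilliant new feature you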
00:06:00.240 | can use with Claude 3.5 Sonnet from Anthropic. It's called Artifacts. Think of it like an
00:06:05.600 | interactive project that you can work on alongside the language model. I dumped a multi-hundred page
00:06:11.520 | document on the model and asked the following question. Find three questions on functions from
00:06:16.560 | this document and turn them into clickable flashcards in an artifact with full answers
00:06:21.440 | and explanations revealed interactively. It did it and that is amazing but there's one slight
00:06:27.600 | problem. Question one is perfect. It's a real question from the document displayed perfectly
00:06:33.120 | and interactive with the correct answer and explanation. Same thing for question two but
00:06:38.400 | then we get to question three where it copied the question incorrectly. Worse than that it rejigged
00:06:44.160 | and changed the answer options. Also is there a real difference between q squared and negative
00:06:50.160 | q squared when it claimed that negative q squared is the answer? Now you might find this example
00:06:55.760 | trivial but I think it's revealing. Don't get me wrong this feature is immensely useful and it
00:07:00.640 | wouldn't take me long to simply tweak that third question and by the way finding those three
00:07:05.520 | examples strewn across a multi-hundred page document is impressive. Even though it would
00:07:10.640 | save me some time I would still have to diligently check every character of Claude's answer and at
00:07:16.800 | the moment as I discussed in more detail in my previous video there is no indication that scale
00:07:22.720 | will solve this issue. Now if you think I'm just quibbling and benchmarks show the real progress
00:07:28.240 | well here is the reasoning lead at Google DeepMind working on their Gemini series of models. Someone
00:07:34.640 | pointed out a classic reasoning error made by Claude 3.5 Sonnet and Denny Zhou said this "Love
00:07:40.880 | seeing tweets like this rather than those on LLMs with PhD/superhuman intelligence or fancy results
00:07:48.400 | on leaked benchmarks." I'm definitely not the only one skeptical of benchmark results and an even
00:07:54.320 | more revealing response to Claude 3.5's basic errors came from OpenAI's Noam Brown. I think
00:08:00.320 | it's more revealing because it shows that those AI labs Anthropic and OpenAI had their hopes
00:08:05.840 | slightly dashed based on the results they expected in reasoning from multimodal training. Noam Brown
00:08:12.080 | said frontier models like GPT-4o and now Claude 3.5 Sonnet may be at the level of a "smart high
00:08:19.280 | schooler" mimicking the words of Mira Murati CTO of OpenAI in some respects but they still struggle
00:08:25.520 | on basic tasks like tic-tac-toe. And here's the key quote "There was hope that native multimodal
00:08:32.480 | training would help with this kind of reasoning but that hasn't been the case." That last sentence
00:08:38.560 | is somewhat devastating to the naive scaling hypothesis. "There was hope that native
00:08:44.240 | multimodal training on things like video from YouTube would teach models a world model. It
00:08:49.600 | would help but that hasn't been the case." Now of course these companies are working on far more
00:08:54.480 | than just naive scaling as we'll hear in a moment from Bill Gates but it's not like you can look at
00:08:58.800 | the benchmark results on a chart and just extrapolate forwards. Here's Bill Gates promising
00:09:04.000 | two more turns of scaling, I think he means two more orders of magnitude,
00:09:07.840 | but notice how he looks sceptical about how that will be enough. "The big frontier is not so much
00:09:13.840 | scaling. We have probably two more turns of the crank on scaling whereby accessing video data and
00:09:23.440 | getting very good at synthetic data that we can scale up probably you know two more times. That's
00:09:32.160 | not the most interesting dimension. The most interesting dimension is what I call metacognition
00:09:37.920 | where understanding how to think about a problem in a broad sense and step back and say okay how
00:09:45.760 | important is this answer, how could I check my answer, you know what external tools would help
00:09:50.800 | me with this? So we're going to get the scaling benefits but at the same time the various actions
00:10:00.400 | to change the underlying reasoning algorithm from the trivial that we have today to more human-like
00:10:10.320 | metacognition, that's the big frontier. It's a little hard to predict how quickly that'll happen.
00:10:18.320 | You know I've seen that we will make progress on that next year but we won't completely solve it
00:10:24.160 | for some time after that." And there were others who used to be incredibly bullish on scaling that
00:10:30.800 | now sound a little different. Here's Microsoft AI CEO Mustafa Suleyman perhaps drawing on lessons
00:10:36.720 | from the mostly defunct Inflection AI that he used to run saying it won't be until GPT-6 that AI
00:10:43.040 | models will be able to follow instructions and take consistent action. "There's a lot of cherry
00:10:47.120 | picked examples that are impressive you know on Twitter and stuff like that but to really get it
00:10:52.640 | to consistently do it in novel environments is pretty hard and I think that it's going to be
00:10:58.640 | not one but two orders of magnitude more computation of training the models so not
00:11:04.160 | GPT-5 but more like GPT-6 scale models. So I think we're talking about two years before we have
00:11:10.720 | systems that can really take action." Now based on the evidence that I put forward in my previous
00:11:15.920 | video, let me know if you agree with me, but I still think that's kind of naive. Reasoning
00:11:20.880 | breakthroughs will rely on new research breakthroughs not just more scale. And even
00:11:25.920 | Sam Altman said as much about a year ago saying the era of ever more scaling of parameter count
00:11:32.240 | is over. Now as we'll hear he has since contradicted that saying current models are small relative to
00:11:38.160 | where they'll be. But at this point you might be wondering about emergent behaviors. Don't
00:11:42.480 | certain capabilities just spring out when you reach a certain scale? Well I simply can't resist
00:11:47.520 | a quick plug for my new Coursera series that is out this week. The second module covers emergent
00:11:53.360 | behaviors and if you already have a Coursera account do please check it out it'd be free for
00:11:58.160 | you and if you were thinking of getting one there'll be a link in the description. Anyway
00:12:02.560 | here's that quote from Sam Altman somewhat contradicting the comments he made a year ago.
00:12:07.040 | Models he says get predictably better with scale. "We're still just like so early in developing such
00:12:13.440 | a complex system. There's data issues, there's algorithmic issues, the models are still quite
00:12:20.880 | small relative to what they will be someday and we know they get predictably better."
00:12:24.080 | But this was the point I was trying to make at the start of the video. As I argued in my previous
00:12:29.920 | video I think we're now at a time in AI where we really have to work hard to separate the hype
00:12:36.240 | from the reality. Simply trusting the words of the leaders of these AI labs is less advisable
00:12:42.800 | than ever and of course it's not just Sam Altman. Here's the commitment from Anthropic led by Dario
00:12:48.640 | Amodei back last year. They described why they don't publish their research and they said it's
00:12:52.960 | because "we do not wish to advance the rate of AI capabilities progress" but their CEO just three
00:12:59.440 | days ago said AI is progressing fast due in part to their own efforts. "To try and keep pace with
00:13:06.800 | the rate at which the complexity of the models is increasing. I think this is one of the biggest
00:13:10.800 | challenges in the field. The field is moving so fast, including by our own efforts, that we want
00:13:16.160 | to make sure that our understanding keeps pace with our abilities, our capabilities to produce
00:13:21.840 | powerful models." He then went on to say that today's models are like undergraduates, which
00:13:27.040 | if you've interacted with these models seems pretty harsh on undergraduates. "If we go back
00:13:32.800 | to the analogy of like today's models are like undergraduates, you know, let's say those models
00:13:37.680 | get to the point where, you know, they're kind of, you know, graduate level or strong professional
00:13:42.720 | level. Think of biology and drug discovery. Think of a model that is as strong as, you know, a Nobel
00:13:51.440 | Prize winning scientist or, you know, the head of the, you know, the head of drug discovery at a
00:13:56.080 | major pharmaceutical company." Now, I don't know if he's basing that on a naive trust in benchmarks
00:14:02.240 | or whether he is deliberately hyping. And then later in the conversation with the guy who's
00:14:07.440 | in charge of the world's largest sovereign wealth fund, he described how the kind of AI that
00:14:12.400 | Anthropic works on could be instrumental in curing cancer. "I look at all the things that have been
00:14:17.920 | invented. You know, if I look back at biology, you know, CRISPR, the ability to like edit genes. If
00:14:23.280 | I look at, you know, CAR-T therapies, which have cured certain kinds of cancers, there's probably
00:14:30.800 | dozens of discoveries like that lying around. And if we had a million copies of an AI system that
00:14:38.400 | are as knowledgeable and as creative about the field as all those scientists that invented those
00:14:43.680 | things, then I think the rate of those discoveries could really proliferate. And, you know, some of
00:14:49.280 | our really, really longstanding diseases, you know, could be addressed or even cured." Now,
00:14:55.760 | he added some caveats, of course, but that was a claim echoed on the same day, actually, I think,
00:15:01.120 | by OpenAI's Sam Altman. "One of our partners, Color Health, is now using
00:15:05.360 | GPT-4 for cancer screening and treatment plans. And that's great. And then maybe a future version
00:15:11.280 | will help discover cures for cancer." Other AI lab leaders like Mark Zuckerberg think those claims
00:15:18.000 | are getting out of hand. "But, you know, part of that is the open source thing too. So that way,
00:15:22.000 | other companies out there can create different things and people can just hack on it themselves
00:15:25.440 | and mess around with it. So I guess that's a pretty deep worldview that I have. And I don't
00:15:31.120 | know, I find it a pretty big turnoff when people in the tech industry kind of talk about building
00:15:37.200 | this one true AI. It's like, it's almost as if they kind of think they're creating God or something.
00:15:42.640 | And it's like, it's just, that's not what we're doing. I don't think that's how this plays
00:15:47.920 | out." Implicitly, he's saying that companies like OpenAI and Anthropic are getting carried away.
00:15:53.920 | And later though, in that interview, the CEO of Anthropic admitted that he was somewhat
00:15:58.800 | pulling things out of his hat when it came to biology and actually with scaling.
00:16:04.080 | "You know, let's say, you know, you extend people's productive ability to work
00:16:08.080 | by 10 years, right? That could be, you know, one sixth of the whole economy."
00:16:11.760 | "Do you think that's a realistic target?" "I mean, again, like I know some biology,
00:16:17.680 | I know something about how the AMLs are going to happen. I wouldn't be able to tell you exactly
00:16:22.240 | what would happen, but like, I can tell a story where it's possible."
00:16:26.000 | "So 15%, and when will we, so when could we have added the equivalent of 10 years to our life? I
00:16:32.640 | mean, how long, what's the timeframe?" "Again, like, you know, this involves so
00:16:36.560 | many unknowns, right? If I try and give an exact number, it's just going to sound like hype. But
00:16:41.200 | like, a thing I could, a thing I could imagine is like, I don't know, like two to three years from
00:16:46.720 | now, we have AI systems that are like capable of making that kind of discovery. Five years from
00:16:52.560 | now, those, those discoveries are actually being made. And five years after that, it's all gone
00:16:57.360 | through the regulatory apparatus and, and really has. So, you know, we're talking about more,
00:17:01.760 | we're talking about, you know, a little over a decade, but really I'm just pulling things out
00:17:05.840 | of my hat here. Like, I don't know that much about drug discovery. I don't know that much
00:17:09.440 | about biology. And frankly, although I invented AI scaling, I don't know that much about that
00:17:15.040 | either. I can't predict it." The truth, of course, is that we simply don't know what the
00:17:20.160 | ramifications will be of further scaling and of course, of new research. Regardless, these
00:17:25.600 | companies are pressing ahead. "Right now, a hundred million. There are models in training
00:17:30.800 | today that are more like a billion. I think if we go to 10 or a hundred billion, and I think that
00:17:36.000 | will happen in 2025, 2026, maybe 2027, and the algorithmic improvements continue apace and the
00:17:44.000 | chip improvements continue apace, then I think there, there is in my mind a good chance that by
00:17:49.680 | that time we'll be able to get models that are better than most humans at most things." But I
00:17:55.440 | want to know what you think. Are we at the dawn of a new era in entertainment and intelligence,
00:18:01.680 | or has the hype gone too far? If you want to hear more of my reflections, do check out my podcasts
00:18:07.280 | on Patreon on AI Insiders. You could also check out the dozens of bonus videos I've got on there
00:18:13.360 | and the live meetups arranged via Discord. But regardless, I just want to thank you for getting
00:18:19.040 | all the way to the end and joining me in these wild times. Have a wonderful day.