The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think

Chapters
0:00 Introduction
0:57 Claude 3.5 Sonnet (New) Paper
2:06 Demo
2:58 OSWorld
4:29 Benchmarks compared + OpenAI Response
8:30 Tau-Bench
13:09 SimpleBench Results
17:05 Yellowstone Detour
17:29 Runway Act-One
18:44 HeyGen Interactive Avatars + Demo
21:06 NotebookLM Update
The new Claude 3.5 Sonnet from Anthropic, their flagship language model, is a significant release. But wait, no, not particularly because it can use a mouse to perform a basic Google search. I'll dive into a couple of papers to show that not all is as it seems on that front. And for now, I can browse Google for National Park photos myself. No, for me, the new 3.5 Sonnet is a step forward because, yet again, we have progress on the underlying reasoning front. And I do have some new evidence to back that up. And so while Anthropic were cooking, HeyGen and Runway were also in the kitchen, stirring up AIs you can Zoom chat with and create movie scenes alongside.
Most of this stuff, of course, has only been out a day or at most two. And the first TLDR I'll give is: don't sleep on the new Claude 3.5 Sonnet. Even if you're not interested in an LLM taking over your computer, its reasoning, coding, and visual processing abilities are a step forward. I'll also touch on one of its key weaknesses in just a few minutes. First, of course, the name, and somewhat confusingly, it's still 3.5 Sonnet, just with (new) in brackets. Feels like they could have called it at least 3.6 Sonnet, but for now, that (new) is the giveaway.
It has knowledge of world events up until April of 2024, which is not bad compared to the competition. Naturally, though, most of the attention will be focused on the new Claude 3.5 Sonnet's ability to use a computer. I think it's safe to say that as of today, that won't see broad public adoption because of its unreliability and the long list of things that it can't do, like send emails, make purchases, technically complete CAPTCHAs, or edit or manipulate images. I feel like it's more Anthropic dipping their toe into the water.
And it looks like it's found something great. So how far away is the location from my place? Searching for the distance between my area and the hiking location. So now it looks like Claude is searching for the sunrise time tomorrow and is now dropping that in.
But from what we know so far, how does the new Claude 3.5 Sonnet actually do when given control of a computer? The first benchmark that Anthropic cites in the paper is the OSWorld benchmark. Naturally, I read that paper to get some more context on what the 3.5 model can actually do. It's over 350 tasks covering professional use, office use, and daily use like shopping. One small detail from this paper that I think many analysts might miss is how they derived the human baseline. Not only were these tasks non-trivial, like changing the settings in a set of slides in a presentation; the humans doing them were computer science majors, college students who possess basic software usage skills but have not been exposed to these exact samples or the software before. So when you hear about human performance accuracy on this benchmark being around 72%, that's computer science students, not your average computer user. Or, going back to the Sonnet paper, they're not comparing Claude 3.5 Sonnet to a bunch of everyday computer users.
Anyway, if given 50 steps, Claude 3.5 Sonnet, the new one, gets 22% versus those comp sci students on around 72%. Now I can see someone at the back raising their hand saying, well, the benchmark states it has to be done in 15 steps, and that's where Claude 3.5 Sonnet (new) gets 15%. But what if the humans were also given 50 steps, which is where Claude gets 22%? Still, my only point is that if you were to compare Claude 3.5 Sonnet's performance to literally an average human, someone not particularly versed in computers, I think the results would look rather different. And in software engineering, Claude 3.5 Sonnet would absolutely crush that old biddy.
No, but to be slightly more serious: on SWE-bench Verified, the software engineering benchmark variant that OpenAI itself created, Claude 3.5 Sonnet, the new one, I'm going to have to keep saying new, gets 49%, and that beats O1 Preview. I dug into the O1 paper to get this comparison, where you can see even pre-mitigation O1 Preview gets 38.4% on this software engineering benchmark, and post-mitigation, after safety training, it scores lower still. Now, of course, it's hard to have an apples to apples comparison because it depends on prompting and scaffolding, but you can see that as of today, the new 3.5 Sonnet is the best model around for software engineering.
As you might expect, I'm not just going to rely on published benchmark performance, though. I ran my own benchmark, SimpleBench, on the new 3.5 Sonnet, and I'm going to give you the results a little later in the video. In this chart, though, you can see the comparison between the new Claude 3.5 Sonnet and the original Claude 3.5 Sonnet, as well as, interestingly, GPT-4o, but not O1 Preview.
So what most people want to know is: is it better? If you're asking it a challenging science question, it will do better than the previous version. In coding, as we've seen, it's simply better than the original 3.5 Sonnet. And even in visual question answering, answering questions from tables, charts, and graphs, it's slightly better than the previous version. And you're about to see further results confirming how much of a step forward it is. And so for the vast majority of people, those are the headlines. But for the LLM nerds like me and some of you watching, we probably would have liked to see O1 Preview in those tables too. And that does seem to be an industry trend, where for these benchmark figures, companies pick and choose which rival models they include and which they don't.
Here's just another quick example with the release of the Ministral models, 3B and 8B, from Mistral. That's the small French outfit that is holding the torch for mainland Europe in AI. Anyway, in the small model category, here are the sets of results that they released. This was just a week ago, by the way, and all seems fine and good until a principal researcher at Google DeepMind pointed out that they missed out Gemma 2 9B from DeepMind. Adding that row in makes the results look slightly different. I'm not trying to make this a major point, so I'm going to move on in just a moment.
But just to give you an idea, let's take GPQA, where the new Sonnet gets 65%. O1 Preview gets 78.3% and the full O1, interestingly, slightly down at 78.0%. On the MMMU, which I think is a really great benchmark, again testing whether you can read from graphs, tables, charts, and the like, we have Claude 3.5 Sonnet getting 70.4% and the new O1, which admittedly isn't out yet, scoring higher still. Finally, in a particularly famous high school math competition, we have the new Sonnet getting 16%, O1 Preview getting 56.7%, and O1 getting 83%. I think the new Sonnet is going to be massively underestimated. I think it's actually really good at reasoning. Maybe not these super calculation-heavy, computation-heavy things that O1 can do, just general reasoning.
As you might expect, I've already tested it hundreds of times and I can attest to that. I just wanted to make the simple point that sometimes the model providers choose the comparisons that suit them. On that point, Anthropic say this: "Our evaluation tables exclude OpenAI's O1 model family as they depend on extensive pre-response computation time, unlike the typical models. This fundamental difference in operation makes performance comparisons difficult and outside the scope of this report." Well, one of the leaders on the work on O1, Jerry Tworek, said this: "Being hard to compare against because of fundamental differences is the highest form of flattery."
What about the intriguing Tau-Bench at the bottom? Well, of course I dug into that paper and noticed that one of the authors is a friend of mine. The short version is that it's a benchmark that tests whether an AI agent can do shopping for you or book an airline ticket, but it has some really interesting nuggets. They have a pass to the power of K, not pass at K, pass to the power of K. If you hear something like pass at eight, that means: does a model get it right at least once out of eight attempts? If you hear pass to the power of K, which I think is a brilliant device, that's: did the model get it right all eight times out of eight? It's no good getting it right once out of eight; otherwise, you'll be flying to bloody Mongolia.
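To make that distinction concrete, here is a minimal sketch, using made-up trial outcomes rather than the actual Tau-Bench harness, of how the two metrics score the same eight attempts at one task:

```python
# Minimal sketch of the pass@k vs pass^k distinction (illustrative only,
# not the actual Tau-Bench evaluation code).

def pass_at_k(trial_results: list[bool]) -> bool:
    """pass@k: the task counts as solved if ANY of the k attempts succeeded."""
    return any(trial_results)

def pass_to_the_power_k(trial_results: list[bool]) -> bool:
    """pass^k: the task counts as solved only if ALL k attempts succeeded."""
    return all(trial_results)

# Hypothetical outcomes for one booking task attempted k = 8 times.
trials = [True, True, True, True, True, True, False, True]

print(pass_at_k(trials))            # True  - it got it right at least once
print(pass_to_the_power_k(trials))  # False - one failure ruins the whole trial
```

In other words, pass^k is the metric a real customer cares about: the agent has to get the booking right every single time.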
Oh, and by the way, the company behind this benchmark is now valued at over $4 billion. I believe they're trying to automate customer service, and that reminds me about these valuations. It's a bit like they might only have a 3% or 4% chance of automating customer service, but if they do, and they get the financial rewards for doing so, then they're going to be worth a staggering amount. So for me, when I hear a $4 billion valuation, I don't think they're literally worth $4 billion today. I think, just my opinion of course, it's like a 90% chance they're worth very little or a small amount, but then a 10% chance, or a 4% chance, that they're worth trillions.
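As a back-of-the-envelope illustration of that expected-value logic, with entirely made-up numbers rather than anyone's actual figures:

```python
# Toy expected-value calculation behind the "mostly worth little, small chance
# worth a fortune" framing. All numbers are invented for illustration.
p_fail, value_if_fail = 0.96, 0.1e9    # most likely outcome: worth ~$100M or less
p_win,  value_if_win  = 0.04, 1000e9   # small chance of ~$1T if customer
                                       # service really does get automated
expected_value = p_fail * value_if_fail + p_win * value_if_win
print(f"${expected_value / 1e9:.1f}B")  # ~$40.1B, comfortably above a $4B price tag
```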
Anyway, the tasks are kind of as you'd expect: helping users cancel or modify pending orders, returning delivered orders, modifying user addresses, that kind of thing. And then for the Tau-Airline part of the benchmark, which of course is much higher stakes, the agent has to help users book, modify, or cancel flight reservations. And this is what I meant in the introduction when I said not all is as it seems. For this set of results, I really think it's worth paying attention, because the next 6 to 18 months could be dominated by AI agents and the related benchmarks.
Yes, Claude 3.5 Sonnet, the new one, is significantly better than the previous version, but I admire Anthropic for putting out these results, because they don't always shine the best light on the model. At least as far as computer use goes; again, the reasoning is amazing. Because for the airline tasks, 46% given one try isn't amazing. I think it sums up maybe the next 6 to 18 months. What about pass to the power of K, where K is, say, 8? I just want to restate what that means for extra clarity. To pass to the power of 8, you have to get it right, and then right, and then right, 8 times. One mistake, say on your 7th try, screws up the entire trial.
And so what we get for Sonnet, and I suspect all language models, including O1, is a kind of reverse scaling: as you scale up the number of attempts, performance consistently drops. This, I suspect, is the kind of scaling that currently AI CEOs don't really want you to see. This was for the slightly easier retail task. And if you just ask it once, look, we get 70% for the new Sonnet.
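If the attempts were statistically independent, which is a simplifying assumption and not how the benchmark reports its numbers, you can see why a 70% single-try success rate collapses under pass^k:

```python
# Rough intuition pump: treat each attempt as an independent 70% coin flip.
# Real attempts aren't independent, so these are not the reported Tau-Bench
# figures, just an illustration of why pass^k falls as k grows.
p = 0.70
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {p ** k:.1%}")
# pass^1 = 70.0%, pass^2 = 49.0%, pass^4 = 24.0%, pass^8 = 5.8%
```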
I'm just saying this reminds me that reliability is the single barrier left, I feel, to massive adoption. Speaking specifically about LLMs here: they can, quote, achieve harder and harder tasks, but that won't mean that much until the reliability on basic tasks gets better. Mind you, I don't think that's terribly far away, especially when we scale up test-time compute. Interestingly, the new Sonnet is actually slightly worse at refusals. In other words, it correctly refuses WildChat toxic requests slightly less often, and it incorrectly refuses innocent requests slightly more often than the previous model.
Do you remember from the O1 release how it actually performed slightly worse in creative writing? Well, that's not the case for the new 3.5 Sonnet: it beats the original Claude 3.5 Sonnet at least 58% of the time. Interestingly, on multilingual challenges, it's slightly worse than the previous version.
But what about my own benchmark, SimpleBench? Well, I tried to get crazy ambitious with majority voting and architectures involving self-consistency. But in the end, that just slightly delayed me, so I just want to get it out to you as soon as possible. Myself and the senior ML engineer that I'm working with on SimpleBench did agree that we should probably put out a technical memo, paper, or report, something to indicate our methodology. And that report has taken slightly longer than I expected, which is why I can't release everything just yet. But I can release the provisional leaderboard.
SimpleBench is a test that can be taken by non-specialized humans from any background. It asks questions about how things move in space and time, and also tests social intelligence. We ran the new Claude 3.5 Sonnet this morning and saw a significant step up from the previous version. Notice also we have the latest results from the new Gemini 1.5 Pro, Command R+, that's the new one, and Grok 2, whose API became available fairly recently. And for those of you who have been following the channel for a while, you might have noticed that the human baseline has changed. That's because we expanded the scope out to nine people this time. The human top performance is still around 96%; I think it's 95.7%. We expanded the scope and dropped the requirement to be a graduate. This was nine human beings from truly eclectic backgrounds. The only commonality is that they spoke English natively.
The "average of five" means we ran the benchmark five times in full for each model and averaged the results. You might remember what I was saying about self-consistency. I don't want to get into it here, but that's not the same as just averaging the results. It was very complicated and it actually hurt model performance. That's a story for another day, but I wanted models to shine as much as they could.
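For what it's worth, here is a toy sketch of the difference between the two approaches, with hypothetical per-question answers rather than the real SimpleBench data; note that on this toy data the voting happens to help, whereas in practice, as I said, it hurt:

```python
from collections import Counter

# Five full runs of a three-question benchmark (hypothetical answers).
runs = [
    {"q1": "A", "q2": "C", "q3": "B"},
    {"q1": "A", "q2": "D", "q3": "B"},
    {"q1": "B", "q2": "C", "q3": "B"},
    {"q1": "A", "q2": "C", "q3": "A"},
    {"q1": "A", "q2": "D", "q3": "B"},
]
answer_key = {"q1": "A", "q2": "C", "q3": "B"}

def score(answers: dict[str, str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(answers[q] == answer_key[q] for q in answer_key) / len(answer_key)

# "Average of five": score each run separately, then average the five scores.
average_of_five = sum(score(r) for r in runs) / len(runs)

# Self-consistency: majority-vote one answer per question across the runs,
# then score that single voted answer set.
voted = {q: Counter(r[q] for r in runs).most_common(1)[0][0] for q in answer_key}
self_consistency = score(voted)

print(average_of_five, self_consistency)  # ~0.73 vs 1.0 on this toy data
```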
And some of you may be wondering at this point if I've followed the recent debates in the last couple of weeks about models' reasoning and whether prompting can overcome those limitations. Yes, prompting can boost performance, including to slightly above 50%, but we're still nowhere near the human baseline. I will say, to some of you who are new to the channel, that through prompting and a particular scaffolding, myself and Josh, an ML engineer, came up with SmartGPT in May of last year, which got, at the time, a record 89.0% on the MMLU. So yes, I know that prompting can help, but SimpleBench is a pretty resilient benchmark. These questions were of course vetted multiple times, exhaustively, including by doctoral researchers. And they weren't like the somewhat flippant example I gave a couple of videos ago; I kind of meant that one as a quick illustration, but obviously that backfired. I should probably scroll down, because we did also run GPT-4o mini, and you can see the performance. So yes, in short, even on my own benchmark, I can confirm that the new Sonnet is a marked improvement.
I'm not the only one who has been impressed, of course. And earlier in the video, I showed you the reasons not to be overly hyped, but there are some reasons to be just a little bit excited, or terrified, depending on your perspective. And I can't resist telling you that SimpleBench will be sponsored by Weights & Biases. I am incredibly selective in who I work with on sponsorships. And yes, they are sponsoring this video, and we use Weave for SimpleBench. That's how we could do an incredibly quick turnaround on the new Sonnet results. My own custom link should be showing on screen and will of course be in the description. But don't forget they also have free courses in their AI Academy.
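For context on what that quick turnaround looks like in practice, here is a minimal sketch assuming Weave's standard weave.init / weave.op tracing API; the project name, questions, and helper functions are hypothetical, not the actual SimpleBench harness:

```python
import weave

# Start a Weave project; calls to decorated ops below are traced
# (inputs, outputs, latency) in the Weights & Biases dashboard.
weave.init("simplebench-demo")  # hypothetical project name

@weave.op()
def ask_model(question: str) -> str:
    # Placeholder for the real model API call (Anthropic, OpenAI, etc.).
    return "model answer"

@weave.op()
def score_answer(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()

# Re-running on a new model then just means calling the same traced ops,
# which is what makes a same-day leaderboard update feasible.
for question, expected in [("example question", "model answer")]:
    score_answer(ask_model(question), expected)
```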
Now I can't resist taking a 10-second interlude to point out something you may have heard about already, which is that Claude at one point took a break during a coding demo to peruse photos of Yellowstone National Park. Whether that's misalignment or massively amusing will of course depend on your perspective. But it must be said that Claude isn't the only thing that was dropped yesterday. While this particular development and the other two I'm about to show you might not completely blow your socks off, they do point to the unrelenting improvement in AI-generated content. With Runway's Act-One, you can see the actress driving the performance, showing emotion and expression, and then that being generated by the model into different character animations in all sorts of styles. The lip syncing isn't quite there yet, but nevertheless, here is a 15-second scene.
- You really think you can just walk away from all this?
- What would you have me do, stay, just watch it all crumble around me?
- You can let me know what you think in the comments, but while generations have this AI tinge about them, I wonder if they will struggle to achieve public acceptance. I suspect the impact of this kind of technology in video games will be much greater than in movies and TV, where people want everything to be perfect. Then there's HeyGen's new interactive avatar, and I think this one somewhat flew under the radar. I will admit that the lip syncing, again, is not the best, but did you know you can have a live Zoom call with one of these avatars? Like, literally, you open the tab and it's a Zoom meeting, you log in, and then you're chatting with an AI avatar. Right at the beginning of this year, I predicted that this would happen this year, but I didn't think it would come from someone like HeyGen. I thought it might come from OpenAI, but either way, true Zoom calls with AI models are now here.
Hey Vicky, this is Philip from AI Explained, and I'm actually testing you for a video that's coming out soon. And can you confirm this is a live Zoom call?
- I see, Philip. Just to clarify, this is a simulated role play for conversation practice. However, I'm here to help you navigate any scenarios you want to explore. What specific conversation would you like to role play?
- Well, it's kind of live. I mean, you're not alive, but it's live in the sense of I'm speaking to you now and you're reacting in real time, right?
- Got it, Philip, you're right, I'm reacting in real time. So let's focus on your video content. What scenario do you want to role play that relates to your video?
- I'm just going to demonstrate how I can interrupt you, although you just stopped speaking anyway. But yeah, your lip syncing is very good, but it's slightly stilted; I can tell you're not real.
- I understand, Philip, I'm here to support your demonstration. If you'd like, we can role play a scenario where you provide feedback on that.
- Okay, okay, I'm going to interrupt you now, probably, but can you just give a message to my viewers?
- Sure thing, Philip, just let me know what message you'd like me to convey to your viewers.
- No, I want you, that was weird, I want you to come up with a message for my YouTube subscribers.
- If you're enjoying the content, don't forget to like, subscribe and hit that notification bell.
- So I think you get the idea now, but the final of the three updates on different modalities comes from NotebookLM. Using Gemini 1.5 Pro, it can analyze a source that you feed it, for free, and generate a podcast-style discussion about it between two AI hosts. Now, you guys already knew that because I've covered that before on this channel, but here's what's new. After you choose a file to upload, then after 30 seconds or so, you will see a new button, Customize. And I think this is really cool if you haven't already seen it: you click Customize and you can steer what the hosts focus on. Pick one topic out of a video you feed in, or a PDF. Explain it in a way that's good for this audience or that audience. Basically, customization of a tool that was already pretty cool. Maybe half of you might already know about that capability, but for the other half, do check it out.
Of course, the big news for me remains the release of the new 3.5 Sonnet, and those were my takeaways from all the associated papers. The timing was great because I've just recovered from what could be either COVID or a heavy cold, but I am back in action and really looking forward to fully releasing SimpleBench. Of course, I would love to see you over on Insiders, where, quite topically, there's my 35-minute analysis of the Anthropic CEO's deep dive on a compressed 21st century. Do check it out, but regardless, thank you so much for watching and have a wonderful day.