The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think
Chapters
0:00 Introduction
0:57 Claude 3.5 Sonnet (New) Paper
2:06 Demo
2:58 OSWorld
4:29 Benchmarks compared + OpenAI Response
8:30 Tau-Bench
13:09 SimpleBench Results
17:05 Yellowstone Detour
17:29 Runway Act-One
18:44 HeyGen Interactive Avatars + Demo
21:06 NotebookLM Update
The new Claude 3.5 Sonnet from Anthropic, their flagship language model, is a significant step forward.
But wait, no, not particularly because it can use a mouse to perform a basic Google search.
I'll dive into a couple of papers to show that not all is as it seems on that front.
And for now, I can browse Google for National Park photos myself.
No, for me, the new 3.5 Sonnet is a step forward because, yet again, we have progress on that deeper front: raw reasoning ability.
And I do have some new evidence to back that up.
And while Anthropic were cooking, HeyGen and Runway were also in the kitchen, stirring up AIs you can Zoom chat with and create movie scenes alongside.
Most of this stuff, of course, has only been out a day or at most two.
And the first TLDR I'll give is: don't sleep on the new Claude 3.5 Sonnet.
Even if you're not interested in an LLM taking over your computer, its reasoning, coding, and visual processing abilities are a step forward.
I'll also touch on one of its key weaknesses in just a few minutes.
First, of course, the name: somewhat confusingly, it's still 3.5 Sonnet, just with "(new)" in brackets.
Feels like they could have called it at least 3.6 Sonnet, but for now, that "(new)" is the giveaway.
It has knowledge of world events up until April of 2024, which is not bad compared to its rivals.
Naturally, though, most of the attention will be focused on the new Claude 3.5 Sonnet's ability to use a computer.
I think it's safe to say that as of today, that won't see broad public adoption, because of its unreliability and the long list of things that it can't do, like send emails, make purchases, (technically) complete CAPTCHAs, or edit and manipulate images.
I feel like it's more Anthropic dipping their toe into the water.
And it looks like it's found something great.
So how far away is the location from my place?
Searching for the distance between my area and the hiking location.
So now it looks like Claude is searching for the sunrise time tomorrow, and is now dropping...
But from what we know so far, how does the new Claude 3.5 Sonnet actually do when given control of a computer?
The first benchmark that Anthropic cites in the paper is the OSWorld benchmark.
Naturally, I read that paper to get some more context on what the 3.5 model can actually do.
It's over 350 tasks covering professional use, office use, and daily use like shopping.
One small detail from this paper that I think many analysts might miss is how they derived the human baseline.
The tasks, like changing the settings in a set of slides in a presentation, were attempted by computer science majors: college students who possess basic software usage skills but have not been exposed to these exact samples or the software before.
So when you hear about human performance accuracy on this benchmark being around 72%, that's the performance of computer science students, not average users.
Or, going back to the Sonnet paper, they're not comparing Claude 3.5 Sonnet to a bunch of average computer users.
Anyway, if given 50 steps, Claude 3.5 Sonnet, the new one, gets 22% versus those comp-sci students' roughly 72%.
Now I can see someone at the back raising their hand, saying, well, the benchmark states it has to be done in 15 steps, and that's where Claude 3.5 Sonnet (new) gets 15%.
What if the humans were given 50 steps, which is where Claude gets 22%?
But still, my only point is that if you were to compare Claude 3.5 Sonnet's performance to literally an average human, someone not particularly versed in computers, I think the gap would be far smaller.
And in software engineering, Claude 3.5 Sonnet would absolutely crush that old biddy.
No, but to be slightly more serious: in the benchmark that OpenAI itself created, SWE-bench Verified (Software Engineering Benchmark, Verified), Claude 3.5 Sonnet, the new one (I'm going to have to keep saying "new"), gets 49%, and that beats o1-preview.
I dug into the o1 paper to get this comparison, where you can see even pre-mitigation o1-preview gets 38.4% in this software engineering benchmark, with the post-mitigation model, after safety training, scoring lower still.
Now, of course, it's hard to have an apples-to-apples comparison, because it depends on prompting and scaffolding, but you can see, as of today, the new 3.5 Sonnet is the best public model for software engineering.
As you might expect, I'm not just going to rely on published benchmark performance, though.
I ran my own benchmark, SimpleBench, on the new 3.5 Sonnet, and I'm going to give you the results later in this video.
In this chart, though, you can see the comparison between the new Claude 3.5 Sonnet and the original Claude 3.5 Sonnet, as well as, interestingly, GPT-4o, but not o1-preview.
So what most people want to know is: is it better?
If you're asking it a challenging science question, it will do better than the previous version.
In coding, as we've seen, it's simply better than the original 3.5 Sonnet.
And even in visual question answering, answering questions from tables, charts, and graphs, it's slightly better than the previous version.
And you're about to see further results confirming how much of a step forward it is.
And so, for the vast majority of people, those are the headlines.
But the LLM nerds like me, and some of you watching, would probably have liked a direct comparison with o1-preview.
And that does seem to be an industry trend, where for these benchmark figures, companies pick and choose which rival models they include and which they don't.
Here's just another quick example, with the release of the Ministral models, 3B and 8B, from Mistral.
That's the small French outfit that is holding the torch for mainland Europe in AI.
Anyway, in the small model category, here are the sets of results that they released.
This was just a week ago, by the way, and all seems fine and good until a principal researcher at Google DeepMind pointed out that they missed out Gemma 2 9B from DeepMind.
Adding that row in makes the results look slightly different.
I'm not trying to make this a major point, so I'm going to move on in just a moment.
But just to give you an idea, let's take GPQA, where the new Sonnet gets 65%.
o1-preview gets 78.3, and the full o1, interestingly, is slightly down at 78.0.
On the MMMU, which I think is a really great benchmark, again testing whether you can read from graphs, tables, charts, and the like, we have Claude 3.5 Sonnet getting 70.4, and the new o1, which admittedly isn't out yet, scoring higher still.
Finally, in a particularly famous high school math competition (AIME), we have the new Sonnet getting 16%, o1-preview getting 56.7%, and o1 getting 83%.
I think the new Sonnet is going to be massively underestimated.
I think it's actually really good at reasoning.
Maybe not these super calculation-heavy, computation-heavy things that o1 can do, just general reasoning.
As you might expect, I've already tested it hundreds of times, and I can attest to that.
I just wanted to make the simple point that sometimes the model providers choose the comparisons that suit them.
On that point, Anthropic say this: "Our evaluation tables exclude OpenAI's o1 model family, as they depend on extensive pre-response computation time, unlike typical models. This fundamental difference in operation makes performance comparisons difficult and outside [the scope of their tables]."
Well, one of the leaders of the work on o1, Jerry Tworek, said this: "Being hard to compare against because of fundamental differences is the highest form of flattery."
What about the intriguing Tau-Bench at the bottom of the table?
Well, of course I dug into that paper, and noticed that one of the authors is a friend of mine.
The short version is that it's a benchmark that tests whether an AI agent can do shopping for you or book an airline ticket, but it has some really interesting nuggets.
They have pass to the power of k (pass^k), not pass@k.
If you hear something like pass@8, that means: does a model get it right at least once in eight attempts?
If you hear pass to the power of k, which I think is a brilliant device, that's: did the model get it right on every single one of the k attempts?
It's no good getting it right once out of eight; the other times, you'll be flying to bloody Mongolia.
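To make that distinction concrete, here is a minimal sketch, my own illustration rather than anything from the Tau-Bench paper (which estimates pass^k from success rates), of computing both metrics over the same recorded trials:

```python
# Illustration only: pass@k vs pass^k over the same recorded trials.
# Each inner list is the outcomes of k = 8 independent attempts at one task.
trials = [
    [True, False, True, True, False, True, False, True],     # task 1
    [True, True, True, True, True, True, True, True],        # task 2
    [False, False, True, False, False, False, False, False], # task 3
]

# pass@k: credit if the model succeeds at least once in its k attempts.
pass_at_k = sum(any(attempts) for attempts in trials) / len(trials)

# pass^k: credit only if the model succeeds on all k attempts.
pass_pow_k = sum(all(attempts) for attempts in trials) / len(trials)

print(f"pass@k = {pass_at_k:.2f}")   # 1.00, every task solved at least once
print(f"pass^k = {pass_pow_k:.2f}")  # 0.33, only task 2 solved every time
```

Same trials, wildly different headline numbers, which is exactly the point of the metric.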
Oh, and by the way, the company behind this benchmark is now valued at over 4 billion dollars.
I believe they're trying to automate customer service, and that reminds me about these valuations.
It's a bit like: they might only have a 3% or 4% chance of automating customer service, but if they do, and they get the financial rewards for doing so, then they're going to be worth vastly more than that.
So for me, when I hear a $4 billion valuation, I don't think they're literally worth 4 billion dollars.
I think, just my opinion of course, it's like a 90% chance they're worth very little or a small amount, but then a 10% chance, or a 4% chance, they're worth trillions.
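That framing is just loose, lottery-ticket expected-value arithmetic. With entirely made-up numbers for illustration, even a tiny chance of a trillion-dollar outcome can justify the headline figure:

$$\mathbb{E}[\text{value}] \approx 0.996 \times \$0 \;+\; 0.004 \times \$1\,\text{trillion} \;=\; \$4\,\text{billion}$$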
Anyway, the tasks are kind of as you'd expect: for the retail side, it will be helping users cancel or modify pending orders, return delivered orders, modify user addresses, that kind of thing.
And then for the Tau-Bench airline part of the benchmark, which of course is much higher stakes, the agent has to help users book, modify, or cancel flight reservations.
And this is what I meant in the introduction when I said not all is as it seems.
For this set of results, I really think it's worth paying attention, because the next 6 to 18 months could be dominated by AI agents and the related benchmarks.
Yes, Claude 3.5 Sonnet, the new one, is significantly better than the previous version, but I admire Anthropic for putting out these results, because they don't always shine the best light on the model.
At least, I'm talking about computer use here; again, the reasoning is amazing.
Because for the airline tasks, 46% given one try isn't amazing.
I think it sums up maybe the next 6 to 18 months.
What about pass to the power of k, where k is, say, 8?
I just want to restate what that means, for extra clarity.
To pass to the power of 8, you have to get it right, and then right, and then right, 8 times in a row.
A single mistake, even on your 7th try, screws up the entire trial.
And so what we get for Sonnet, and I suspect all language models, including o1, is a reverse scaling chart: as you scale up the number of attempts, performance consistently drops.
This, I suspect, is the kind of scaling that currently AI CEOs don't really want you to see.
This was for the slightly easier retail task.
And if you just ask it once, look, we get 70% for the new Sonnet.
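To see why the curve has to fall, take a naive illustration with a made-up independence assumption: if every attempt succeeded independently with probability 0.70, then eight successes in a row would happen only about

$$0.70^{8} \approx 0.058$$

of the time, roughly 5.8%. In practice failures correlate, models tend to fail the same hard tasks every time, so the observed pass^k sits above that naive floor, but the compounding pressure is the same.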
I'm just saying this reminds me that reliability is the single barrier left, I feel, to massive adoption of AI agents.
Speaking specifically about LLMs here: they can, quote, achieve harder and harder tasks.
But that won't mean that much until the reliability on basic tasks gets better.
Mind you, I don't think that's terribly far away, especially when we scale up test-time compute.
Interestingly, the new Sonnet is actually slightly worse at refusals.
In other words, it correctly refuses WildChat toxic requests slightly less often, and it incorrectly refuses innocent requests slightly more often than the previous model.
Do you remember, from the o1 release, how it actually performed slightly worse in creative writing?
Well, that's not the case for the new 3.5 Sonnet.
It crushes the original Claude 3.5 Sonnet, winning at least 58% of the time.
Interestingly, on multilingual challenges, it's slightly worse than the previous version.
But what about my own benchmark, SimpleBench?
Well, I tried to get crazy ambitious, with majority voting and architectures involving self-consistency.
But in the end, that just slightly delayed me.
So I just want to get it out to you as soon as possible.
Myself and the senior ML engineer that I'm working with on SimpleBench did agree that we should probably put out a technical memo, paper, or report, something to indicate our methodology.
And that report has taken slightly longer than I expected, which is why I can't give you everything just yet.
But I can release the provisional leaderboard.
SimpleBench is a test that can be taken by non-specialized humans from any background.
It asks questions about how things move in space and time, and also tests social intelligence.
We ran the new Claude 3.5 Sonnet this morning and saw a significant step up from the previous version.
Notice also we have the latest results from the new Gemini 1.5 Pro.
Then Command R+, that's the new one, and Grok 2, whose API became available fairly recently.
And for those of you who have been following the channel for a while, you might have noticed that the human baseline has changed.
That's because we expanded the scope out to nine people this time.
The human top performance is still around 96%; I think it's 95.7%.
We expanded the scope and dropped the requirement to be a graduate.
This was nine human beings from truly eclectic backgrounds.
The only commonality is that they spoke English natively.
The average of five means we ran the full benchmark five times for each model and averaged the scores.
You might remember what I was saying about self-consistency.
I don't want to get into it here, but that's not the same as just averaging the results.
It was very complicated, and it actually hurt model performance.
That's a story for another day, but I wanted models to shine as much as they could.
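For clarity, here is a minimal sketch of what "average of five" means as described here, a hypothetical harness rather than the actual SimpleBench code, with run_full_benchmark as a stand-in for whatever scores one complete pass:

```python
import statistics

def average_of_n(model, run_full_benchmark, n=5):
    """Run the whole benchmark n times and average the accuracies."""
    scores = [run_full_benchmark(model) for _ in range(n)]
    return statistics.mean(scores)

# Note: this is not self-consistency / majority voting, which aggregates
# answers per question (e.g. taking the modal answer across runs) rather
# than averaging whole-run accuracies.
```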
And some of you may be wondering at this point if I followed the recent debates in the last couple of weeks about models' reasoning and whether prompting can overcome those limitations.
Yes, prompting can boost performance, including to slightly above 50%, but we're still nowhere near the human baseline.
I will say, to some of you who are new to the channel, that through prompting and a particular scaffolding, myself and Josh, an ML engineer, came up with SmartGPT in May of last year, which got, at the time, a record 89.0% on the MMLU.
So yes, I know that prompting can help, but SimpleBench is a pretty resilient benchmark.
These questions were, of course, vetted multiple times, exhaustively, including by doctoral researchers.
And they weren't like the somewhat flippant example I gave a couple of videos ago.
I kind of meant that one as a quick illustration, but obviously that backfired.
I should probably scroll down, because we did also do GPT-4o mini, and you can see the performance.
So yes, in short, even on my own benchmark, I can confirm that the new Sonnet is a marked step forward.
I'm not the only one who has been impressed, of course.
And earlier in the video, I showed you the reasons not to be overly hyped, but there are some reasons to be just a little bit excited, or terrified, depending on your perspective.
And I can't resist telling you that SimpleBench will be sponsored by Weights & Biases.
I am incredibly selective in who I work with on sponsorships.
And yes, they are sponsoring this video, and we use Weave for SimpleBench.
That's how we could do an incredibly quick turnaround on the new Sonnet results.
My own custom link should be showing on screen and will, of course, be in the description.
But don't forget they also have free courses in their AI Academy.
Now, I can't resist taking a ten-second interlude to point out something you may have heard about already, which is that Claude, at one point, took a break during a coding demo to peruse photos of Yellowstone National Park.
Whether that's misalignment or massively amusing will, of course, depend on your perspective.
But it must be said that Claude isn't the only thing that dropped yesterday.
While this particular development, and the other two I'm about to show you, might not completely blow your socks off, they do point to the unrelenting improvement in AI-generated content.
With Runway's Act-One, you can see the actress driving the performance, showing emotion, and then that performance being generated by the model into different character animations, in all sorts of styles.
The lip syncing isn't quite there yet, but nevertheless, here is a 15-second scene.
- You really think you can just walk away from all this?
- What would you have me do? Stay? Just watch it all crumble around me?
- You can let me know what you think in the comments, but while generations have this AI tinge about them, I wonder if they will struggle to achieve public acceptance.
I suspect the impact of this kind of technology in video games will be much greater than in movies and TV, where people want everything to be perfect.
Then there's HeyGen's new interactive avatar, and I think this one somewhat flew under the radar.
I will admit that the lip syncing, again, is not the best, but did you know you can have a live Zoom call with an AI avatar?
Like, literally, you open the tab and it's a Zoom meeting; you log in and then you're chatting with an avatar.
Right at the beginning of this year, I predicted that this would happen this year, but I didn't think it would come from someone like HeyGen.
I thought it might come from OpenAI, but either way, true Zoom calls with AI models are now a reality.
- Hey Vicky, this is Philip from AI Explained, and I'm actually testing you for a video that's coming out soon.
And can you confirm this is a live Zoom call?
- I see, Philip. Just to clarify, this is a simulated role play for conversation practice.
However, I'm here to help you navigate any scenarios you want to explore.
What specific conversation would you like to role play?
- Well, it's kind of live. I mean, you're not alive, but it's live in the sense that I'm speaking to you now and you're reacting in real time, right?
- Got it, Philip, you're right, I'm reacting in real time.
So let's focus on your video content. What scenario do you want to role play that relates to it?
- I'm just going to demonstrate how I can interrupt you, although you just stopped speaking anyway.
But yeah, your lip syncing is very good, but it's slightly stilted; I can tell you're not real.
- I understand, Philip, I'm here to support your demonstration.
If you'd like, we can role play a scenario where you provide feedback on that.
- Okay, okay, I'm going to interrupt you now, probably, but can you just give a message to my viewers?
- Sure thing, Philip, just let me know what message you'd like me to convey to your viewers.
- No, I want you, that was weird, I want you to come up with a message for my YouTube subscribers.
- ...If you're enjoying the content, don't forget to like, subscribe, and hit that notification bell.
- So I think you get the idea now, but the final of the three updates on different modalities comes from NotebookLM.
Using Gemini 1.5 Pro, it can analyze a source that you feed it, for free, and generate a podcast-style audio conversation about it.
Now, you guys already knew that, because I've covered it before on this channel, but here's the update.
After you choose a file to upload, then, after 30 seconds or so, you will see a new button: Customize.
And I think this is really cool, if you haven't already seen it: you click Customize, and you can steer the audio.
Pick one topic out of a video you feed in, or a PDF.
Explain it in a way that's good for this audience or that audience.
Basically, customization of a tool that was already pretty cool.
Maybe half of you might already know about that capability, but for the other half, do check it out.
Of course, the big news for me remains the release of the new 3.5 Sonnet.
And those were my takeaways from all the associated papers.
The timing was great, because I've just recovered from what could be either COVID or a heavy cold, but I am back in action and really looking forward to fully releasing SimpleBench.
Of course, I would love to see you over on Insiders, and, quite topically on Anthropic, here's my 35-minute analysis of their CEO's deep dive on a compressed 21st century.
Do check it out, but regardless, thank you so much for watching, and have a wonderful day.