The new Claude 3.5 Sonnet from Anthropic, their flagship language model, is a significant step forward. But not, particularly, because it can use a mouse to perform a basic Google search. I'll dive into a couple of papers to show that not all is as it seems on that front.
And for now, I can browse Google for National Park photos myself. No, for me, the new 3.5 Sonnet is a step forward because, yet again, we have progress on that nebulous front we call reasoning. And I do have some new evidence to back that up. Plus, time waits for no man.
And so while Anthropic were cooking, HeyGen and Runway were also in the kitchen, stirring up AIs you can Zoom chat with and create movie scenes alongside. Most of this stuff, of course, has only been out a day or at most two, but here are my first impressions. And the first TLDR I'll give is: don't sleep on the new Claude 3.5 Sonnet.
Even if you're not interested in an LLM taking over your computer, its reasoning, coding and visual processing abilities are a step forward. I'll also touch on one of its key weaknesses in just a few minutes. First, of course, the name: somewhat confusingly, it's still 3.5 Sonnet, just with "(new)" in brackets.
Feels like they could have called it at least 3.6 Sonnet, but for now, that "(new)" is the giveaway that you're using the new model. It has knowledge of world events up until April of 2024, which is not bad compared to o1-preview's October 2023 cutoff. Naturally, though, most of the attention will be focused on the new Claude 3.5 Sonnet's ability to use your computer via an API.
I think it's safe to say that, as of today, that won't see broad public adoption, because of its unreliability and the long list of things that it can't do, like send emails, make purchases, technically complete CAPTCHAs, or edit or manipulate images. I feel like it's more Anthropic dipping their toe into the water.
And it looks like it's found something great. So how far away is the location from my place? It's opening maps. Searching for the distance between my area and the hiking location. Cool. So now it looks like Claude is searching for the sunrise time tomorrow and is now dropping it into my calendar.
But from what we know so far, how does the new Claude 3.5 Sonnet actually do when given tasks? The first benchmark that Anthropic cites in the paper is the OSWorld benchmark. Naturally, I read that paper to get some more context on what the 3.5 model can actually do.
And I was impressed. It's over 350 tasks covering professional use, office use and daily use like shopping. One small detail from this paper that I think many analysts might miss is how they derived the human average. Not only were these tasks, like changing the settings in a set of slides in a presentation, new to both the humans and the models; look also at which humans they used.
These were computer science majors, college students who possess basic software usage skills but have not been exposed to these exact samples or the software before. So when you hear about human performance accuracy on this benchmark being around 72%, that's a fairly high bar of human performance.
Or, going back to the Sonnet paper, they're not comparing Claude 3.5 Sonnet to a bunch of old biddies; these are comp science majors. Anyway, if given 50 steps, Claude 3.5 Sonnet, the new one, gets 22% versus these comp sci majors getting 72%. Now, I can see someone at the back raising their hand and saying, well, the benchmark states it has to be done in 15 steps, and that's where Claude 3.5 Sonnet (new) gets 15%.
But what if the humans were effectively given 50 steps, which is where Claude gets 22%? Still, my only point is that if you were to compare Claude 3.5 Sonnet's performance to literally an average human, someone not particularly versed in computers, I think the delta would be much smaller. And in software engineering, Claude 3.5 Sonnet would absolutely crush that old biddy.
No, but to be slightly more serious: in SWE-bench Verified, the software engineering benchmark subset that OpenAI itself curated, Claude 3.5 Sonnet, the new one, and I'm going to have to keep saying new, gets 49%, and that beats o1-preview. I dug into the o1 system card to get this comparison, where you can see even pre-mitigation o1-preview gets 38.4% in this software engineering benchmark, compared to post-mitigation, after safety training, getting 28%.
Now, of course, it's hard to have an apples-to-apples comparison because it depends on prompting and scaffolding, but you can see that, as of today, the new 3.5 Sonnet is the best in that benchmark. As you might expect, I'm not just going to rely on published benchmark performance though; I ran my own benchmark, SimpleBench, on the new 3.5 Sonnet, and I'm going to give you the results in a moment.
In this chart, though, you can see the comparison between the new Claude 3.5 Sonnet and the original Claude 3.5 Sonnet, as well as, interestingly, GPT-4o, but not o1-preview. So what most people want to know is: is it better? Is it smarter? And the answer there is a resounding yes.
If you're asking it a challenging science question, it will do better than the previous Claude 3.5 Sonnet. In general knowledge, again, it knows more. In coding, as we've seen, it's simply better than the original 3.5 Sonnet. Likewise, in mathematics, it's a step up. And even in visual question answering, answering questions from tables, charts, and graphs, it's slightly better than the previous version.
And you're about to see further results confirming how much of a step forward it is. So many congratulations to Anthropic. And so, for the vast majority of people, those are the headlines. But for the LLM nerds like me and some of you watching, we probably would have liked to see a direct comparison with o1-preview.
And that does seem to be an industry trend, where for these benchmark figures, companies pick the comparisons they want to make: which rival models they include and which they don't. Here's just another quick example, with the release of the Ministral models, 3B and 8B, from Mistral. That's the small French outfit that is holding the torch for mainland Europe in AI.
Anyway, in the small model category, here are the sets of results that they released. This was just a week ago, by the way, and all seems fine and good, until a principal researcher at Google DeepMind pointed out that they left out Gemma 2 9B from DeepMind. Adding that row in makes the results look slightly different.
I'm not trying to make this a major point, so I'm going to move on in just a moment. But just to give you an idea, let's take GPQA, where the new Sonnet gets 65%. o1-preview gets 78.3%, and the full o1, interestingly, is slightly down at 78.0%. Then there's the MMMU, which I think is a really great benchmark, again testing whether you can read from graphs, tables, charts and the like.
We have Claude 3.5 Sonnet getting 70.4% and the new o1, which admittedly isn't out yet, getting 78.2%, as you can see here. Finally, in a particularly famous high school math competition, we have the new Sonnet getting 16%, o1-preview getting 56.7% and o1 getting 83%. Let me put it this way.
I think the new Sonnet is going to be massively underestimated. I think it's actually really good at reasoning. Maybe not the super calculation-heavy, computation-heavy things that o1 can do, but general basic reasoning and creative writing. As you might expect, I've already tested it hundreds of times, and I can attest to that.
I just wanted to make the simple point that sometimes the model providers choose the comparisons they want to showcase. On that point, Anthropic say this: "Our evaluation tables exclude OpenAI's o1 model family as they depend on extensive pre-response computation time, unlike the typical models. This fundamental difference in operation makes performance comparisons difficult and outside the scope of this report." Well, one of the leaders of the work on o1, Jerry Tworek, said this: "Being hard to compare against because of fundamental differences is the highest form of flattery." But here's a new one.
What about the intriguing TauBench (τ-bench) at the bottom? That's new. Retail? Airline? Well, of course I dug into that paper and noticed that one of the authors is my friend Noah Shinn, so that's cool. The short version is that it's a benchmark that tests whether an AI agent can do shopping for you or book an airline ticket, but it has some really interesting nuggets.
They have pass to the power of k, pass^k, not pass@k. So what's the difference there? If you hear something like pass@8, that means: does the model get it right at least once in eight attempts? If you hear pass^k, which I think is a brilliant device, that's: did it pass all eight times?
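To make that distinction concrete, here's a minimal sketch in Python; it's my own illustration rather than anything from the TauBench codebase, and the toy data simply simulates a model with a 70% per-attempt success rate.

```python
import random

# Hypothetical illustration (not the TauBench implementation): for each task we
# record k independent attempts as booleans, True meaning the agent succeeded.
def pass_at_k(attempts_per_task):
    """pass@k: fraction of tasks solved on at least one of the k attempts."""
    return sum(any(a) for a in attempts_per_task) / len(attempts_per_task)

def pass_pow_k(attempts_per_task):
    """pass^k: fraction of tasks solved on every single one of the k attempts."""
    return sum(all(a) for a in attempts_per_task) / len(attempts_per_task)

# Toy data: 100 tasks, 8 attempts each, each attempt succeeding with probability 0.7.
random.seed(0)
tasks = [[random.random() < 0.7 for _ in range(8)] for _ in range(100)]
print(f"pass@8 = {pass_at_k(tasks):.2f}")   # close to 1.00 here
print(f"pass^8 = {pass_pow_k(tasks):.2f}")  # far lower, roughly 0.7 ** 8
```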
For agent tasks, of course, that's crucial. It's no good getting it right once out of eight; otherwise you'll be flying to bloody Mongolia. It has to be right every single time. Oh, and by the way, the company behind this benchmark is now valued at over $4 billion, just casually.
I believe they're trying to automate customer service, and that reminds me of something about these valuations. It's a bit like this: they might only have a 3% or 4% chance of automating customer service, but if they do, and they get the financial rewards for doing so, then they're going to be worth something like a trillion.
So for me, when I hear a $4 billion valuation, I don't think they're literally worth $4 billion. I think, just my opinion of course, that it's something like a 90% chance they're worth very little or a small amount, but then a 10% or 4% chance they're worth trillions; even a 4% chance of a trillion-dollar outcome works out to an expected value of around $40 billion. Anyway, the tasks are kind of as you'd expect: it will be helping users cancel or modify pending orders, returning delivered orders, modifying user addresses, that kind of thing for the retail benchmark.
And then for the TauBench airline part of the benchmark, which of course is much higher stakes, the agent has to help users book, modify or cancel flight reservations. And here are the headline results. And this is what I meant in the introduction when I said not all is as it seems.
For this set of results, I really think it's worth paying attention, because the next 6 to 18 months could be dominated by AI agents and the related benchmarks. Do you get it right in one try? Yes, Claude 3.5 Sonnet, the new one, is significantly better than the previous version, but I admire Anthropic for putting out these results, because they don't always shine the best light on the new Sonnet.
At least, I'm talking about computer use there; again, the reasoning is amazing. Because for the airline tasks, 46% given one try isn't amazing. And here is the most important chart. I think it sums up maybe the next 6 to 18 months. What about pass^k, where k is, say, 8?
I just want to restate what that means, for extra clarity. To pass^8, you have to get it right, and then right, and then right, 8 times. One mistake, even on your seventh try, screws up the entire trial. And so what we get for Sonnet, and I suspect all language models, including o1, is a reverse scaling law.
As you scale up the number of attempts, performance consistently drops. This I suspect is the kind of scaling that currently AI CEOs don't really want you to focus on. This was for the slightly easier retail task. And if you just ask it once, look, we get 70% for the new Sonnet.
But then across 8 tries, what is that, about 40%? And it's still going down, of course. Imagine this was 100 tries, or 100 customers. I'm just saying, this reminds me that reliability is the single barrier left, I feel, to massive economic impact from AI. Speaking specifically about LLMs here, they can, quote, achieve harder and harder tasks, like getting 80% on the GPQA.
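As a quick back-of-the-envelope check, and this is my own illustration rather than a figure from the TauBench paper, here's what that compounding looks like if you assume, unrealistically, that every try is an independent coin flip at roughly the single-try success rate.

```python
# Back-of-the-envelope only: assume each try is an independent coin flip at the
# single-try success rate, and ask how often you'd pass every one of k tries.
p = 0.70                          # roughly the new Sonnet's one-try rate on retail
for k in (1, 2, 4, 8, 100):
    print(f"k = {k:<3} -> {p ** k:.2%}")   # k = 100 rounds down to 0.00%
# The reported pass^8 (around 40%) beats the ~6% this naive model predicts,
# because the model is consistently right on some tasks; but the downward
# trend as k grows is exactly the reverse scaling described above.
```

Either way, the direction of travel is the same: the more customers in a row you need to get right, the lower the headline number goes.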
But being able to achieve those harder and harder tasks won't mean that much until the reliability on basic tasks gets better. Mind you, I don't think that's terribly far away, especially when we scale up test-time compute. But that's a story for another day. Another quick one: interestingly, the new Sonnet is actually slightly worse at refusals.
In other words, it correctly refuses WildChat toxic requests slightly less often, and it incorrectly refuses innocent requests slightly more often than the previous model. Not super dramatically; I just thought that was interesting to note. Do you remember from the o1 release how it actually performed slightly worse in creative writing compared to GPT-4o?
Well, that's not the case for the new 3.5 Sonnet: it beats the original Claude 3.5 Sonnet at least 58% of the time. Interestingly, on multilingual challenges, it's slightly worse than the previous version of 3.5 Sonnet. But what about my own benchmark, SimpleBench? Well, I tried to get crazy ambitious with majority voting and architectures involving self-consistency.
But in the end, that just slightly delayed me, so I want to get it out to you as soon as possible. The senior ML engineer I'm working with on SimpleBench and I did agree that we should probably put out a technical memo, paper or report, something to indicate our thinking on that matter.
And that report has taken slightly longer than I expected, which is why I can't, as of today, release the full website. But I can release the provisional leaderboard. SimpleBench is a test that can be taken by non-specialized humans from any background, as long as they can speak English. It asks questions about how things move in space and time, and also tests social intelligence.
We ran the new Claude 3.5 Sonnet this morning and saw a significant step up from the previous version. Notice also that we have the latest results from the new Gemini 1.5 Pro and, where is it, Command R+, that's the new one, and Grok 2, whose API became available fairly recently.
And for those of you who have been following the channel for a while, you might have noticed that the human baseline has dropped a bit. That's because we expanded the scope to nine people this time and dropped the requirement to be a graduate. The human top performance is still around 96%; I think it's 95.7%.
This was nine human beings from truly eclectic backgrounds; the only commonality is that they spoke English natively. That's the average we got: 83.7%. The "average of five" means that, for the models, we ran the benchmark five times in full and averaged the results. You might remember what I was saying about self-consistency.
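For anyone curious about the difference between that and straightforward averaging, here's a minimal sketch; get_answer and answer_key are hypothetical stand-ins, not the actual SimpleBench harness.

```python
from collections import Counter

def average_of_runs(questions, answer_key, get_answer, runs=5):
    """'Average of five': run the full benchmark several times, average the scores."""
    scores = []
    for _ in range(runs):
        correct = sum(get_answer(q) == answer_key[q] for q in questions)
        scores.append(correct / len(questions))
    return sum(scores) / len(scores)

def self_consistency(questions, answer_key, get_answer, samples=5):
    """Self-consistency: sample several answers per question, grade the majority vote."""
    correct = 0
    for q in questions:
        votes = Counter(get_answer(q) for _ in range(samples))
        majority_answer, _ = votes.most_common(1)[0]
        correct += (majority_answer == answer_key[q])
    return correct / len(questions)
```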
I don't want to get into it here, but as that sketch suggests, self-consistency is not the same as just averaging the results; the setup we tried was very complicated, and it actually hurt model performance. That's a story for another day, but I wanted models to shine as much as they could. And some of you may be wondering at this point whether I have followed the recent debates of the last couple of weeks about models' reasoning, and whether prompting can overcome those limitations.
Yes, it can boost performance, including to slightly above 50%, but we're still nowhere near the human baseline, and that's with an optimized prompt. I will say, to some of you who are new to the channel, that through prompting and a particular scaffolding, Josh, an ML engineer, and I came up with SmartGPT in May of last year, which got what was at the time a record 89.0% on the MMLU.
So yes, I know that prompting can help, but SimpleBench is a pretty resilient benchmark. These questions were, of course, vetted multiple times, exhaustively, including by doctoral-level researchers. And they weren't like the somewhat flippant example I gave a couple of videos ago about a table being tilted; I kind of meant that one as a quick illustration, but obviously that backfired.
So apologies for the confusion. I should probably scroll down, because we did also do GPT-4o mini, and you can see the performance. So yes, in short, even in my own benchmark, I can confirm that the new Sonnet is a marked improvement. I'm not the only one who has been impressed, of course.
And earlier in the video, I showed you the reasons not to be overly hyped, but there are some reasons to be just a little bit excited, or terrified, depending on your perspective. And I can't resist telling you that SimpleBench will be sponsored by Weights & Biases. And that isn't a random sponsorship.
I am incredibly selective in who I work with on sponsorships. And yes, they are sponsoring this video, and we use Weave for SimpleBench; that's how we could do an incredibly quick turnaround on the new Sonnet results. My own custom link should be showing on screen and will, of course, be in the description.
But don't forget, they also have free courses in their AI Academy. Now, I can't resist taking a 10-second interlude to point out something you may have heard about already, which is that Claude, at one point, took a break during a coding demo to peruse photos of Yellowstone National Park.
Whether that's misalignment or massively amusing will, of course, depend on your perspective. But it must be said that Claude isn't the only thing that was dropped yesterday. Here's Act-One from Runway ML. While this particular development and the other two I'm about to show you might not completely blow your socks off, they do point to the unrelenting improvement in AI-generated entertainment that's coming.
With Runway's Act-One, you can see the actress driving the performance, showing emotion and creativity, and then that being translated by the model into different character animations in all sorts of styles. The lip syncing isn't quite there yet, but nevertheless, here is a 15-second scene.
- You really think you can just walk away from all this? - I have to. There's no other choice. - There's always a choice. You just don't like the alternatives. - What would you have me do, stay, just watch it all crumble around me? - That's what you signed up for.
- You can let me know what you think in the comments, but while generations have this AI tinge about them, I wonder if they will struggle to achieve public acceptance. I suspect the impact of this kind of technology in video games will be much greater than in movies and TV, where people want everything to be perfect.
Then there's HeyGen's new interactive avatar, and I think this one somewhat flew under the radar. I will admit that the lip syncing, again, is not the best, but did you know you can have a Zoom call with one of these avatars? Like, literally, you open the tab and it's a Zoom meeting; you log in and then you're chatting to this AI model.
Right at the beginning of this year, I predicted that this would happen this year, but I didn't think it would come from someone like HeyGen. I thought it might come from OpenAI, but either way, true Zoom calls with AI models are now a reality. Hey Vicky, this is Philip from AI Explained, and I'm actually testing you for a video that's going to go out on YouTube.
And can you confirm this is a live Zoom call? - I see Philip, just to clarify, this is a simulated role play for conversation practice, not a live Zoom call. However, I'm here to help you navigate any scenarios you want to explore. What specific conversation would you like to role play?
- Well, it's kind of live, I mean, you're not alive, but it's live in the sense of I'm speaking to you now and you're reacting in real time, right? - Got it, Philip, you're right, I'm reacting in real time. So let's focus on your video content, what scenario do you want to role play that relates to your project?
- I'm just going to demonstrate how I can interrupt you, although you just stopped speaking immediately. But yeah, your lip syncing is very good, but it's slightly stilted, I can tell you're not real. It's not super easy, but I can tell. - I understand, Philip, I'm here to support your demonstration.
If you'd like, we can role play a scenario where you provide feedback on that. - Okay, okay, I'm going to interrupt you now, probably, but can you just give a message to my viewers? - Sure thing, Philip, just let me know what message you'd like me to convey to your viewers.
- No, I want you-- - I'm ready when you are. - No, I want you, that was weird, I want you to come up with a message for my YouTube subscribers. - Got it, here's the message. Hey everyone, thanks for tuning in. If you're enjoying the content, don't forget to like, subscribe and hit that notification bell.
- Yeah, guys. Do it. Great message, thank you, Vicky. - I didn't catch that. Can you still see me? - So I think you get the idea now, but the final of the three updates on different modalities comes, of course, from NotebookLM. Using Gemini 1.5 Pro, it can analyze a source that you feed it, for free, and generate, through text-to-speech, an engaging podcast.
Now, you guys already knew that, because I've covered it before on this channel, but here's something you might not know. After you choose a file to upload, after 30 seconds or so you will see a new button: Customize, not just Generate. And I think this is really cool if you haven't already seen it: you click Customize, and then you can focus the podcast conversation.
Pick one topic out of a video you feed in, or a PDF. Focus on this and not on that. Explain it in a way that's good for this audience or that audience. Basically, customization of a tool that was already pretty cool. Maybe half of you already know about that capability, but for the other half, do check out NotebookLM.
Of course, the big news for me remains the release of the new 3.5 Sonnet. And those were my takeaways from all the associated papers. The timing was great because I've just recovered from what could be either COVID or a heavy cold, but I am back in action and really looking forward to fully releasing SimpleBench.
Of course, I would love to see you over on Insiders, where, quite topically, you'll find my 35-minute analysis of the Anthropic CEO's deep dive on a compressed 21st century. Do check it out, but regardless, thank you so much for watching and have a wonderful day.