The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think

Chapters
0:00 Introduction
0:57 Claude 3.5 Sonnet (New) Paper
2:06 Demo
2:58 OSWorld
4:29 Benchmarks compared + OpenAI Response
8:30 Tau-Bench
13:09 SimpleBench Results
17:05 Yellowstone Detour
17:29 Runway Act-One
18:44 HeyGen Interactive Avatars + Demo
21:06 NotebookLM Update
The new Claude 3.5 Sonnet from Anthropic, their flagship language model, is a significant release. But wait, no, not particularly because it can use a mouse to perform a basic Google search. I'll dive into a couple of papers to show that not all is as it seems on that front. And for now, I can browse Google for National Park photos myself. No, for me, the new 3.5 Sonnet is a step forward because, yet again, we have progress on the underlying reasoning front. And I do have some new evidence to back that up. And so while Anthropic were cooking, HeyGen and Runway were also in the kitchen, stirring up AIs you can Zoom chat with and create movie scenes alongside.
Most of this stuff, of course, has only been out a day or at most two. And the first TLDR I'll give is: don't sleep on the new Claude 3.5 Sonnet. Even if you're not interested in an LLM taking over your computer, its reasoning, coding, and visual processing abilities are a step forward. I'll also touch on one of its key weaknesses in just a few minutes. First, of course, the name, and somewhat confusingly, it's still 3.5 Sonnet, just with (new) in brackets. Feels like they could have called it at least 3.6 Sonnet, but for now, that (new) is the giveaway.
It has knowledge of world events up until April of 2024, which is not bad compared to the competition. Naturally, though, most of the attention will be focused on the new Claude 3.5 Sonnet's ability to use a computer. I think it's safe to say that as of today, that won't see broad public adoption because of its unreliability and the long list of things that it can't do, like send emails, make purchases, technically complete CAPTCHAs, or edit or manipulate images. I feel like it's more Anthropic dipping their toe into the water.
And it looks like it's found something great. So how far away is the location from my place? Searching for the distance between my area and the hiking location. So now it looks like Claude is searching for the sunrise time tomorrow and is now dropping that in.
But from what we know so far, how does the new Claude 3.5 Sonnet actually do when given control of a computer? The first benchmark that Anthropic cites in the paper is the OSWorld benchmark. Naturally, I read that paper to get some more context on what the 3.5 model can actually do. It's over 350 tasks covering professional use, office use, and daily use like shopping. One small detail from this paper that I think many analysts might miss is how they derived the human baseline. Not only were these tasks non-trivial, like changing the settings in a set of slides in a presentation; the humans doing them were computer science majors, college students who possess basic software usage skills but have not been exposed to these exact samples or the software before. So when you hear about human performance accuracy on this benchmark being around 72%, that's computer science students, not your average computer user. Or, going back to the Sonnet paper, they're not comparing Claude 3.5 Sonnet to a bunch of everyday computer users.
Anyway, if given 50 steps, Claude 3.5 Sonnet, the new one, gets 22% versus those comp sci students on around 72%. Now I can see someone at the back raising their hand saying, well, the benchmark states it has to be done in 15 steps, and that's where Claude 3.5 Sonnet (new) gets 15%. But what if the humans were also given 50 steps, which is where Claude gets 22%? Still, my only point is that if you were to compare Claude 3.5 Sonnet's performance to literally an average human, someone not particularly versed in computers, I think the results would look rather different. And in software engineering, Claude 3.5 Sonnet would absolutely crush that old biddy.
No, but to be slightly more serious: on SWE-bench Verified, the software engineering benchmark variant that OpenAI itself created, Claude 3.5 Sonnet, the new one, I'm going to have to keep saying new, gets 49%, and that beats O1 Preview. I dug into the O1 paper to get this comparison, where you can see even pre-mitigation O1 Preview gets 38.4% on this software engineering benchmark, and post-mitigation, after safety training, it scores lower still. Now, of course, it's hard to have an apples to apples comparison because it depends on prompting and scaffolding, but you can see that as of today, the new 3.5 Sonnet is the best model around for software engineering.
As you might expect, I'm not just going to rely on published benchmark performance, though. I ran my own benchmark, SimpleBench, on the new 3.5 Sonnet, and I'm going to give you the results a little later in the video. In this chart, though, you can see the comparison between the new Claude 3.5 Sonnet and the original Claude 3.5 Sonnet, as well as, interestingly, GPT-4o, but not O1 Preview.
So what most people want to know is: is it better? If you're asking it a challenging science question, it will do better than the previous version. In coding, as we've seen, it's simply better than the original 3.5 Sonnet. And even in visual question answering, answering questions from tables, charts, and graphs, it's slightly better than the previous version. And you're about to see further results confirming how much of a step forward it is. And so for the vast majority of people, those are the headlines. But for the LLM nerds like me and some of you watching, we probably would have liked to see O1 Preview in those tables too. And that does seem to be an industry trend, where for these benchmark figures, companies pick and choose which rival models they include and which they don't.
Here's just another quick example with the release of the Ministral models, 3B and 8B, from Mistral. That's the small French outfit that is holding the torch for mainland Europe in AI. Anyway, in the small model category, here are the sets of results that they released. This was just a week ago, by the way, and all seems fine and good until a principal researcher at Google DeepMind pointed out that they missed out Gemma 2 9B from DeepMind. Adding that row in makes the results look slightly different. I'm not trying to make this a major point, so I'm going to move on in just a moment.
But just to give you an idea, let's take GPQA, where the new Sonnet gets 65%. O1 Preview gets 78.3% and the full O1, interestingly, slightly down at 78.0%. On the MMMU, which I think is a really great benchmark, again testing whether you can read from graphs, tables, charts, and the like, we have Claude 3.5 Sonnet getting 70.4% and the new O1, which admittedly isn't out yet, scoring higher still. Finally, in a particularly famous high school math competition, we have the new Sonnet getting 16%, O1 Preview getting 56.7%, and O1 getting 83%. I think the new Sonnet is going to be massively underestimated. I think it's actually really good at reasoning. Maybe not these super calculation-heavy, computation-heavy things that O1 can do, just general reasoning.
As you might expect, I've already tested it hundreds of times and I can attest to that. I just wanted to make the simple point that sometimes the model providers choose the comparisons that suit them. On that point, Anthropic say this: "Our evaluation tables exclude OpenAI's O1 model family as they depend on extensive pre-response computation time, unlike the typical models. This fundamental difference in operation makes performance comparisons difficult and outside the scope of this report." Well, one of the leaders on the work on O1, Jerry Tworek, said this: "Being hard to compare against because of fundamental differences is the highest form of flattery."
What about the intriguing Tau-Bench at the bottom? Well, of course I dug into that paper and noticed that one of the authors is a friend of mine. The short version is that it's a benchmark that tests whether an AI agent can do shopping for you or book an airline ticket, but it has some really interesting nuggets. They have a pass to the power of K, not pass at K, pass to the power of K. If you hear something like pass at eight, that means: does a model get it right at least once out of eight attempts? If you hear pass to the power of K, which I think is a brilliant device, that's: did the model get it right all eight times out of eight? It's no good getting it right once out of eight; otherwise, you'll be flying to bloody Mongolia.
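To make that distinction concrete, here is a minimal sketch, using made-up trial outcomes rather than the actual Tau-Bench harness, of how the two metrics score the same eight attempts at one task:

```python
# Minimal sketch of the pass@k vs pass^k distinction (illustrative only,
# not the actual Tau-Bench evaluation code).

def pass_at_k(trial_results: list[bool]) -> bool:
    """pass@k: the task counts as solved if ANY of the k attempts succeeded."""
    return any(trial_results)

def pass_to_the_power_k(trial_results: list[bool]) -> bool:
    """pass^k: the task counts as solved only if ALL k attempts succeeded."""
    return all(trial_results)

# Hypothetical outcomes for one booking task attempted k = 8 times.
trials = [True, True, True, True, True, True, False, True]

print(pass_at_k(trials))            # True  - it got it right at least once
print(pass_to_the_power_k(trials))  # False - one failure ruins the whole trial
```

In other words, pass^k is the metric a real customer cares about: the agent has to get the booking right every single time.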
Oh, and by the way, the company behind this benchmark is now valued at over $4 billion. I believe they're trying to automate customer service, and that reminds me about these valuations. It's a bit like they might only have a 3% or 4% chance of automating customer service, but if they do, and they get the financial rewards for doing so, then they're going to be worth a staggering amount. So for me, when I hear a $4 billion valuation, I don't think they're literally worth $4 billion today. I think, just my opinion of course, it's like a 90% chance they're worth very little or a small amount, but then a 10% chance, or a 4% chance, that they're worth trillions.
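As a back-of-the-envelope illustration of that expected-value logic, with entirely made-up numbers rather than anyone's actual figures:

```python
# Toy expected-value calculation behind the "mostly worth little, small chance
# worth a fortune" framing. All numbers are invented for illustration.
p_fail, value_if_fail = 0.96, 0.1e9    # most likely outcome: worth ~$100M or less
p_win,  value_if_win  = 0.04, 1000e9   # small chance of ~$1T if customer
                                       # service really does get automated
expected_value = p_fail * value_if_fail + p_win * value_if_win
print(f"${expected_value / 1e9:.1f}B")  # ~$40.1B, comfortably above a $4B price tag
```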
Anyway, the tasks are kind of as you'd expect: helping users cancel or modify pending orders, returning delivered orders, modifying user addresses, that kind of thing. And then for the Tau-Airline part of the benchmark, which of course is much higher stakes, the agent has to help users book, modify, or cancel flight reservations. And this is what I meant in the introduction when I said not all is as it seems. For this set of results, I really think it's worth paying attention, because the next 6 to 18 months could be dominated by AI agents and the related benchmarks.
Yes, Claude 3.5 Sonnet, the new one, is significantly better than the previous version, but I admire Anthropic for putting out these results, because they don't always shine the best light on the model. At least as far as computer use goes; again, the reasoning is amazing. Because for the airline tasks, 46% given one try isn't amazing. I think it sums up maybe the next 6 to 18 months. What about pass to the power of K, where K is, say, 8? I just want to restate what that means for extra clarity. To pass to the power of 8, you have to get it right, and then right, and then right, 8 times. One mistake, say on your 7th try, screws up the entire trial.
And so what we get for Sonnet, and I suspect all language models, including O1, is a kind of reverse scaling: as you scale up the number of attempts, performance consistently drops. This, I suspect, is the kind of scaling that currently AI CEOs don't really want you to see. This was for the slightly easier retail task. And if you just ask it once, look, we get 70% for the new Sonnet.
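If the attempts were statistically independent, which is a simplifying assumption and not how the benchmark reports its numbers, you can see why a 70% single-try success rate collapses under pass^k:

```python
# Rough intuition pump: treat each attempt as an independent 70% coin flip.
# Real attempts aren't independent, so these are not the reported Tau-Bench
# figures, just an illustration of why pass^k falls as k grows.
p = 0.70
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {p ** k:.1%}")
# pass^1 = 70.0%, pass^2 = 49.0%, pass^4 = 24.0%, pass^8 = 5.8%
```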
I'm just saying this reminds me that reliability is the single barrier left, I feel, to massive adoption. Speaking specifically about LLMs here: they can, quote, achieve harder and harder tasks, but that won't mean that much until the reliability on basic tasks gets better. Mind you, I don't think that's terribly far away, especially when we scale up test-time compute. Interestingly, the new Sonnet is actually slightly worse at refusals. In other words, it correctly refuses WildChat toxic requests slightly less often, and it incorrectly refuses innocent requests slightly more often than the previous model.
Do you remember from the O1 release how it actually performed slightly worse in creative writing? Well, that's not the case for the new 3.5 Sonnet: it beats the original Claude 3.5 Sonnet at least 58% of the time. Interestingly, on multilingual challenges, it's slightly worse than the previous version.
But what about my own benchmark, SimpleBench? Well, I tried to get crazy ambitious with majority voting and architectures involving self-consistency. But in the end, that just slightly delayed me, so I just want to get it out to you as soon as possible. Myself and the senior ML engineer that I'm working with on SimpleBench did agree that we should probably put out a technical memo, paper, or report, something to indicate our methodology. And that report has taken slightly longer than I expected, which is why I can't release everything just yet. But I can release the provisional leaderboard.
SimpleBench is a test that can be taken by non-specialized humans from any background. It asks questions about how things move in space and time, and also tests social intelligence. We ran the new Claude 3.5 Sonnet this morning and saw a significant step up from the previous version. Notice also we have the latest results from the new Gemini 1.5 Pro, Command R+, that's the new one, and Grok 2, whose API became available fairly recently. And for those of you who have been following the channel for a while, you might have noticed that the human baseline has changed. That's because we expanded the scope out to nine people this time. The human top performance is still around 96%; I think it's 95.7%. We expanded the scope and dropped the requirement to be a graduate. This was nine human beings from truly eclectic backgrounds. The only commonality is that they spoke English natively.
The "average of five" means we ran the benchmark five times in full for each model and averaged the results. You might remember what I was saying about self-consistency. I don't want to get into it here, but that's not the same as just averaging the results. It was very complicated and it actually hurt model performance. That's a story for another day, but I wanted models to shine as much as they could.
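For what it's worth, here is a toy sketch of the difference between the two approaches, with hypothetical per-question answers rather than the real SimpleBench data; note that on this toy data the voting happens to help, whereas in practice, as I said, it hurt:

```python
from collections import Counter

# Five full runs of a three-question benchmark (hypothetical answers).
runs = [
    {"q1": "A", "q2": "C", "q3": "B"},
    {"q1": "A", "q2": "D", "q3": "B"},
    {"q1": "B", "q2": "C", "q3": "B"},
    {"q1": "A", "q2": "C", "q3": "A"},
    {"q1": "A", "q2": "D", "q3": "B"},
]
answer_key = {"q1": "A", "q2": "C", "q3": "B"}

def score(answers: dict[str, str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(answers[q] == answer_key[q] for q in answer_key) / len(answer_key)

# "Average of five": score each run separately, then average the five scores.
average_of_five = sum(score(r) for r in runs) / len(runs)

# Self-consistency: majority-vote one answer per question across the runs,
# then score that single voted answer set.
voted = {q: Counter(r[q] for r in runs).most_common(1)[0][0] for q in answer_key}
self_consistency = score(voted)

print(average_of_five, self_consistency)  # ~0.73 vs 1.0 on this toy data
```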
And some of you may be wondering at this point if I've followed the recent debates in the last couple of weeks about models' reasoning and whether prompting can overcome those limitations. Yes, prompting can boost performance, including to slightly above 50%, but we're still nowhere near the human baseline. I will say, to some of you who are new to the channel, that through prompting and a particular scaffolding, myself and Josh, an ML engineer, came up with SmartGPT in May of last year, which got, at the time, a record 89.0% on the MMLU. So yes, I know that prompting can help, but SimpleBench is a pretty resilient benchmark. These questions were of course vetted multiple times, exhaustively, including by doctoral researchers. And they weren't like the somewhat flippant example I gave a couple of videos ago; I kind of meant that one as a quick illustration, but obviously that backfired. I should probably scroll down, because we did also run GPT-4o mini, and you can see the performance. So yes, in short, even on my own benchmark, I can confirm that the new Sonnet is a marked improvement.
I'm not the only one who has been impressed, of course. And earlier in the video, I showed you the reasons not to be overly hyped, but there are some reasons to be just a little bit excited, or terrified, depending on your perspective. And I can't resist telling you that SimpleBench will be sponsored by Weights & Biases. I am incredibly selective in who I work with on sponsorships. And yes, they are sponsoring this video, and we use Weave for SimpleBench. That's how we could do an incredibly quick turnaround on the new Sonnet results. My own custom link should be showing on screen and will of course be in the description. But don't forget they also have free courses in their AI Academy.
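For context on what that quick turnaround looks like in practice, here is a minimal sketch assuming Weave's standard weave.init / weave.op tracing API; the project name, questions, and helper functions are hypothetical, not the actual SimpleBench harness:

```python
import weave

# Start a Weave project; calls to decorated ops below are traced
# (inputs, outputs, latency) in the Weights & Biases dashboard.
weave.init("simplebench-demo")  # hypothetical project name

@weave.op()
def ask_model(question: str) -> str:
    # Placeholder for the real model API call (Anthropic, OpenAI, etc.).
    return "model answer"

@weave.op()
def score_answer(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()

# Re-running on a new model then just means calling the same traced ops,
# which is what makes a same-day leaderboard update feasible.
for question, expected in [("example question", "model answer")]:
    score_answer(ask_model(question), expected)
```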
Now I can't resist taking a 10-second interlude to point out something you may have heard about already, which is that Claude at one point took a break during a coding demo to peruse photos of Yellowstone National Park. Whether that's misalignment or massively amusing will of course depend on your perspective. But it must be said that Claude isn't the only thing that was dropped yesterday. While this particular development and the other two I'm about to show you might not completely blow your socks off, they do point to the unrelenting improvement in AI-generated content. With Runway's Act-One, you can see the actress driving the performance, showing emotion and expression, and then that being generated by the model into different character animations in all sorts of styles. The lip syncing isn't quite there yet, but nevertheless, here is a 15-second scene.
- You really think you can just walk away from all this?
- What would you have me do, stay, just watch it all crumble around me?
- You can let me know what you think in the comments, but while generations have this AI tinge about them, I wonder if they will struggle to achieve public acceptance. I suspect the impact of this kind of technology in video games will be much greater than in movies and TV, where people want everything to be perfect. Then there's HeyGen's new interactive avatar, and I think this one somewhat flew under the radar. I will admit that the lip syncing, again, is not the best, but did you know you can have a live Zoom call with one of these avatars? Like, literally, you open the tab and it's a Zoom meeting, you log in, and then you're chatting with an AI avatar. Right at the beginning of this year, I predicted that this would happen this year, but I didn't think it would come from someone like HeyGen. I thought it might come from OpenAI, but either way, true Zoom calls with AI models are now here.
Hey Vicky, this is Philip from AI Explained, and I'm actually testing you for a video that's coming out soon. And can you confirm this is a live Zoom call?
- I see, Philip. Just to clarify, this is a simulated role play for conversation practice. However, I'm here to help you navigate any scenarios you want to explore. What specific conversation would you like to role play?
- Well, it's kind of live. I mean, you're not alive, but it's live in the sense of I'm speaking to you now and you're reacting in real time, right?
- Got it, Philip, you're right, I'm reacting in real time. So let's focus on your video content. What scenario do you want to role play that relates to your video?
- I'm just going to demonstrate how I can interrupt you, although you just stopped speaking anyway. But yeah, your lip syncing is very good, but it's slightly stilted; I can tell you're not real.
- I understand, Philip, I'm here to support your demonstration. If you'd like, we can role play a scenario where you provide feedback on that.
- Okay, okay, I'm going to interrupt you now, probably, but can you just give a message to my viewers?
- Sure thing, Philip, just let me know what message you'd like me to convey to your viewers.
- No, I want you, that was weird, I want you to come up with a message for my YouTube subscribers.
- If you're enjoying the content, don't forget to like, subscribe and hit that notification bell.
- So I think you get the idea now, but the final of the three updates on different modalities comes from NotebookLM. Using Gemini 1.5 Pro, it can analyze a source that you feed it, for free, and generate a podcast-style discussion about it between two AI hosts. Now, you guys already knew that because I've covered that before on this channel, but here's what's new. After you choose a file to upload, then after 30 seconds or so, you will see a new button, Customize. And I think this is really cool if you haven't already seen it: you click Customize and you can steer what the hosts focus on. Pick one topic out of a video you feed in, or a PDF. Explain it in a way that's good for this audience or that audience. Basically, customization of a tool that was already pretty cool. Maybe half of you might already know about that capability, but for the other half, do check it out.
Of course, the big news for me remains the release of the new 3.5 Sonnet, and those were my takeaways from all the associated papers. The timing was great because I've just recovered from what could be either COVID or a heavy cold, but I am back in action and really looking forward to fully releasing SimpleBench. Of course, I would love to see you over on Insiders, where, quite topically, there's my 35-minute analysis of the Anthropic CEO's deep dive on a compressed 21st century. Do check it out, but regardless, thank you so much for watching and have a wonderful day.