
The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think


Chapters

0:00 Introduction
0:57 Claude 3.5 Sonnet (New) Paper
2:06 Demo
2:58 OSWorld
4:29 Benchmarks compared + OpenAI Response
8:30 Tau-Bench
13:09 SimpleBench Results
17:05 Yellowstone Detour
17:29 Runway Act-One
18:44 HeyGen Interactive Avatars + Demo
21:06 NotebookLM Update


00:00:00.000 | The new Claude 3.5 Sonnet from Anthropic, their flagship language model, is a significant
00:00:06.840 | step forward.
00:00:07.840 | But wait, no, not particularly because it can use a mouse to perform a basic Google
00:00:12.820 | search.
00:00:13.820 | I'll dive into a couple of papers to show that not all is as it seems on that front.
00:00:19.280 | And for now, I can browse Google for National Park photos myself.
00:00:22.640 | No, for me, the new 3.5 Sonnet is a step forward because, yet again, we have progress on that
00:00:29.840 | nebulous front we call reasoning.
00:00:32.440 | And I do have some new evidence to back that up.
00:00:35.620 | Plus, time waits for no man.
00:00:37.560 | And so while Anthropic were cooking, HeyGen and Runway were also in the kitchen, stirring
00:00:43.480 | up AIs you can Zoom chat with and create movie scenes alongside.
00:00:49.640 | Most of this stuff, of course, has only been out a day or at most two.
00:00:54.260 | But here are my first impressions.
00:00:56.480 | And the first TLDR I'll give is don't sleep on the new Claude 3.5 Sonnet.
00:01:01.440 | Even if you're not interested in an LLM taking over your computer, its reasoning, coding
00:01:07.120 | and visual processing abilities are a step forward.
00:01:10.040 | I'll also touch on one of its key weaknesses in just a few minutes.
00:01:13.360 | First, of course, the name, and somewhat confusingly, it's still 3.5 Sonnet, just with "(new)" in brackets.
00:01:21.080 | Feels like they could have called it at least 3.6 Sonnet, but for now, that "(new)" is the giveaway
00:01:25.440 | you're using a new model.
00:01:27.480 | It has knowledge of world events up until April of 2024, which is not bad compared to
00:01:34.120 | o1-preview's October 2023.
00:01:37.120 | Naturally, though, most of the attention will be focused on the new Claude 3.5 Sonnet's
00:01:42.500 | ability to use your computer via an API.
00:01:45.680 | I think it's safe to say that as of today, that won't see broad public adoption because
00:01:52.240 | of its unreliability and the long list of things that it can't do, like send emails,
00:01:58.040 | make purchases, technically, complete CAPTCHAs, or edit or manipulate images.
00:02:02.200 | I feel like it's more Anthropic dipping their toe into the water.
00:02:05.760 | And it looks like it's found something great.
00:02:10.320 | So how far away is the location from my place?
00:02:12.800 | It's opening maps.
00:02:18.680 | Searching for the distance between my area and the hiking location.
00:02:28.440 | Cool.
00:02:35.200 | So now it looks like Claude is searching for the sunrise time tomorrow and is now dropping
00:02:40.560 | it into my calendar.
00:02:41.760 | But from what we know so far, how does the new Claude 3.5 Sonnet actually do when given
00:02:46.880 | tasks?
00:02:47.880 | The first benchmark that Anthropic cites in the paper is the OS world benchmark.
00:02:52.120 | Naturally, I read that paper to get some more context on what the 3.5 model can actually do.
00:02:59.320 | And I was impressed.
00:03:00.320 | It's over 350 tasks covering professional use, office use and daily use like shopping.
00:03:05.160 | One small detail from this paper that I think many analysts might miss is how they derived
00:03:09.680 | the human average.
00:03:11.080 | Not only were these tasks, like changing the settings in a set of slides in a presentation,
00:03:15.960 | new to both the humans and the models;
00:03:18.120 | look also at which humans they used.
00:03:19.880 | These were computer science majors, college students who possess basic software usage
00:03:24.120 | skills but have not been exposed to these exact samples or the software before.
00:03:28.600 | So when you hear about human performance accuracy on this benchmark being around 72%, that's
00:03:33.480 | a fairly high bar of human performance.
00:03:36.320 | Or going back to the Sonnet paper, they're not comparing Claude 3.5 Sonnet to a bunch
00:03:40.920 | of old biddies.
00:03:42.160 | These are comp science majors.
00:03:43.960 | Anyway, if given 50 steps, Claude 3.5 Sonnet, the new one, gets 22% versus these comp sci
00:03:51.440 | majors getting 72%.
00:03:53.920 | Now I can see someone at the back raising their hand saying, well, the benchmark states
00:03:58.120 | it has to be done in 15 steps and that's where Claude 3.5 Sonnet new gets 15%.
00:04:03.240 | What if the humans were given 50 steps? Well, that's where Claude gets 22%.
00:04:07.160 | But still, my only point is that if you were to compare Claude 3.5 Sonnet's performance
00:04:12.520 | to literally an average human, someone not particularly versed in computers, I think
00:04:17.520 | the delta would be much smaller.
00:04:20.080 | And in software engineering, Claude 3.5 Sonnet would absolutely crush that old biddy.
00:04:24.920 | No, but to be slightly more serious, in the benchmark that OpenAI itself created, SWE
00:04:30.440 | Bench Verified (Software Engineering Benchmark Verified), Claude 3.5 Sonnet, the new one, I'm going
00:04:34.480 | to have to keep saying new, gets 49% and that beats O1 Preview.
00:04:38.800 | I dug into the O1 paper to get this comparison where you can see even pre-mitigation O1 Preview
00:04:45.400 | gets 38.4% in this software engineering benchmark compared to post mitigation after safety training
00:04:52.840 | getting 28%.
00:04:53.840 | Now, of course, it's hard to have an apples to apples comparison because it depends on
00:04:57.980 | prompting and scaffolding, but you can see as of today, the new 3.5 Sonnet is the best
00:05:03.540 | in that benchmark.
00:05:04.840 | As you might expect, I'm not just going to rely on published benchmark performance though,
00:05:09.480 | I ran my own benchmark, SimpleBench on the new 3.5 Sonnet and I'm going to give you the
00:05:14.480 | results in a moment.
00:05:15.760 | In this chart though, you can see the comparison between the new Claude 3.5 Sonnet and the
00:05:21.400 | original Claude 3.5 Sonnet, as well as, interestingly, GPT-4o, but not O1 Preview.
00:05:26.960 | So what most people want to know is, is it better?
00:05:30.520 | Is it smarter?
00:05:31.660 | And the answer there is a resounding yes.
00:05:34.740 | If you're asking it a challenging science question, it will do better than the previous
00:05:39.760 | Claude 3.5 Sonnet.
00:05:41.240 | In general knowledge, again, it knows more.
00:05:43.760 | In coding, as we've seen, it's simply better than the original 3.5 Sonnet.
00:05:48.000 | Likewise, in mathematics, it's a step up.
00:05:51.100 | And even in visual question answering, answering questions from tables, charts, and graphs,
00:05:56.600 | it's slightly better than the previous version.
00:05:59.120 | And you're about to see further results confirming how much of a step forward it is.
00:06:03.400 | So many congratulations to Anthropic.
00:06:05.640 | And so for the vast majority of people, those are the headlines.
00:06:09.400 | But for the LLM nerds like me and some of you watching, we probably would have liked
00:06:13.800 | to see a direct comparison with O1 Preview.
00:06:17.040 | And that does seem to be an industry trend where for these benchmark figures, companies
00:06:21.400 | pick the comparisons they want to make:
00:06:23.880 | which rival models they include and which they don't.
00:06:26.360 | Here's just another quick example with the release of the Ministral models, 3B and
00:06:31.240 | 8B from Mistral.
00:06:33.000 | That's the small French outfit that is holding the torch for mainland Europe in AI.
00:06:37.720 | Anyway, in the small model category, here are the sets of results that they released.
00:06:42.940 | This was just a week ago, by the way, and all seems fine and good until a principal
00:06:47.860 | researcher at Google DeepMind pointed out that they missed out Gemma 2 9B from DeepMind.
00:06:54.680 | Adding that row in makes the results look slightly different.
00:06:58.040 | I'm not trying to make this a major point, so I'm going to move on in just a moment.
00:07:01.640 | But just to give you an idea, let's take GPQA, where the new Sonnet gets 65%.
00:07:07.680 | O1 Preview gets 78.3 and the full O1, interestingly, slightly down 78.0.
00:07:14.880 | On the MMMU, which I think is a really great benchmark, again, testing whether you can
00:07:19.880 | read from graphs, tables, charts and the like.
00:07:23.800 | We have Claude 3.5 Sonnet getting 70.4 and the new O1, which admittedly isn't out yet,
00:07:31.080 | getting 78.2, you can see here.
00:07:34.160 | Finally, in a particularly famous high school math competition, we have the new Sonnet getting
00:07:39.480 | 16%, O1 Preview getting 56.7% and O1 getting 83%.
00:07:45.200 | Let me put it this way.
00:07:46.200 | I think the new Sonnet is going to be massively underestimated.
00:07:50.080 | I think it's actually really good at reasoning.
00:07:52.420 | Maybe not these super calculation heavy, computation heavy things like O1 can do, just general
00:07:57.920 | basic reasoning and creative writing.
00:08:00.440 | As you might expect, I've already tested it hundreds of times and I can attest to that
00:08:03.880 | fact.
00:08:04.880 | I just wanted to make the simple point that sometimes the model providers choose the comparisons
00:08:09.080 | they want to showcase.
00:08:11.080 | On that point, Anthropic say this, "Our evaluation tables exclude OpenAI's O1 model
00:08:16.240 | family as they depend on extensive pre-response computation time, unlike the typical models.
00:08:22.760 | This fundamental difference in operation makes performance comparisons difficult and outside
00:08:27.480 | the scope of this report."
00:08:29.040 | Well, one of the leaders on the work on O1, Jerry Tworek said this, "Being hard to compare
00:08:35.200 | against because of fundamental differences is the highest form of flattery."
00:08:39.320 | But here's a new one.
00:08:40.920 | What about the intriguing Tau-Bench at the bottom?
00:08:44.120 | That's new.
00:08:45.120 | Retail?
00:08:46.120 | Airline?
00:08:47.120 | Well, of course I dug into that paper and noticed that one of the authors is my friend
00:08:51.080 | Noah Shinn.
00:08:52.080 | So that's cool.
00:08:53.080 | The short version is that it's a benchmark that tests if an AI agent can do shopping
00:08:57.320 | for you or book an airline ticket, but it has some really interesting nuggets.
00:09:01.160 | They have a pass to the power of k metric, not pass at k; pass^k, not pass@k.
00:09:07.960 | So what's the difference there?
00:09:09.300 | If you hear something like pass at eight, that means: does the model get it right at least once
00:09:14.800 | in eight attempts?
00:09:16.240 | If you hear pass to the power of eight, which I think is a brilliant device, that asks: did
00:09:21.280 | it pass all eight times?
00:09:23.640 | For agent tasks, of course that's crucial.
00:09:26.240 | It's no good getting it right once out of eight, then you'll be flying to bloody Mongolia.
00:09:30.760 | It has to be right every single time.
00:09:32.560 | Oh, and by the way, the company behind this benchmark is now valued at over 4 billion
00:09:37.000 | just casually.
00:09:38.000 | I believe they're trying to automate customer service, and that reminds me about these valuations.
00:09:41.740 | It's a bit like they might only have a 3% chance or 4% chance of automating customer
00:09:47.080 | service, but if they do and they get the financial rewards for doing so, then they're going to
00:09:51.880 | be worth like a trillion.
00:09:53.280 | So for me, when I hear a $4 billion valuation, I don't think they're literally worth 4
00:09:57.440 | billion.
00:09:58.440 | I think, just my opinion of course, I think it's like 90% chance they're worth very
00:10:03.380 | little or a small amount, but then a 10% chance or 4% chance they're worth trillions.
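To spell out the back-of-the-envelope arithmetic behind that intuition, here is a sketch with entirely made-up probabilities and payoffs, not real figures for the company:

```python
# Entirely hypothetical numbers, purely to illustrate probability-weighted value.
outcomes = [
    (0.96, 0.1e9),   # most likely outcome: worth very little
    (0.04, 100e9),   # small chance: customer service largely automated
]
expected_value = sum(p * v for p, v in outcomes)
print(f"${expected_value / 1e9:.1f}B")  # ~$4.1B
```

On this kind of reading, a headline valuation is mostly a bet on the tail outcome, not a claim that the business is worth that much today.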
00:10:08.720 | Anyway, the tasks are kind of as you'd expect, it will be helping users cancel or modify
00:10:14.360 | pending orders, returning delivered orders, modifying user addresses, that kind of thing
00:10:19.080 | for the retail benchmark.
00:10:20.780 | And then for the Tau-Bench airline part of the benchmark, which of course is much higher stakes, the
00:10:25.520 | agent has to help users book, modify or cancel flight reservations.
00:10:29.920 | And here are the headline results.
00:10:32.440 | And this is what I meant in the introduction when I said, not all is as it seems.
00:10:37.200 | For this set of results, I really think it's worth paying attention, because the next 6
00:10:42.000 | months, 18 months could be dominated by AI agents and the related benchmarks.
00:10:46.800 | Do you get it right in one try?
00:10:49.440 | Yes, Claude 3.5 Sonnet, the new one, is significantly better than the previous version, but I admire
00:10:55.760 | Anthropic for putting out these results because they don't always shine the best light on
00:11:00.120 | the new Sonnet.
00:11:01.120 | At least for computer use, that is; again, the reasoning is amazing.
00:11:04.680 | Because for the airline tasks, 46% given one try isn't amazing.
00:11:10.780 | And here is the most important chart.
00:11:13.840 | I think it sums up maybe the next 6 to 18 months.
00:11:17.480 | What about pass to the power of K, where K is, say, 8?
00:11:21.520 | I just want to restate what that means for extra clarity.
00:11:24.440 | To pass to the power of 8, you have to get it right and then right and then right 8 times.
00:11:30.000 | One mistake once on your 7th try screws up the entire trial.
00:11:34.640 | And so what we get for Sonnet, and I suspect all language models, including O1, is a reverse
00:11:40.480 | scaling law.
00:11:41.520 | As you scale up the number of attempts that all have to succeed, performance consistently drops.
00:11:46.320 | This I suspect is the kind of scaling that currently AI CEOs don't really want you to
00:11:51.400 | focus on.
00:11:52.400 | This was for the slightly easier retail task.
00:11:55.640 | And if you just ask it once, look, we get 70% for the new Sonnet.
00:11:59.760 | But then 8 times, what is that, about 40%.
00:12:03.120 | And it's still going down, of course.
00:12:04.920 | Imagine this was 100 tries or 100 customers.
00:12:07.760 | I'm just saying this reminds me that reliability is the single barrier left, I feel, to massive
00:12:13.720 | economic impact from AI.
00:12:15.900 | Speaking specifically about LLMs here, they can, quote, achieve harder and harder tasks,
00:12:21.400 | like getting 80% in the GPQA.
00:12:23.720 | But that won't mean that much until the reliability on basic tasks gets better.
00:12:29.920 | Mind you, I don't think that's terribly far away, especially when we scale up test time
00:12:33.760 | compute.
00:12:34.760 | But that's a story for another day.
00:12:36.520 | Another quick one.
00:12:37.520 | Interestingly, the new Sonnet is actually slightly worse at refusals.
00:12:40.960 | In other words, it correctly refuses WildChat toxic requests slightly less often.
00:12:47.020 | It incorrectly refuses innocent requests slightly more often than the previous model.
00:12:52.460 | Not super dramatically.
00:12:53.460 | I just thought that was interesting to note.
00:12:55.260 | Do you remember from the o1 release how it actually performed slightly worse in creative
00:13:00.540 | writing compared to GPT-4o?
00:13:02.740 | Well, that's not the case for the new 3.5 Sonnet.
00:13:07.180 | It beats the original Claude 3.5 Sonnet at least 58% of the time.
00:13:12.420 | Interestingly, on multilingual challenges, it's slightly worse than the previous version
00:13:17.180 | of 3.5 Sonnet.
00:13:18.620 | But what about my own benchmark, SimpleBench?
00:13:20.740 | Well, I tried to get crazy ambitious with majority voting and architectures involving
00:13:26.900 | self-consistency.
00:13:27.900 | But in the end, that just slightly delayed me.
00:13:31.220 | So I just want to get it out to you as soon as possible.
00:13:34.020 | Myself and the senior ML engineer that I'm working with on SimpleBench did agree that
00:13:38.540 | we should probably put out a technical memo, paper or report, something to indicate our
00:13:43.900 | thinking on that matter.
00:13:45.300 | And that report has taken slightly longer than I expected, which is why I can't just
00:13:50.100 | as of today release this website.
00:13:52.220 | But I can release the provisional leaderboard.
00:13:55.700 | SimpleBench is a test that can be taken by non-specialized humans from any background
00:14:00.500 | as long as they can speak English.
00:14:01.820 | It asks questions about how things move in space and time and also tests social intelligence.
00:14:08.060 | We ran the new Claude 3.5 Sonnet this morning and saw a significant step up from the previous
00:14:14.180 | version.
00:14:15.180 | Notice also we have the latest results from the new Gemini 1.5 Pro.
00:14:19.100 | And where is it?
00:14:20.100 | Command R+, that's the new one, and Grok 2, whose API became available fairly recently.
00:14:24.900 | And for those of you who have been following the channel for a while, you might have noticed
00:14:28.620 | that the human baseline has dropped a bit.
00:14:30.780 | That's because we expanded the scope out to nine people this time.
00:14:34.820 | The human top performance is still around 96%, I think it's 95.7%.
00:14:40.300 | We expanded the scope and dropped the requirement to be a graduate.
00:14:44.300 | This was nine human beings from truly eclectic backgrounds.
00:14:47.660 | The only commonality is that they spoke English natively.
00:14:50.740 | That's the average we got, 83.7%.
00:14:53.580 | The average of five means we ran the benchmark five times in full for the models and averaged
00:15:00.340 | the results.
00:15:01.340 | You might remember what I was saying about self consistency.
00:15:03.700 | I don't want to get into it here, but that's not the same as just averaging the results.
00:15:08.540 | It was very complicated and it actually hurt model performance.
00:15:11.840 | That's a story for another day, but I wanted models to shine as much as they could.
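For clarity on what the average of five does and doesn't mean, here is a minimal, hypothetical sketch, not the actual SimpleBench harness, contrasting it with per-question majority voting:

```python
# Hypothetical sketch, not the SimpleBench code.
# runs[r][q] is True if the model got question q right on full run r.
runs = [
    [True, False, True,  True,  False],
    [True, False, False, True,  True ],
    [True, True,  False, True,  False],
    [True, False, False, True,  False],
    [True, False, True,  True,  False],
]
n_runs, n_questions = len(runs), len(runs[0])

# "Average of five": score each full run, then average the run scores.
average_of_runs = sum(sum(run) / n_questions for run in runs) / n_runs

# Majority voting: a question counts if it was right in most of the runs
# (a simplified stand-in for self-consistency-style aggregation).
majority_vote = sum(
    sum(runs[r][q] for r in range(n_runs)) > n_runs / 2
    for q in range(n_questions)
) / n_questions

print(average_of_runs)  # ≈ 0.56
print(majority_vote)    # 0.4
```

The two numbers answer different questions; the provisional leaderboard reports the straight average of five full runs.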
00:15:15.880 | And some of you may be wondering at this point whether I've followed the recent debates in the last
00:15:20.460 | couple of weeks about models' reasoning and whether prompting can overcome those limitations.
00:15:25.460 | Yes, it can boost performance, including to slightly above 50%, but we're still nowhere
00:15:31.040 | near to human baseline.
00:15:32.700 | And that's with an optimized prompt.
00:15:34.280 | I will say to some of you who are new to the channel that through prompting and a particular
00:15:38.460 | scaffolding, myself and Josh, an ML engineer, came up with SmartGPT in May of last year,
00:15:44.140 | which got, at the time, a record 89.0% on the MMLU.
00:15:48.880 | So yes, I know that prompting can help, but SimpleBench is a pretty resilient benchmark.
00:15:54.180 | These questions were of course vetted multiple times, exhaustively, including by doctoral
00:15:59.140 | level researchers.
00:16:00.500 | And they weren't like the somewhat flippant example I gave a couple of videos ago about
00:16:05.340 | a table being tilted.
00:16:06.620 | I kind of meant that one as a quick illustration, but obviously that backfired.
00:16:10.580 | So apologies for the confusion.
00:16:12.260 | I should probably scroll down because we did also do GPT-4o mini and you can see the performance.
00:16:17.500 | So yes, in short, even in my own benchmark, I can confirm that the new Sonnet is a marked
00:16:22.660 | improvement.
00:16:23.660 | I'm not the only one who has been impressed, of course.
00:16:26.220 | And earlier in the video, I showed you the reasons not to be overly hyped, but there
00:16:29.900 | are some reasons to be just a little bit excited or terrified, depending on your perspective.
00:16:35.020 | And I can't resist telling you that SimpleBench will be sponsored by Weights & Biases.
00:16:40.740 | And that isn't a random sponsorship.
00:16:42.420 | I am incredibly selective in who I work with on sponsorships.
00:16:46.500 | And yes, they are sponsoring this video and we use Weave for Simple Bench.
00:16:51.280 | That's how we could do an incredibly quick turnaround on the new Sonnet results.
00:16:55.900 | My own custom link should be showing on screen and will of course be in the description.
00:16:59.780 | But don't forget they also have free courses in their AI Academy.
00:17:03.220 | Now I can't resist taking a 10 second interlude to point out something you may have heard
00:17:07.760 | about already, which is that Claude at one point took a break during a coding demo to
00:17:13.540 | peruse photos of the Yellowstone National Park.
00:17:17.180 | Whether that's misalignment or massively amusing will of course depend on your perspective.
00:17:22.480 | But it must be said that Claude isn't the only thing that was dropped yesterday.
00:17:27.300 | Here's Act-One from Runway ML.
00:17:30.320 | While this particular development and the other two I'm about to show you might not
00:17:33.640 | completely blow your socks off, they do point to the unrelenting improvement in AI-generated
00:17:40.420 | entertainment that's coming.
00:17:41.920 | With Runway's Act-One, you can see the actress driving the performance, showing emotion and
00:17:48.420 | creativity in her delivery,
00:17:50.300 | and then the model turning that into different character animations in all
00:17:54.820 | sorts of styles.
00:17:56.180 | The lip syncing isn't quite there yet, but nevertheless, here is a 15 second scene.
00:18:00.820 | - You really think you can just walk away from all this?
00:18:07.100 | - I have to.
00:18:09.220 | There's no other choice.
00:18:11.260 | - There's always a choice.
00:18:13.080 | You just don't like the alternatives.
00:18:15.080 | - What would you have me do, stay, just watch it all crumble around me?
00:18:20.100 | - That's what you signed up for.
00:18:21.780 | - You can let me know what you think in the comments, but while generations have this
00:18:26.120 | AI tinge about them, I wonder if they will struggle to achieve public acceptance.
00:18:32.040 | I suspect the impact of this kind of technology in video games will be much greater than in
00:18:37.200 | movies and TV, where people want everything to be perfect.
00:18:41.280 | Then there's HeyGen's new interactive avatar, and I think this one somewhat flew under the
00:18:45.960 | radar.
00:18:46.960 | I will admit that the lip syncing again is not the best, but did you know you can have
00:18:51.760 | a Zoom call with one of these avatars?
00:18:54.600 | Like literally you open the tab and it's a Zoom meeting, you log in and then you're chatting
00:18:59.280 | to this AI model.
00:19:00.720 | Right at the beginning of this year, I predicted that this would happen this year, but I didn't
00:19:05.200 | think it would come from someone like HeyGen.
00:19:07.520 | I thought it might come from OpenAI, but either way, true Zoom calls with AI models is now
00:19:12.820 | a reality.
00:19:13.820 | Hey Vicky, this is Philip from AI Explained, and I'm actually testing you for a video that's
00:19:19.880 | going to go out on YouTube.
00:19:22.360 | And can you confirm this is a live Zoom call?
00:19:29.120 | - I see Philip, just to clarify, this is a simulated role play for conversation practice,
00:19:33.560 | not a live Zoom call.
00:19:35.080 | However, I'm here to help you navigate any scenarios you want to explore.
00:19:39.440 | What specific conversation would you like to role play?
00:19:42.320 | - Well, it's kind of live, I mean, you're not alive, but it's live in the sense of I'm
00:19:47.040 | speaking to you now and you're reacting in real time, right?
00:19:53.440 | - Got it, Philip, you're right, I'm reacting in real time.
00:19:56.220 | So let's focus on your video content, what scenario do you want to role play that relates
00:20:00.320 | to your project?
00:20:01.320 | - I'm just going to demonstrate how I can interrupt you, although you just stopped speaking
00:20:04.640 | immediately.
00:20:05.640 | But yeah, your lip syncing is very good, but it's slightly stilted, I can tell you're not
00:20:10.480 | real.
00:20:11.480 | It's not super easy, but I can tell.
00:20:17.240 | - I understand, Philip, I'm here to support your demonstration.
00:20:20.160 | If you'd like, we can role play a scenario where you provide feedback on that.
00:20:22.800 | - Okay, okay, I'm going to interrupt you now, probably, but can you just give a message
00:20:27.280 | to my viewers?
00:20:32.160 | - Sure thing, Philip, just let me know what message you'd like me to convey to your viewers.
00:20:35.920 | - No, I want you--
00:20:36.920 | - I'm ready when you are.
00:20:37.920 | - No, I want you, that was weird, I want you to come up with a message for my YouTube subscribers.
00:20:48.000 | - Got it, here's the message.
00:20:49.080 | Hey everyone, thanks for tuning in.
00:20:50.960 | If you're enjoying the content, don't forget to like, subscribe and hit that notification
00:20:55.240 | bell.
00:20:56.240 | - Yeah, guys.
00:20:57.240 | Do it.
00:20:58.240 | Great message, thank you, Vicky.
00:20:59.240 | - I didn't catch that.
00:21:00.240 | Can you still see me?
00:21:01.240 | - So I think you get the idea now, but the final of the three updates on different modalities
00:21:09.240 | comes from, of course, NotebookLM.
00:21:11.440 | Using Gemini 1.5 Pro, it can analyze a source that you feed it, for free, and generate, through
00:21:17.800 | text-to-speech, an engaging podcast.
00:21:20.320 | Now, you guys already knew that because I've covered that before on this channel, but here's
00:21:23.720 | something you might not know.
00:21:25.400 | After you choose a file to upload, then after 30 seconds or so, you will see a new button,
00:21:31.480 | Customize, not just Generate.
00:21:32.720 | And I think this is really cool if you haven't already seen it, but you click Customize and
00:21:36.480 | then you can focus the podcast conversation.
00:21:39.320 | Pick one topic out of a video you feed in or a PDF.
00:21:43.040 | Focus on this and not on that.
00:21:44.980 | Explain it in a way that's good for this audience or that audience.
00:21:48.800 | Basically customization of a tool that was already pretty cool.
00:21:52.480 | Maybe half of you might already know about that capability, but for the other half, do
00:21:56.280 | check out NotebookLM.
00:21:57.860 | Of course, the big news for me remains the release of the new 3.5 Sonnet.
00:22:02.640 | And those were my takeaways from all the associated papers.
00:22:06.280 | The timing was great because I've just recovered from what could be either COVID or a heavy
00:22:10.760 | cold, but I am back in action and really looking forward to fully releasing SimpleBench.
00:22:16.720 | Of course, I'd love to see you over on Insiders and, quite topically on Anthropic,
00:22:21.080 | here's my 35-minute analysis of their CEO's deep dive on a compressed 21st century.
00:22:29.080 | Do check it out, but regardless, thank you so much for watching and have a wonderful day.