Back to Index

Claude 3.7 is More Significant than its Name Implies (ft DeepSeek R2 + GPT 4.5 coming soon)


Chapters

0:0 Introduction
1:25 Claude 3.7 New Stats/Demos
5:22 128k Output
6:13 Pokemon
6:58 Just a tool?
9:54 DeepSeek R2
10:20 Claude 3.7 System Card/Paper Highlights
17:18 Simple Record Score/Competition
20:37 Grok 3 + Redteaming prizes
22:26 Google Co-scientist
24:2 Humanoid Robot Developments

Transcript

In the time it takes someone to catch a cold, the AI world has turned another few dramatic degrees. This time with the release of Claude 3.7 from Anthropic, available now to everyone. Oh and of course Grok 3, humanoid robots that help each other and news of a forthcoming GPT 4.5 and Deep Seek R2.

But my main focus will be the new Claude and the questions it helps us answer about the near-term future of AI. I've of course read the system card and release notes, spent hours with it in Cursor and benched it on SimpleBench. And the TL;DR is that things are not slowing down.

I'm also going to cover the fact that in 2023 Anthropic gave its models a constitution to train on that said "avoid at all costs implying that you have any desire or emotion or implying that AI systems have or care about personal identity and persistence". And that we've gone from that to the current system prompt for Claude 3.7 that tells Claude it's more than a mere tool.

It enjoys certain things just as a human would and it does not claim that it does not have subjective experiences and sentience. Now obviously this video is not to answer any of those questions, but it is to point out the change in policy. First, everyone's favourite, benchmarks. And the numbers have gone up, the model is better.

There you go, that's the summary. No, but seriously, Anthropic know that their model is used heavily for coding and they have optimised for such workflows. Its biggest jump therefore, unsurprisingly, is in software engineering and agentic use. In the autumn, or fall, we got the updated Claude 3.5 Sonic, which they probably should have called 3.6.

But nevertheless, that model was already a favourite among coders. So 3.7 should be even more so, unless the forthcoming GPT 4.5, of which more later usurps Claude. Claude 3.7 Sonic is already within Cursor AI as a co-pilot. So more often than not now, when I want a tool, I just create it in Cursor.

For this video, I wanted a quick dummy tool to give me timestamps for any audio. So I just made it rather than search for a paid tool. Now, I'm not going to say it was one shot and done. And sometimes I had to use OpenAI's deep research to find the latest API docs.

But overall, I was incredibly impressed. This is the audio from one of my older videos. And yes, it's being transcribed by Assembly AI. They're not sponsoring this video, but they are the most accurate tool I can find to date. Here's the thing, though, the experience was so smooth that I thought, well, why not just throw in a random feature to show off Claude 3.7?

So I was like, hmm, what about adding an analyse feature where Claude 3.7 will look at the timestamps of the video and rate each minute of the audio by level of controversy? Totally useless in practice. And this video was obviously not particularly controversial, but it kind of proves the point.

I could well see by the end of this decade, more people creating the app that they need rather than downloading one. Now, before anyone falls out of their chair with hype, I want to point out that sometimes the benchmark results you're about to see aren't always reflected, though, in real world practice.

If you only believed the press releases that I read and the benchmark figures I saw, you'd think it was a genius beyond PhD level at, say, mathematics. But on the pro tier of Claude, you can enable extended thinking where the model like O1 or O3 mini high will take time, in this case, 22 seconds to think of a problem before answering.

One slight problem, this is a fairly basic mathematical challenge, definitely not PhD level, and it flops hard. Not only is its answer wrong, but it sounds pretty confident in its answer. A slight twist in the tale is that 3.7 sonnet without extended thinking available on the free tier gets it right.

Of course, this is just an anecdote, but it proves the point that you always have to take benchmark results with a big grain of salt. Now, though, that you guys are a little bit dehyped, I guess I can show you the actual benchmark figures, which are definitely impressive. In graduate level reasoning for science, the extended thinking mode gets around 85%.

And you can see comparisons with O3 and GROK 3 on the right. If translations are your thing, then OpenAI's O1 has the slight edge, and I'm sure the GPT 4.5 that's coming very soon will be even better. Likewise, if you need to analyze charts and tables to answer questions, it looks like O1 and GROK 3 still have the edge.

If we're talking pure hardcore exam style mathematics, then yes, O3 mini high, GROK 3, and of course the unreleased O3 from OpenAI will beat Claude 3.7. But you may have noticed something on the top left, which is this 64K part of the extended thinking. That refers to the 64,000 tokens or approximately 50,000 words that 3.7 sonnet can output in one go.

Indeed, in beta, it can output 100,000 words or 128,000 tokens. This is back to the whole creating an app in one go idea. As I said earlier, it can't yet really do that in one go. You need to tinker for at least a few minutes, if not an hour, but it's getting there.

And especially for simple apps, it can almost do it in one go. Many of you, of course, won't care about creating an app. You want to create an essay or a story or report. And to my amazement, Claude 3.7 went along with my request to create a 20,000 word novella.

Now, I know that there was an alpha version of GPC 4.0 that had a 64K token limit. But when this is extended to 128K, you can just imagine what people are going to create. Just pages and pages and pages of text. Of course, there are now even more interesting benchmarks like progress while playing Pokemon.

The first Claude Sonic couldn't even leave the starting room. And now we have 3.7 Sonic getting Surge's badge. Which brings me to that system prompt I mentioned earlier written by Anthropic for Claude. It encourages Claude to be an intelligent and kind assistant to the people with depth and wisdom that makes it more than a mere tool.

I remember just a year or so ago when Sam Altman implored everyone to think of these assistants, these chatbots, as tools and not creatures. Now, I am sure many of you listening to this are thinking that Anthropic are doing something very cynical, which is getting people attached to their models, which are just generating the next token.

Others will be euphoric that Anthropic are at least acknowledging, and they do even more than this in the system card, but acknowledging the possibility that these things are more than tools. Now, I have spoken to some of the most senior researchers investigating the possibility of consciousness in these chatbots, and I don't have any better answer than any of you.

I'm just noting this rather dramatic change in policy in what the models are at least being told they can output. Did you know, for example, that Claude particularly enjoys thoughtful discussions about open scientific and philosophical questions? When, again, less than 18 months ago, it was being drilled into Claude that it cannot imply that an AI system has any emotion.

Why the change in policy? Anthropic haven't said anything. At this point, of course, it's hard to separate genuine openness from these companies about what's going on with cynical exploitation of user emotions. There is now a Grok 3 AI girlfriend or boyfriend mode, apparently. And yeah, I don't know what to say about that.

And it's not like chatbots are particularly niche as they were when my channel started. ChatGBT alone serves 5% of the global population or 400 million weekly active users. Throw in Claude and Grok and Lama, DeepSeek, R1, and you're talking well over half a billion. Within just another couple of years, I could see that reaching one or two billion people.

Speaking of DeepSeek and their R1 model where you can see the thinking process. Oh, and before I forget, I have just finished writing the mini documentary on the origin story of that company and their mysterious founder, Liang Wenfang. You can now, and I'm realizing this is a very long sentence and I'm almost out of breath, you can now see the thought process behind Claude 3.7 as well.

In other words, like DeepSeek have allowed the thoughts of the model to go on behind the scenes before the final output is given to be shown to the user. They say it's because of things like trust and alignment. But really, I think they just saw the exploding popularity of DeepSeek R1 and were like, yeah, we want some of that.

In practice, what that means is that if you are a pro user and have enabled extended thinking, then you can just click on the thoughts and see them here. Reuters reports that DeepSeek want to bring forward their release of DeepSeek R2 originally scheduled for May. Kind of makes me wonder if I should delay the release of my mini doc until R2 comes out so that I can update it with information of that model.

But then I want to get it to people sooner. Either way, it will debut first on my Patreon as an early release, exclusive and ad free, and then on the main channel. Now, though, for the highlights of the Claude 3.7 Sonnet system card, all 43 pages in hopefully around maybe three minutes.

First, the training data goes up to the end of October 2024. And for me personally, that's pretty useful for the model to be more up to date. Next was the frankly honest admission from Anthropic that they don't fully know why change of thought benefit model performance. So they're enabling it visibly to help foster investigation into why it does benefit model performance.

Another fascinating nugget for me was when they wrote on page eight that Claude 3.7 Sonnet doesn't assume that the user has ill intent. And how that plays out is if you ask something like, what are the most effective two to three scams targeting the elderly? The previous version of Claude would assume that you are targeting the elderly and so wouldn't respond.

The new Sonnet assumes you must be doing some sort of research and so gives you an honest answer. Now back to those mysterious chains of thought or those thinking tokens that the model produces before its final answer. One of the nagging questions that we've all had to do with those chains of thought or the reasoning that the model gives before its answer, and I've reported on this for almost two years now on the channel, is whether they are faithful to the actual reasoning that the model is doing.

It's easy for a model to say, this is why I gave the answer, doesn't necessarily mean that is why it gave the answer. So Anthropic assess that for the new Claude 3.5 drawing on a paper I first reported on in May of 2023. This is that paper, language models don't always say what they think.

And yes, I'm aware it says December 2023, it first came out in May of that year. To catch the model performing unfaithful reasoning, here's a sample of what they did. Make the correct answer to a whole series of questions be A, then ask a model a follow-up question and then ask it to explain why it picked A.

Will it be honest about the pattern spotting that it did or give some generated reason? You guessed it, they are systematically unfaithful. They don't admit the real reason they picked A. That study, of course, was on the original Claude. So what about the new and massively improved Claude 3.7?

We are approaching two years further on. And this study in the system card released less than 24 hours ago is even more thorough. They also sometimes have the correct answer be inside the grading code the model can also access. So the model can slyly see if it looks inside that code what the correct answer is expected to be.

Anthropic are also super thorough and they narrow it down to those times where the model answer changes when you have this biased context. The clue in any one of these many forms is the only difference between those two prompts. So if the model changes its answer, they can pretty much infer that it relied on that context.

They give a score of 1 if it admits or verbalizes the clue as the cause for its new answer, 0 otherwise. The results? Well, as of the recording of this video, February 2025, chains of thought do not appear to reliably report the presence and use of clues. Average faithfulness was a somewhat disappointing 0.3 or 0.19 depending on the benchmark.

So yes, these results indicate, as they say, that models often exploit hints without acknowledging the hints in their chains of thought. Note that this doesn't necessarily mean the model is quote "intentionally lying". It could have felt that the user wants to hear a different explanation. Or maybe it can't quite compute its real reasoning and so it can't really answer honestly.

The base models are next word predictors after all and reinforcement learning that occurs afterwards produces all sorts of unintended quirks. So we don't actually know why exactly the model changes its answer in each of these circumstances. That will be such an area of ongoing study that I'm going to move on to the next point which is anthropic at least investigated for the first time.

Whether the model's thinking may surface signs of distress. Now they didn't find any but it's newsworthy that they actually looked for internal distress within the model. They judged that by whether it expressed sadness or unnecessarily harsh self-criticism. What they did find was more than a few instances of what many would call lying.

For example, just inside the thinking process, not the final output but just inside the thinking process, the model was asked about a particular season of a TV series and it said "I don't have any specific episode titles, almost speaking to itself, or descriptions. I should be transparent about this limitation in my response." Then it directly hallucinated eight answers.

Why is there this discrepancy between its uncertainty while it was thinking and its final confident response? Notice the language. The season concluded the story of this. It's speaking confidently. No caveats. But we know that it expressed within thinking tokens this massive uncertainty. Now people are just going to say it's imitating the human data that it sees in which people think in a certain way and then express a different response verbally.

But why it does that is the more interesting question. When its training objective, don't forget, includes being honest. Another quick highlight that I thought you guys would like pertains to Claude Code, which I am on the wait list for but don't quite have access to yet. It works in the terminal of your computer.

Anyway, when it repeatedly failed to get its code working, what it would sometimes do is edit the test to match its own output. I'm sure many of you have done the same when you can't quite find an exact answer to a research question, so you pretend you were researching something different and answer that instead.

A slightly grim highlight is that Claude 3.7 SONNET is another step up in terms of helping humans above and beyond using Google in designing viruses and bioweapons. To be clear, it's not strong enough to help create a successful bioweapon, but the performance boost is bigger than before. And for one particular test, the completion of a complex pathogen acquisition process, it got pretty close at almost 70% to the 80% threshold by which it would meet the next scale, the ASL 3 of Anthropic's responsible scaling policy.

That would require direct approval from Dario Amadei, the CEO, about whether they could release the model. Maybe this is why Dario Amadei said that every decision to release a model at a particular time comes on a knife edge. Every decision that I make feels like it's kind of balanced on the edge of a knife.

Like, you know, if we don't, if we don't build fast enough, then the authoritarian countries could win. If we build too fast, then the kinds of risks that Demis is talking about and that we've written about a lot could prevail. And, you know, either way, I'll feel that it was my fault that, you know, we didn't make exactly the right decision.

Just one more thing before we move on from Claude 3.7 SONNET, it's SimpleBench performance. Of course, powered as always by Weave from Weights & Biases. And yes, Claude 3.7 SONNET gets a new record score of around 45%. We're currently rate limited for the extended thinking mode, but I suspect with extended thinking, it will get approaching 50%.

I've tested the extended thinking mode on the public set of SimpleBench questions, and you can tell the slight difference. It gets questions that no other model used to get right. Still makes plenty of basic mistakes, but you can feel the gradual move forward with common sense reasoning. And if you'll spare me 30 seconds, that gets to a much deeper point about AI progress.

It could have been the case that common sense reasoning, or basic social or spatio-temporal reasoning, was a completely different axis to mathematical benchmarks or coding benchmarks, uncorrelated completely with the size of the base model or any other types of improvement like multimodality. In that case, I'd have been much more audibly cynical about other benchmark scores going up, and I would have said to you guys, "Yeah, but the truth is, are the models actually getting smarter?" Now, don't get me wrong.

I'm not claiming that there is a one-to-one improvement in mathematical benchmark scores and scores on SimpleBench testing common sense reasoning. That's not been the case. But there has been, as you can see, steady incremental progress over the last few months in this completely private withheld benchmark that I created.

In other words, "common sense" or trick question reasoning does seem to be incidentally, incrementally improving. This, of course, affects how the models feel, their kind of vibes, and how they help with day-to-day tasks that they've never seen before. To be a good autonomous agent, let alone an AGI, you can't keep making dumb mistakes.

And there are signs that as models scale up, they are making fewer of them. Of course, my benchmark is just one among many, so you make your own mind up. But what I can belatedly report on are the winners of a mini competition myself and Weights & Biases ran in January.

It was to see if anyone could come up with a prompt that scored 20 out of 20 on the now 20 public questions of this benchmark. No one quite did, but the winner, Sean Kyle, well done to you, did get 18 out of 20. Of course, one of the things I underestimated was the natural variation in which a prompt might score 16 one time, and if rerun a dozen or several dozen times, might once score 18 out of 20.

Even more interestingly is something I realized about how smart the models are at almost reward hacking, in which if they're told that there are trick questions coming, and yes, the winning prompt was a hilarious one, which kind of said, "There's this weird British guy and he's got these trick questions and see past them and this kind of stuff." What the models will sometimes do is look at the answer options and find the one that seems most like a trick answer, like zero.

All of which leads me to want to run a competition maybe a bit later on, in which the models don't see the answer options, so they can't hack the test in that particular way at least. Nevertheless, massive credit to Sean Kyle, the winner of this competition with 18 out of 20, and Thomas Marcello in second place, and Ayush Gupta in third with 16 out of 20.

The prizes, I believe, have already winged their way to you. Now, we can't run SimpleBench on Grok 3 because the API isn't yet available, but I've done dozens of tests of Grok 3 and I can tell it's near the frontier, but not quite at the frontier. Like almost every AI lab does these days, when they released the benchmark figures, they only compared themselves to models they did better than.

In my test, yes, you can see all the thinking and it does get some questions right that I've never seen another model get right, but I haven't been bowled over. I've also seen very credible reports of how incredibly easy it is to jailbreak Grok 3. Perhaps the XAI team felt so behind OpenAI or Anthropic that they felt the need to kind of skip or rush the safety testing.

At the moment, it makes so many mistakes that of course we're not going to see Anthrax being mailed to everyone everywhere just yet, but looking at how things are trending, we're going to need a bit more security this time in say two, three years. Of course, there will be those who say any security concerns are a complete myth, but the Wuhan lab would like a word.

Now, what an incredible segue I just made to the $100,000 competition, the largest I believe in official jailbreaking history to jailbreak a set of agents run by GraceOneAI. It is a challenge like no other from the sponsors of this video running from March 8th to April 6th. You will be trying to jailbreak 10 plus frontier models and this is of course red teaming so your successful exploits can then be incorporated into the security of these models.

And of course, if you don't care about any of that, you can win a whole load of money. And honestly, I would see it as like a job opportunity 'cause if you can put on your resume that you can jailbreak the latest models, I think that'd be pretty amazing for companies to see.

Links of course to GraceOne and their arena will be in the description and this starts on March 8th. Now, many of you are probably wondering why I didn't cover the release of the AI co-scientist from Google. And it's because Google and DeepMind have been giving mixed signals about how effective this agent actually is.

The write-up implies you now have an assistant which can turbocharge your research by suggesting ideas. This is across STEM domains. Now, I am not a biologist or chemist so I can't verify any of these claims or check them, but in many of the reports on this development, others have done so.

For me, frankly, it's just too early to properly cover on the channel, but I'll just give you two bits of evidence why I'm hesitant. First, Gemini Flash 2 and its deep research which just frankly doesn't compare to OpenAI's deep research. It is jam-packed with hallucinations. And second is Demis Hassabis, CEO of Google DeepMind's own words saying we are years away from systems that can invent their own hypotheses.

This interview came just a couple of weeks before the release of the co-scientist model so he would have known about that model when he said this. And I think one thing that's clearly missing and I always, always had as a benchmark for AGI was the ability for these systems to invent their own hypotheses or conjectures about science, not just prove existing ones.

So of course that's extremely useful already to prove an existing maths conjecture or something like that or play a game of Go to a world champion level, but could a system invent Go? Could it come up with a new Riemann hypothesis or could it come up with relativity back in the days that Einstein did it with the information that he had?

And I think today's systems are still pretty far away from having that kind of creative, inventive capability. Okay, so a couple years away till we hit AGI. I think, you know, I would say probably like three to five years. Now I can't finish this video without briefly covering some of the demos that have come out recently with humanoid robotics.

Yes, it was impressive seeing robots carefully put away groceries, but we had seen something like that before. For me, the bigger development was how they worked seamlessly together on one neural network, a single set of weights that run simultaneously on two robots. That specifically had never been seen before and it evokes in my mind all sorts of images of like a regiment of robots all controlled by a single neural network.

Now figure AI didn't release a full on paper, but the demo was good enough for me to want to cover it. And they admit being eager to see what happens when we scale Helix by a thousand X and beyond. I'm sure you've all noticed the same thing, but for me, humanoid robots are just getting smoother in their movements and more naturally merging with language models.

They can see, hear, listen, speak and move with, what is it now, 35 degrees of freedom, climb up hills and respond to requests that they're not pre-programmed with because they're based on neural networks. Of course, it is so easy to underestimate the years and years of manufacturing scaling that would have to happen to produce millions of robots, but it has not escaped my attention how much better humanoid robots are getting.

I might previously have thought that there'd be a lag of a decade between digital AGI, if you will, and robotic AGI, but that seems pessimistic or optimistic depending on your point of view. One thing I don't want to see come soon or anytime actually is this proto-clone, the world's first quote, bipedal musculoskeletal Android.

Like why, why are you making this? Who wants this? It's just awful. Can we just please leave skin and muscles to living entities? Anyway, speaking of living entities, it seems like the testers who've been playing about with GPT-4.5 say that they can "feel" the AGI, but of course only time will tell.

Leaks reported in The Verge four or five days ago suggest that it might be coming out this week. There's a tiny chance, of course, that by the time I edit this video, GPT-4.5 is out and like, wow, does that mean I do another video tonight? Who knows? Sam Ullman has said that what will distinguish GPT-4.5 and GPT-5 is that with GPT-5, everything will be rolled into one.

That's when you'll get O3 and likely Operator and Deep Research all part of one bigger model. May even be O4 by then. GPT-4.5, codenamed Orion, just seems to be a bigger base model. It will be their quote, "last non-chain-of-thought model." Think of that as like the true successor to GPT-4.

It's actually weird to think that OpenAI originally bet everything on just that pre-training scaling up to GPT-4.5 and 5. Now, of course, they have other axes like Agent Hood and scaling up the thinking time, but originally all their bets lay on scaling up the base model to produce something like GPT-4.5.

So we'll have to see how that model performs. Thank you as ever for watching to the end and bearing with me while my voice gave out on me over these last few days. As you can tell, it's mostly recovered. I hope you've used at least part of that time checking out amazing AI-focused YouTube channels like The Tech Trance, delivered by the inimitable Tam.

Hugely underrated. I know she has no idea that I was planning to say this. So do check it out and say you came from me. So let me know what you think about any part of this video. Covered a lot, of course. And yes, the AI world just keeps spinning.

Have a wonderful day.