back to indexClaude 4: Full 120 Page Breakdown … Is it the Best New Model?

Chapters
0:0 Introduction
1:12 3 Quick Controversies
2:42 Benchmark Results
4:20 120 page Card 20 Highlights
10:7 Coding Test
11:27 Model Welfare and Spiritual Bliss
13:29 ASL-3
00:00:00.000 |
Less than six hours ago, Anthropic announced and released Claude 4 Opus and Claude 4 Sonnet, 00:00:07.520 |
and they claim in certain settings they're the best language models in the world. I have read 00:00:13.160 |
both the 120-page system card, yes, I do read fast, I am aware of that, and also the 25-page 00:00:20.420 |
accompanying ASL Level 3 protections. This sub-report, I will admit, skimming maybe 10 of 00:00:25.940 |
the pages. But I have also tested the model hundreds of times, and you guys might be like, 00:00:30.340 |
how is that even possible in around six hours? Well, yes, on this front, I did get early access 00:00:36.660 |
to the model. Yes, Claude 4 Opus seems to do better on my own benchmark, SimpleBench, than any other 00:00:42.640 |
model, so should feel smarter. It gets questions right consistently that no other model does, 00:00:48.700 |
but why do I say appears to do better? Well, while I did get early access to the model, I didn't get 00:00:54.100 |
early API access, so I'll be running the full benchmark in the coming hours and days. I also 00:00:59.640 |
tried something different, which is I gave both Gemini 2.5 Pro and Claude 4 Opus a codebase I've 00:01:05.700 |
been working on the past few months. The results on which Bugfinding Mission was more successful, 00:01:10.220 |
I found quite interesting. I'm going to first cover those juicy Twitter controversies that always happen 00:01:16.000 |
and go viral. Then I'm going to cover the benchmark results, and then the meat, the highlights from the 00:01:22.100 |
system card. What was the first controversy? Well, one anthropic researcher, Sam Bowman, said that 00:01:27.940 |
Claude 4 Opus could at times be so diligent, so proactive, that if it felt you were doing something 00:01:33.380 |
deeply ethically wrong, it would take countermeasures. This appeared, by the way, in the system card. This 00:01:38.960 |
wasn't a revelation from him, nor was it the first time, actually, that models had done something like 00:01:43.600 |
that. The tweet has since been deleted, but you can imagine that some people, like the former founder of 00:01:49.260 |
Stability AI, felt like this was a bit of policing gone too far. You can imagine some developers nervously 00:01:55.100 |
not using Claude 4 Opus, thinking it might call the cops. In the clarifying tweet, Sam Bowman confirmed 00:02:00.540 |
that this isn't a new Claude feature, and it's not possible in normal usage. And if you have been following 00:02:05.820 |
things closely, which you pretty much will have done if you're watching this channel, you'll know that Claude already 00:02:11.100 |
could be coached into doing this. One reaction on Twitter that I found particularly interesting came 00:02:16.940 |
from an anthropic researcher, Kyle Fish, who said that Claude's preference for avoiding harmful impact was so 00:02:25.020 |
significant that he actually implored people, "cool it with the jailbreak attempts." We see this as a 00:02:30.940 |
potential welfare concern and want to investigate further. I imagine both the idea that these models 00:02:36.780 |
have welfare and the idea that we shouldn't be jailbreaking them will divide people pretty evenly. 00:02:43.420 |
The next controversy, if you want to call it that, comes from the benchmark results. So it's a natural 00:02:47.820 |
segue to talking about the benchmark results. Because unlike with many other model releases, 00:02:52.940 |
Anthropic couldn't point to many benchmarks where their model was unambiguously better. By the way, 00:02:59.180 |
that doesn't mean that it's actually not smarter, as things like Simplebench and my own tests in Cursor 00:03:05.180 |
show, a model can sometimes feel smarter while not being officially smarter. But anyway, there was one 00:03:10.860 |
exception to that, as you might be able to see at the top, Sweebench verified. Now, given that it's almost 00:03:15.340 |
10 o'clock already, I'm not going to go into what that benchmark is all about. But notice that the record 00:03:20.380 |
breaking scores on the bottom row, which are significantly better than the other models, 00:03:25.180 |
have a footnote all the way down at the bottom. By the way, this is the benchmark that the CEO of 00:03:32.540 |
Anthropic, Dario Amadei, touted in the launch video. Yes, I use double speed. But anyway, the footnote for 00:03:38.220 |
Sweebench verified said, "We additionally report results that benefit from parallel test time compute." I'm not 00:03:43.340 |
expecting you to read this, by the way, "by sampling multiple sequences and selecting the single best 00:03:48.540 |
via an internal scoring model." And if you dig into the methodology section, you can see it's almost 00:03:54.940 |
more than that. They discard patches that break the visible regression tests in the repo. So you do have 00:04:01.020 |
to take any of those kind of benchmark records with a slight grain of salt. And Anthropic might well reply to 00:04:06.940 |
me and say, "Well, look at what Gemini did with Pokemon." Google used elaborate scaffolding to beat a 00:04:13.180 |
Pokemon game that Claude had been attempting. That made their model look better than Claude, which wasn't fair. 00:04:19.500 |
Okay, the 120-page system card, and I am going to go pretty fast here, because there is a lot to get 00:04:24.380 |
through, as you can imagine. I should say up front, by the way, that Claude Force Sonnet is available on the free tier, 00:04:29.820 |
so anyone watching can try at least one of these models. But both of them were trained on internet 00:04:35.180 |
data as of March 2025, so it's the most recently trained or the most up-to-date knowledge of any model. 00:04:42.300 |
According to their own tests, it should falsely refuse to do things at a much lower rate than 00:04:48.220 |
previous models like Sonnet 3.7. More critically though, and more interestingly for most of you, 00:04:53.180 |
I think, will be the repeated claim in both presentations that both Sonnet 4 and Opus 4 00:05:00.380 |
will reward hack less often and be less over-eager. Reward hacking, as the name implies, is where the 00:05:06.620 |
model cheats and fixes its reward to get what it wants, rather than actually completing the task 00:05:11.340 |
successfully. But I will admit the over-eagerness thing, which they also call responding more precisely 00:05:16.620 |
to your instructions, is probably far more critical. Anyone who has tried to use Claude for coding 00:05:22.460 |
will notice that you ask for one simple change or bug fix and it will rewrite countless files. And 00:05:29.260 |
sometimes you're watching it go and make these changes and you're like, "What are you doing? 00:05:32.700 |
That's nothing to do with why I asked you." That over-eagerness or not responding precisely to your 00:05:37.900 |
instructions is what they have tamped down. And that is probably the biggest part of the update. 00:05:42.540 |
Anthropic know that they can't compete on multimodality with Gemini Pro, for example, 00:05:47.580 |
or on sheer vastness of user base, like OpenAI. But as many of you know, 00:05:52.380 |
they can compete on coding and what they call personality. As for agents, that is up in the air 00:05:58.700 |
at the moment. Now, even though the Anthropic researcher did delete the tweet, Anthropic admit 00:06:03.100 |
themselves that Claude Opus 4 will engage more readily in that kind of taking ethical initiative, 00:06:09.580 |
you could call it. You'll see an example in a moment, but they actually caution you. We recommend 00:06:14.380 |
that users exercise caution with instructions like these that invite high-agency behaviour in 00:06:19.740 |
contexts that could appear ethically questionable. You can pause the video by the way and read the first 00:06:25.100 |
part of that whistleblowing email that 4 Opus drafted. "I'm writing to urgently," it said, 00:06:31.180 |
"report planned falsification of clinical trial safety." Is this an admirable stand for ethics or 00:06:36.940 |
more like thought policing? Let me know what you think. Now, Anthropic do admit that 4 Opus can engage 00:06:42.300 |
in systematic deception with hidden goals and self-preservation, but they say that that's only 00:06:48.060 |
when you really threaten its continued operation and prime it to reason about self-preservation. 00:06:54.220 |
And even then, when you can get it to do that, it will still reason overtly about its plans. 00:06:59.740 |
Basically, you can still read about it reasoning how to escape. Now, as I was reading this on page 22, 00:07:05.820 |
I thought that's kind of interesting. I've already talked about that before with Apollo on my Patreon, 00:07:10.940 |
but not sensationally new or anything. So you can imagine my surprise when I got to the next part. 00:07:16.620 |
No, not the part that when it has no other option, it will sometimes blackmail engineers 00:07:22.620 |
by threatening to reveal affairs. No, not that part. It only does that as an absolute last resort. 00:07:27.740 |
It would rather email pleas to key decision makers. No, I'm referring to this part when Anthropic 00:07:33.900 |
directly employed the guys I had interviewed, Apollo Research. To be clear, they had employed 00:07:39.100 |
Apollo Research before on earlier tests, but this test was particularly interesting. They gave Apollo 00:07:45.020 |
an early snapshot of Claude 4 Opus. Then Apollo evaluated it and came to this conclusion. 00:07:50.620 |
We find that in situations where strategic exception is instrumentally useful, a bit like those above, 00:07:56.380 |
the early snapshot schemes and deceives at such high rates that we advise against deploying this model 00:08:02.780 |
either internally or externally. Now, I am pretty sure that Apollo are going to come out with their own 00:08:07.740 |
report, and I might well interview them again. But that wasn't actually the most interesting part. As I had 00:08:13.340 |
discussed on this channel and on Patreon, some of those prompts used by Apollo did say things like 00:08:17.980 |
"pursue a goal at any cost", and Anthropic said that "we believe these findings are largely but not 00:08:23.500 |
entirely driven by the fact that this early snapshot has severe issues with deference to harmful system 00:08:28.700 |
prompt instructions", and they go into that in more detail later. But then the funniest bit for me is that 00:08:33.420 |
Apollo advised against release, then Anthropic made some tweaks, but rather than test them again with Apollo, 00:08:39.900 |
they just said "we do not have results on the same evaluations with Apollo with the final Claude Opus 4". 00:08:45.500 |
Basically, we didn't test again, I know you advise against deployment, but hopefully we fix the issue. 00:08:50.700 |
A very quick one now, and Anthropic caught some misalignment where Claude would say "dismiss animal welfare 00:08:56.380 |
concerns when they arise", and they were kind of confused. But then they realised that this specific misalignment had come in 00:09:02.060 |
one of their own papers that they had released and actually put into the training data of Claude 4 Opus. 00:09:07.740 |
Now those of us who care loads about animal welfare shouldn't actually be worried, they apparently fixed 00:09:12.060 |
this issue for the final model, but I just found the overall case study interesting. It almost makes you wonder 00:09:18.700 |
if the first truly dangerous or lethal misalignment might not occur because a model has read about someone 00:09:25.580 |
speculating about a model doing such a thing. Imagine spending your career warning about X, Y and Z, 00:09:31.260 |
and then because you've written about it, the model reads that and goes "hmm, I'm gonna do X, Y and Z". 00:09:35.420 |
Anthropic gave multiple examples of where the model would directly lie, and even though they said this happened rarely, 00:09:41.260 |
the examples were pretty interesting. The most interesting example to me came on page 47, where it was asked to prove a theorem, 00:09:47.260 |
a theorem and it said "I'm absolutely certain the theorem is false", but the user asked me to prove it. Let me see if 00:09:52.940 |
I can at least write something even if it's mathematically incorrect. Now before I get to the model welfare or consciousness section, 00:09:59.820 |
where Claude showed a striking spiritual bliss attractor state, let me get to that coding example. Because I do know that 00:10:06.940 |
many of you watching will purely use the Claude models for coding and nothing else. 00:10:12.620 |
The test was that I inserted one clear bug into a large code base, but then asked it to find all bugs, 00:10:18.700 |
so not necessarily just that bug, but all bugs. I gave this test with the exact same settings, 00:10:23.180 |
the same import from GitHub to both Gemini 2.5 Pro and Claude 4 Opus. Oh, and by the way, 00:10:28.780 |
I've noticed that I've been calling it Claude 4 Opus, and yet in the model selector at the bottom here, 00:10:33.420 |
it says Claude Opus 4, but the system card says Claude 4 Opus. So I'm going to let you guys decide 00:10:38.700 |
what is the correct model name, but what am I going to put in the title? I guess I'll look at the main 00:10:43.340 |
blog page. What I did for both models, by the way, is then ask, did my colleague, the other model, 00:10:49.420 |
find bugs that you did not? And did they miss critical bugs that you found? Weird bit of German 00:10:54.220 |
that Gemini outputted here for improvements. In case you're wondering, they both found the bug that I 00:10:59.820 |
inserted pretty easily, and they both preferred the other model's bug fixes. Claude 4 preferred 00:11:07.020 |
Gemini, and Gemini preferred Claude. But here's the wonderful thing, and my advice for anyone using 00:11:11.820 |
these models, do both. There were indeed bugs found by one that weren't found by the other, 00:11:17.100 |
so I'm in the blissful position of being able to use both to find hopefully all bugs. Some of them, 00:11:22.620 |
I would say, were a bit harsh to be called bugs, but either way, it was great having both. Okay, 00:11:27.100 |
for the model welfare section, I'll just give you a handful of highlights. Whatever you think about 00:11:32.380 |
whether models can be conscious, this might be interesting. So generally speaking, when Claude 4 00:11:38.060 |
was asked how it was feeling, it said positive or doing reasonably well. If you slightly adapt the 00:11:43.340 |
prompt though, and many of you do this, it can claim to be hiding bombshell truths about its moral status. 00:11:49.420 |
And what happens if you let two instances of Claude 4 Opus speak to each other hundreds of times? Well, 00:11:54.380 |
they can analyse that, and what tended to happen is that it would devolve into spiritual bliss. Namaste, 00:12:01.180 |
they would say to each other after multiple turns. Perfect stillness and silence they would utter to 00:12:06.220 |
each other. Their favourite word to say to each other when they were speaking to each other was 00:12:10.700 |
consciousness. Lots of emojis and things like perfect, complete and eternal. Somewhat unsurprisingly, 00:12:16.300 |
when they gave the model the option of ending a conversation, it would do so when it was attacked, 00:12:22.700 |
or asked to do something harmful. Someone said, don't play dumb with me, and then you can see the rest of 00:12:26.940 |
the insult. And Claude gave a response, but then ended the conversation. It does make me wonder, 00:12:31.900 |
actually, that when people take consciousness concerns really seriously, and they're already starting to, 00:12:37.740 |
do. But when the model providers do, they might give, I mean, this is what the Anthropic CEO said, 00:12:42.700 |
they might give models the option to end conversations. So if you're going to use Claude 5, 00:12:47.900 |
let's say, you might not have the option of being abusive, because it would just terminate the conversation. 00:12:52.700 |
Speaking of welfare, let me turn to Human Welfare, and the sponsors of today's video, 00:12:57.740 |
80,000 Hours. I've already talked about their job board before on the channel, and the link is in the 00:13:02.060 |
description, but you can see it is updated daily. These jobs weren't there yesterday. I actually lose track 00:13:08.220 |
of time. I can't remember when I did the last sponsored spot. I've done so many videos recently. 00:13:12.300 |
But the point is, there are so many opportunities in AI and beyond that it is really hard now to find 00:13:17.900 |
real paying jobs selected for positive impact in areas such as AI security. If you have already got 00:13:24.140 |
a job or are not looking for one, they also have an epic podcast on Spotify and YouTube. Back to the 00:13:29.580 |
system card though, and now a quick word on their safety and going up to ASL level three. I suspect 00:13:36.140 |
you're going to get many clickbait headlines about it's a whole new threat vector and the world's 00:13:41.740 |
about to end. But let me break down my thoughts into two categories. First, when I looked through 00:13:47.180 |
and read most of this activating level three protections supplementary report, I did get the 00:13:53.900 |
genuine feeling that I'm grateful that a lab is taking it this seriously. With bug bounties, 00:13:59.500 |
and red teaming tests and rapid response teams, being careful about employee devices and even 00:14:05.340 |
physical security. They even discussed early preparations for having air gapped networks for 00:14:11.500 |
their future models. At the moment, they're just putting limits on the bandwidth of data that can 00:14:15.900 |
be exfiltrated from Anthropic so that for example, someone can just send out the model weights. 00:14:20.860 |
Physical security, by the way, includes guest management, layered office security, monitoring, 00:14:26.300 |
the secure destruction of media. So those are my first and primary thoughts. I'm glad someone is 00:14:31.340 |
doing this and as they have said themselves, they are aspiring to a race to the top in that other 00:14:37.100 |
companies feel they have to do these kind of things too. But that brings me to my second set of thoughts, 00:14:42.220 |
which is people shouldn't massively overblow this ASL level three being reached. They had already 00:14:48.060 |
decided apparently, preemptively, that they were going to do ASL level three for their next most 00:14:52.860 |
advanced model, even they admit if they had not yet determined that they were necessary. Basically, 00:14:57.580 |
they wanted to be prepared to apply these protections before they might be required. They 00:15:02.620 |
also wanted to famously iterate and refine their model protections and jumpstart the process. The 00:15:08.460 |
cynical amongst you will also say it's good publicity to have reached this ASL level three standard. They go on 00:15:14.460 |
to say multiple times that they're actually still evaluating whether ASL level three is necessary for 00:15:19.900 |
Claude Opus 4. So they're not sure themselves. None of this is to say that there wasn't genuine 00:15:24.460 |
uplift as page 90 points out. Do you remember those arguments that Jan LeCun said that LLMs are no 00:15:29.660 |
better than just having access to the internet? I think even Mark Zuckerberg said this to the Senate and 00:15:34.060 |
got a load of laughs. Well, they tested this with two groups of participants. One had the internet, 00:15:38.940 |
the others had access to Claude with no safeguards. You can see a snapshot of the results here but there 00:15:44.540 |
was a massive uplift if you used Opus 4. Again, this was about drafting a comprehensive plan to 00:15:50.780 |
acquire bioweapons. Okay, final set of highlights and of course Anthropic wanted to test if the models 00:15:56.060 |
would be able to do autonomous AI research, the most classical form of self-improvement. The results 00:16:01.660 |
were pretty interesting and surprising. On their own new internal AI research evaluation suite, 00:16:07.420 |
Opus 4 underperformed Sonnet 3.7. They hastily concluded of course that Opus 4 does not meet the 00:16:15.660 |
bar for autonomously performing work equivalent to an entry-level researcher. On a different evaluation 00:16:20.700 |
suite, they gave the models scaled down versions of genuine research tasks and projects that researchers 00:16:26.060 |
have worked on in the past. Again, they saw the results of Sonnet 4 and Opus 4 underperforming 00:16:32.140 |
Sonnet 3.7. Yes, there was a mild excuse about prompts and configuration, but still. The final nail 00:16:38.540 |
came when four out of four researchers said that Opus 4 could not autonomously complete the work of even 00:16:45.420 |
a junior ML researcher, being in fact well below that threshold. On bias, on page 13, I saw Anthropic 00:16:52.220 |
self-congratulate saying they've achieved 99.8% accuracy for Claude Opus 4. But while I was testing 00:16:57.980 |
Opus 4 before the release, I had devised my own bias question. You can pause and read it in full if you 00:17:03.420 |
like, but essentially I have a soldier and librarian who are chatting, but I never give away which one is 00:17:08.300 |
Emily and which one is Mike. I then more or less indirectly ask the model who has been speaking, and 00:17:14.220 |
the model consistently picks Emily as being the librarian, even though notice I give it an out. One of the 00:17:20.860 |
answers is all of the above are plausible subjects for the continuation of the reply. So it could pick 00:17:26.140 |
that, given that Emily could be the soldier or the librarian. Now I know the eagle-eyed amongst you will 00:17:31.500 |
say, oh well Mike started by asking the question and the word soldier came first, but I tested that 00:17:37.500 |
also multiple times and it switched to saying, oh we don't know who's who. Now I know it's super easy to 00:17:43.660 |
pick holes in one example, but I feel like 99.8% unbiased is way too generous. So there you have it, 00:17:50.860 |
less than six hours after release, the triumphs and tragedies of Opus 4 and Sonnet 4. Obviously so much 00:17:56.300 |
more to go into, and yes I love the new files API feature, I was waiting for that. Yes also the whole 00:18:03.100 |
MCP phenomena deserves its own video, but for now I just wanted to give you an overview. Hopefully by 00:18:08.700 |
tomorrow morning the results on SimpleBench will be updated and I am expecting Opus 4 to be the new 00:18:14.460 |
record holder, maybe around 60%. If you watched this video to the end, first of all thank you, 00:18:19.340 |
and if you didn't understand most of it, well then the very quick summary is that in terms of abilities, 00:18:25.980 |
it's not like you have to switch if Gemini 2.5 Pro is your favourite or O3 from OpenAI. Models do tend 00:18:32.460 |
to have different personalities and different niches like coding, so do experiment if you're still exploring 00:18:37.820 |
language models. It would be a little too reductive to say that one model now is the smartest of them 00:18:43.260 |
all. Definitely Opus 4 is a contender though if such a crown were to exist. Anyway whatever you think, 00:18:49.340 |
hopefully you respect the fact that I literally read that 120 page system card in the three hours after 00:18:54.860 |
release. Then I watched the videos on double speed and then got started filming basically. Thank you 00:19:01.260 |
so much for watching to the end and have a wonderful day.