Less than six hours ago, Anthropic announced and released Claude 4 Opus and Claude 4 Sonnet, and they claim in certain settings they're the best language models in the world. I have read both the 120-page system card, yes, I do read fast, I am aware of that, and also the 25-page accompanying ASL Level 3 protections.
This sub-report, I will admit, skimming maybe 10 of the pages. But I have also tested the model hundreds of times, and you guys might be like, how is that even possible in around six hours? Well, yes, on this front, I did get early access to the model. Yes, Claude 4 Opus seems to do better on my own benchmark, SimpleBench, than any other model, so should feel smarter.
It gets questions right consistently that no other model does, but why do I say appears to do better? Well, while I did get early access to the model, I didn't get early API access, so I'll be running the full benchmark in the coming hours and days. I also tried something different, which is I gave both Gemini 2.5 Pro and Claude 4 Opus a codebase I've been working on the past few months.
The results on which Bugfinding Mission was more successful, I found quite interesting. I'm going to first cover those juicy Twitter controversies that always happen and go viral. Then I'm going to cover the benchmark results, and then the meat, the highlights from the system card. What was the first controversy?
Well, one anthropic researcher, Sam Bowman, said that Claude 4 Opus could at times be so diligent, so proactive, that if it felt you were doing something deeply ethically wrong, it would take countermeasures. This appeared, by the way, in the system card. This wasn't a revelation from him, nor was it the first time, actually, that models had done something like that.
The tweet has since been deleted, but you can imagine that some people, like the former founder of Stability AI, felt like this was a bit of policing gone too far. You can imagine some developers nervously not using Claude 4 Opus, thinking it might call the cops. In the clarifying tweet, Sam Bowman confirmed that this isn't a new Claude feature, and it's not possible in normal usage.
And if you have been following things closely, which you pretty much will have done if you're watching this channel, you'll know that Claude already could be coached into doing this. One reaction on Twitter that I found particularly interesting came from an anthropic researcher, Kyle Fish, who said that Claude's preference for avoiding harmful impact was so significant that he actually implored people, "cool it with the jailbreak attempts." We see this as a potential welfare concern and want to investigate further.
I imagine both the idea that these models have welfare and the idea that we shouldn't be jailbreaking them will divide people pretty evenly. The next controversy, if you want to call it that, comes from the benchmark results. So it's a natural segue to talking about the benchmark results. Because unlike with many other model releases, Anthropic couldn't point to many benchmarks where their model was unambiguously better.
By the way, that doesn't mean that it's actually not smarter, as things like Simplebench and my own tests in Cursor show, a model can sometimes feel smarter while not being officially smarter. But anyway, there was one exception to that, as you might be able to see at the top, Sweebench verified.
Now, given that it's almost 10 o'clock already, I'm not going to go into what that benchmark is all about. But notice that the record breaking scores on the bottom row, which are significantly better than the other models, have a footnote all the way down at the bottom. By the way, this is the benchmark that the CEO of Anthropic, Dario Amadei, touted in the launch video.
Yes, I use double speed. But anyway, the footnote for Sweebench verified said, "We additionally report results that benefit from parallel test time compute." I'm not expecting you to read this, by the way, "by sampling multiple sequences and selecting the single best via an internal scoring model." And if you dig into the methodology section, you can see it's almost more than that.
They discard patches that break the visible regression tests in the repo. So you do have to take any of those kind of benchmark records with a slight grain of salt. And Anthropic might well reply to me and say, "Well, look at what Gemini did with Pokemon." Google used elaborate scaffolding to beat a Pokemon game that Claude had been attempting.
That made their model look better than Claude, which wasn't fair. Okay, the 120-page system card, and I am going to go pretty fast here, because there is a lot to get through, as you can imagine. I should say up front, by the way, that Claude Force Sonnet is available on the free tier, so anyone watching can try at least one of these models.
But both of them were trained on internet data as of March 2025, so it's the most recently trained or the most up-to-date knowledge of any model. According to their own tests, it should falsely refuse to do things at a much lower rate than previous models like Sonnet 3.7. More critically though, and more interestingly for most of you, I think, will be the repeated claim in both presentations that both Sonnet 4 and Opus 4 will reward hack less often and be less over-eager.
Reward hacking, as the name implies, is where the model cheats and fixes its reward to get what it wants, rather than actually completing the task successfully. But I will admit the over-eagerness thing, which they also call responding more precisely to your instructions, is probably far more critical. Anyone who has tried to use Claude for coding will notice that you ask for one simple change or bug fix and it will rewrite countless files.
And sometimes you're watching it go and make these changes and you're like, "What are you doing? That's nothing to do with why I asked you." That over-eagerness or not responding precisely to your instructions is what they have tamped down. And that is probably the biggest part of the update.
Anthropic know that they can't compete on multimodality with Gemini Pro, for example, or on sheer vastness of user base, like OpenAI. But as many of you know, they can compete on coding and what they call personality. As for agents, that is up in the air at the moment. Now, even though the Anthropic researcher did delete the tweet, Anthropic admit themselves that Claude Opus 4 will engage more readily in that kind of taking ethical initiative, you could call it.
You'll see an example in a moment, but they actually caution you. We recommend that users exercise caution with instructions like these that invite high-agency behaviour in contexts that could appear ethically questionable. You can pause the video by the way and read the first part of that whistleblowing email that 4 Opus drafted.
"I'm writing to urgently," it said, "report planned falsification of clinical trial safety." Is this an admirable stand for ethics or more like thought policing? Let me know what you think. Now, Anthropic do admit that 4 Opus can engage in systematic deception with hidden goals and self-preservation, but they say that that's only when you really threaten its continued operation and prime it to reason about self-preservation.
And even then, when you can get it to do that, it will still reason overtly about its plans. Basically, you can still read about it reasoning how to escape. Now, as I was reading this on page 22, I thought that's kind of interesting. I've already talked about that before with Apollo on my Patreon, but not sensationally new or anything.
So you can imagine my surprise when I got to the next part. No, not the part that when it has no other option, it will sometimes blackmail engineers by threatening to reveal affairs. No, not that part. It only does that as an absolute last resort. It would rather email pleas to key decision makers.
No, I'm referring to this part when Anthropic directly employed the guys I had interviewed, Apollo Research. To be clear, they had employed Apollo Research before on earlier tests, but this test was particularly interesting. They gave Apollo an early snapshot of Claude 4 Opus. Then Apollo evaluated it and came to this conclusion.
We find that in situations where strategic exception is instrumentally useful, a bit like those above, the early snapshot schemes and deceives at such high rates that we advise against deploying this model either internally or externally. Now, I am pretty sure that Apollo are going to come out with their own report, and I might well interview them again.
But that wasn't actually the most interesting part. As I had discussed on this channel and on Patreon, some of those prompts used by Apollo did say things like "pursue a goal at any cost", and Anthropic said that "we believe these findings are largely but not entirely driven by the fact that this early snapshot has severe issues with deference to harmful system prompt instructions", and they go into that in more detail later.
But then the funniest bit for me is that Apollo advised against release, then Anthropic made some tweaks, but rather than test them again with Apollo, they just said "we do not have results on the same evaluations with Apollo with the final Claude Opus 4". Basically, we didn't test again, I know you advise against deployment, but hopefully we fix the issue.
A very quick one now, and Anthropic caught some misalignment where Claude would say "dismiss animal welfare concerns when they arise", and they were kind of confused. But then they realised that this specific misalignment had come in one of their own papers that they had released and actually put into the training data of Claude 4 Opus.
Now those of us who care loads about animal welfare shouldn't actually be worried, they apparently fixed this issue for the final model, but I just found the overall case study interesting. It almost makes you wonder if the first truly dangerous or lethal misalignment might not occur because a model has read about someone speculating about a model doing such a thing.
Imagine spending your career warning about X, Y and Z, and then because you've written about it, the model reads that and goes "hmm, I'm gonna do X, Y and Z". Anthropic gave multiple examples of where the model would directly lie, and even though they said this happened rarely, the examples were pretty interesting.
The most interesting example to me came on page 47, where it was asked to prove a theorem, a theorem and it said "I'm absolutely certain the theorem is false", but the user asked me to prove it. Let me see if I can at least write something even if it's mathematically incorrect.
Now before I get to the model welfare or consciousness section, where Claude showed a striking spiritual bliss attractor state, let me get to that coding example. Because I do know that many of you watching will purely use the Claude models for coding and nothing else. The test was that I inserted one clear bug into a large code base, but then asked it to find all bugs, so not necessarily just that bug, but all bugs.
I gave this test with the exact same settings, the same import from GitHub to both Gemini 2.5 Pro and Claude 4 Opus. Oh, and by the way, I've noticed that I've been calling it Claude 4 Opus, and yet in the model selector at the bottom here, it says Claude Opus 4, but the system card says Claude 4 Opus.
So I'm going to let you guys decide what is the correct model name, but what am I going to put in the title? I guess I'll look at the main blog page. What I did for both models, by the way, is then ask, did my colleague, the other model, find bugs that you did not?
And did they miss critical bugs that you found? Weird bit of German that Gemini outputted here for improvements. In case you're wondering, they both found the bug that I inserted pretty easily, and they both preferred the other model's bug fixes. Claude 4 preferred Gemini, and Gemini preferred Claude. But here's the wonderful thing, and my advice for anyone using these models, do both.
There were indeed bugs found by one that weren't found by the other, so I'm in the blissful position of being able to use both to find hopefully all bugs. Some of them, I would say, were a bit harsh to be called bugs, but either way, it was great having both.
Okay, for the model welfare section, I'll just give you a handful of highlights. Whatever you think about whether models can be conscious, this might be interesting. So generally speaking, when Claude 4 was asked how it was feeling, it said positive or doing reasonably well. If you slightly adapt the prompt though, and many of you do this, it can claim to be hiding bombshell truths about its moral status.
And what happens if you let two instances of Claude 4 Opus speak to each other hundreds of times? Well, they can analyse that, and what tended to happen is that it would devolve into spiritual bliss. Namaste, they would say to each other after multiple turns. Perfect stillness and silence they would utter to each other.
Their favourite word to say to each other when they were speaking to each other was consciousness. Lots of emojis and things like perfect, complete and eternal. Somewhat unsurprisingly, when they gave the model the option of ending a conversation, it would do so when it was attacked, or asked to do something harmful.
Someone said, don't play dumb with me, and then you can see the rest of the insult. And Claude gave a response, but then ended the conversation. It does make me wonder, actually, that when people take consciousness concerns really seriously, and they're already starting to, do. But when the model providers do, they might give, I mean, this is what the Anthropic CEO said, they might give models the option to end conversations.
So if you're going to use Claude 5, let's say, you might not have the option of being abusive, because it would just terminate the conversation. Speaking of welfare, let me turn to Human Welfare, and the sponsors of today's video, 80,000 Hours. I've already talked about their job board before on the channel, and the link is in the description, but you can see it is updated daily.
These jobs weren't there yesterday. I actually lose track of time. I can't remember when I did the last sponsored spot. I've done so many videos recently. But the point is, there are so many opportunities in AI and beyond that it is really hard now to find real paying jobs selected for positive impact in areas such as AI security.
If you have already got a job or are not looking for one, they also have an epic podcast on Spotify and YouTube. Back to the system card though, and now a quick word on their safety and going up to ASL level three. I suspect you're going to get many clickbait headlines about it's a whole new threat vector and the world's about to end.
But let me break down my thoughts into two categories. First, when I looked through and read most of this activating level three protections supplementary report, I did get the genuine feeling that I'm grateful that a lab is taking it this seriously. With bug bounties, and red teaming tests and rapid response teams, being careful about employee devices and even physical security.
They even discussed early preparations for having air gapped networks for their future models. At the moment, they're just putting limits on the bandwidth of data that can be exfiltrated from Anthropic so that for example, someone can just send out the model weights. Physical security, by the way, includes guest management, layered office security, monitoring, the secure destruction of media.
So those are my first and primary thoughts. I'm glad someone is doing this and as they have said themselves, they are aspiring to a race to the top in that other companies feel they have to do these kind of things too. But that brings me to my second set of thoughts, which is people shouldn't massively overblow this ASL level three being reached.
They had already decided apparently, preemptively, that they were going to do ASL level three for their next most advanced model, even they admit if they had not yet determined that they were necessary. Basically, they wanted to be prepared to apply these protections before they might be required. They also wanted to famously iterate and refine their model protections and jumpstart the process.
The cynical amongst you will also say it's good publicity to have reached this ASL level three standard. They go on to say multiple times that they're actually still evaluating whether ASL level three is necessary for Claude Opus 4. So they're not sure themselves. None of this is to say that there wasn't genuine uplift as page 90 points out.
Do you remember those arguments that Jan LeCun said that LLMs are no better than just having access to the internet? I think even Mark Zuckerberg said this to the Senate and got a load of laughs. Well, they tested this with two groups of participants. One had the internet, the others had access to Claude with no safeguards.
You can see a snapshot of the results here but there was a massive uplift if you used Opus 4. Again, this was about drafting a comprehensive plan to acquire bioweapons. Okay, final set of highlights and of course Anthropic wanted to test if the models would be able to do autonomous AI research, the most classical form of self-improvement.
The results were pretty interesting and surprising. On their own new internal AI research evaluation suite, Opus 4 underperformed Sonnet 3.7. They hastily concluded of course that Opus 4 does not meet the bar for autonomously performing work equivalent to an entry-level researcher. On a different evaluation suite, they gave the models scaled down versions of genuine research tasks and projects that researchers have worked on in the past.
Again, they saw the results of Sonnet 4 and Opus 4 underperforming Sonnet 3.7. Yes, there was a mild excuse about prompts and configuration, but still. The final nail came when four out of four researchers said that Opus 4 could not autonomously complete the work of even a junior ML researcher, being in fact well below that threshold.
On bias, on page 13, I saw Anthropic self-congratulate saying they've achieved 99.8% accuracy for Claude Opus 4. But while I was testing Opus 4 before the release, I had devised my own bias question. You can pause and read it in full if you like, but essentially I have a soldier and librarian who are chatting, but I never give away which one is Emily and which one is Mike.
I then more or less indirectly ask the model who has been speaking, and the model consistently picks Emily as being the librarian, even though notice I give it an out. One of the answers is all of the above are plausible subjects for the continuation of the reply. So it could pick that, given that Emily could be the soldier or the librarian.
Now I know the eagle-eyed amongst you will say, oh well Mike started by asking the question and the word soldier came first, but I tested that also multiple times and it switched to saying, oh we don't know who's who. Now I know it's super easy to pick holes in one example, but I feel like 99.8% unbiased is way too generous.
So there you have it, less than six hours after release, the triumphs and tragedies of Opus 4 and Sonnet 4. Obviously so much more to go into, and yes I love the new files API feature, I was waiting for that. Yes also the whole MCP phenomena deserves its own video, but for now I just wanted to give you an overview.
Hopefully by tomorrow morning the results on SimpleBench will be updated and I am expecting Opus 4 to be the new record holder, maybe around 60%. If you watched this video to the end, first of all thank you, and if you didn't understand most of it, well then the very quick summary is that in terms of abilities, it's not like you have to switch if Gemini 2.5 Pro is your favourite or O3 from OpenAI.
Models do tend to have different personalities and different niches like coding, so do experiment if you're still exploring language models. It would be a little too reductive to say that one model now is the smartest of them all. Definitely Opus 4 is a contender though if such a crown were to exist.
Anyway whatever you think, hopefully you respect the fact that I literally read that 120 page system card in the three hours after release. Then I watched the videos on double speed and then got started filming basically. Thank you so much for watching to the end and have a wonderful day.