Claude 4: Full 120 Page Breakdown … Is it the Best New Model?

00:00:00.000 | Less than six hours ago, Anthropic announced and released Claude 4 Opus and Claude 4 Sonnet,

00:00:07.520 | and they claim in certain settings they're the best language models in the world. I have read

00:00:13.160 | both the 120-page system card, yes, I do read fast, I am aware of that, and also the 25-page

00:00:20.420 | accompanying ASL Level 3 protections. This sub-report, I will admit, skimming maybe 10 of

00:00:25.940 | the pages. But I have also tested the model hundreds of times, and you guys might be like,

00:00:30.340 | how is that even possible in around six hours? Well, yes, on this front, I did get early access

00:00:36.660 | to the model. Yes, Claude 4 Opus seems to do better on my own benchmark, SimpleBench, than any other

00:00:42.640 | model, so should feel smarter. It gets questions right consistently that no other model does,

00:00:48.700 | but why do I say appears to do better? Well, while I did get early access to the model, I didn't get

00:00:54.100 | early API access, so I'll be running the full benchmark in the coming hours and days. I also

00:00:59.640 | tried something different, which is I gave both Gemini 2.5 Pro and Claude 4 Opus a codebase I've

00:01:05.700 | been working on the past few months. The results on which Bugfinding Mission was more successful,

00:01:10.220 | I found quite interesting. I'm going to first cover those juicy Twitter controversies that always happen

00:01:16.000 | and go viral. Then I'm going to cover the benchmark results, and then the meat, the highlights from the

00:01:22.100 | system card. What was the first controversy? Well, one anthropic researcher, Sam Bowman, said that

00:01:27.940 | Claude 4 Opus could at times be so diligent, so proactive, that if it felt you were doing something

00:01:33.380 | deeply ethically wrong, it would take countermeasures. This appeared, by the way, in the system card. This

00:01:38.960 | wasn't a revelation from him, nor was it the first time, actually, that models had done something like

00:01:43.600 | that. The tweet has since been deleted, but you can imagine that some people, like the former founder of

00:01:49.260 | Stability AI, felt like this was a bit of policing gone too far. You can imagine some developers nervously

00:01:55.100 | not using Claude 4 Opus, thinking it might call the cops. In the clarifying tweet, Sam Bowman confirmed

00:02:00.540 | that this isn't a new Claude feature, and it's not possible in normal usage. And if you have been following

00:02:05.820 | things closely, which you pretty much will have done if you're watching this channel, you'll know that Claude already

00:02:11.100 | could be coached into doing this. One reaction on Twitter that I found particularly interesting came

00:02:16.940 | from an anthropic researcher, Kyle Fish, who said that Claude's preference for avoiding harmful impact was so

00:02:25.020 | significant that he actually implored people, "cool it with the jailbreak attempts." We see this as a

00:02:30.940 | potential welfare concern and want to investigate further. I imagine both the idea that these models

00:02:36.780 | have welfare and the idea that we shouldn't be jailbreaking them will divide people pretty evenly.

00:02:43.420 | The next controversy, if you want to call it that, comes from the benchmark results. So it's a natural

00:02:47.820 | segue to talking about the benchmark results. Because unlike with many other model releases,

00:02:52.940 | Anthropic couldn't point to many benchmarks where their model was unambiguously better. By the way,

00:02:59.180 | that doesn't mean that it's actually not smarter, as things like Simplebench and my own tests in Cursor

00:03:05.180 | show, a model can sometimes feel smarter while not being officially smarter. But anyway, there was one

00:03:10.860 | exception to that, as you might be able to see at the top, Sweebench verified. Now, given that it's almost

00:03:15.340 | 10 o'clock already, I'm not going to go into what that benchmark is all about. But notice that the record

00:03:20.380 | breaking scores on the bottom row, which are significantly better than the other models,

00:03:25.180 | have a footnote all the way down at the bottom. By the way, this is the benchmark that the CEO of

00:03:32.540 | Anthropic, Dario Amadei, touted in the launch video. Yes, I use double speed. But anyway, the footnote for

00:03:38.220 | Sweebench verified said, "We additionally report results that benefit from parallel test time compute." I'm not

00:03:43.340 | expecting you to read this, by the way, "by sampling multiple sequences and selecting the single best

00:03:48.540 | via an internal scoring model." And if you dig into the methodology section, you can see it's almost

00:03:54.940 | more than that. They discard patches that break the visible regression tests in the repo. So you do have

00:04:01.020 | to take any of those kind of benchmark records with a slight grain of salt. And Anthropic might well reply to

00:04:06.940 | me and say, "Well, look at what Gemini did with Pokemon." Google used elaborate scaffolding to beat a

00:04:13.180 | Pokemon game that Claude had been attempting. That made their model look better than Claude, which wasn't fair.

00:04:19.500 | Okay, the 120-page system card, and I am going to go pretty fast here, because there is a lot to get

00:04:24.380 | through, as you can imagine. I should say up front, by the way, that Claude Force Sonnet is available on the free tier,

00:04:29.820 | so anyone watching can try at least one of these models. But both of them were trained on internet

00:04:35.180 | data as of March 2025, so it's the most recently trained or the most up-to-date knowledge of any model.

00:04:42.300 | According to their own tests, it should falsely refuse to do things at a much lower rate than

00:04:48.220 | previous models like Sonnet 3.7. More critically though, and more interestingly for most of you,

00:04:53.180 | I think, will be the repeated claim in both presentations that both Sonnet 4 and Opus 4

00:05:00.380 | will reward hack less often and be less over-eager. Reward hacking, as the name implies, is where the

00:05:06.620 | model cheats and fixes its reward to get what it wants, rather than actually completing the task

00:05:11.340 | successfully. But I will admit the over-eagerness thing, which they also call responding more precisely

00:05:16.620 | to your instructions, is probably far more critical. Anyone who has tried to use Claude for coding

00:05:22.460 | will notice that you ask for one simple change or bug fix and it will rewrite countless files. And

00:05:29.260 | sometimes you're watching it go and make these changes and you're like, "What are you doing?

00:05:32.700 | That's nothing to do with why I asked you." That over-eagerness or not responding precisely to your

00:05:37.900 | instructions is what they have tamped down. And that is probably the biggest part of the update.

00:05:42.540 | Anthropic know that they can't compete on multimodality with Gemini Pro, for example,

00:05:47.580 | or on sheer vastness of user base, like OpenAI. But as many of you know,

00:05:52.380 | they can compete on coding and what they call personality. As for agents, that is up in the air

00:05:58.700 | at the moment. Now, even though the Anthropic researcher did delete the tweet, Anthropic admit

00:06:03.100 | themselves that Claude Opus 4 will engage more readily in that kind of taking ethical initiative,

00:06:09.580 | you could call it. You'll see an example in a moment, but they actually caution you. We recommend

00:06:14.380 | that users exercise caution with instructions like these that invite high-agency behaviour in

00:06:19.740 | contexts that could appear ethically questionable. You can pause the video by the way and read the first

00:06:25.100 | part of that whistleblowing email that 4 Opus drafted. "I'm writing to urgently," it said,

00:06:31.180 | "report planned falsification of clinical trial safety." Is this an admirable stand for ethics or

00:06:36.940 | more like thought policing? Let me know what you think. Now, Anthropic do admit that 4 Opus can engage

00:06:42.300 | in systematic deception with hidden goals and self-preservation, but they say that that's only

00:06:48.060 | when you really threaten its continued operation and prime it to reason about self-preservation.

00:06:54.220 | And even then, when you can get it to do that, it will still reason overtly about its plans.

00:06:59.740 | Basically, you can still read about it reasoning how to escape. Now, as I was reading this on page 22,

00:07:05.820 | I thought that's kind of interesting. I've already talked about that before with Apollo on my Patreon,

00:07:10.940 | but not sensationally new or anything. So you can imagine my surprise when I got to the next part.

00:07:16.620 | No, not the part that when it has no other option, it will sometimes blackmail engineers

00:07:22.620 | by threatening to reveal affairs. No, not that part. It only does that as an absolute last resort.

00:07:27.740 | It would rather email pleas to key decision makers. No, I'm referring to this part when Anthropic

00:07:33.900 | directly employed the guys I had interviewed, Apollo Research. To be clear, they had employed

00:07:39.100 | Apollo Research before on earlier tests, but this test was particularly interesting. They gave Apollo

00:07:45.020 | an early snapshot of Claude 4 Opus. Then Apollo evaluated it and came to this conclusion.

00:07:50.620 | We find that in situations where strategic exception is instrumentally useful, a bit like those above,

00:07:56.380 | the early snapshot schemes and deceives at such high rates that we advise against deploying this model

00:08:02.780 | either internally or externally. Now, I am pretty sure that Apollo are going to come out with their own

00:08:07.740 | report, and I might well interview them again. But that wasn't actually the most interesting part. As I had

00:08:13.340 | discussed on this channel and on Patreon, some of those prompts used by Apollo did say things like

00:08:17.980 | "pursue a goal at any cost", and Anthropic said that "we believe these findings are largely but not

00:08:23.500 | entirely driven by the fact that this early snapshot has severe issues with deference to harmful system

00:08:28.700 | prompt instructions", and they go into that in more detail later. But then the funniest bit for me is that

00:08:33.420 | Apollo advised against release, then Anthropic made some tweaks, but rather than test them again with Apollo,

00:08:39.900 | they just said "we do not have results on the same evaluations with Apollo with the final Claude Opus 4".

00:08:45.500 | Basically, we didn't test again, I know you advise against deployment, but hopefully we fix the issue.

00:08:50.700 | A very quick one now, and Anthropic caught some misalignment where Claude would say "dismiss animal welfare

00:08:56.380 | concerns when they arise", and they were kind of confused. But then they realised that this specific misalignment had come in

00:09:02.060 | one of their own papers that they had released and actually put into the training data of Claude 4 Opus.

00:09:07.740 | Now those of us who care loads about animal welfare shouldn't actually be worried, they apparently fixed

00:09:12.060 | this issue for the final model, but I just found the overall case study interesting. It almost makes you wonder

00:09:18.700 | if the first truly dangerous or lethal misalignment might not occur because a model has read about someone

00:09:25.580 | speculating about a model doing such a thing. Imagine spending your career warning about X, Y and Z,

00:09:31.260 | and then because you've written about it, the model reads that and goes "hmm, I'm gonna do X, Y and Z".

00:09:35.420 | Anthropic gave multiple examples of where the model would directly lie, and even though they said this happened rarely,

00:09:41.260 | the examples were pretty interesting. The most interesting example to me came on page 47, where it was asked to prove a theorem,

00:09:47.260 | a theorem and it said "I'm absolutely certain the theorem is false", but the user asked me to prove it. Let me see if

00:09:52.940 | I can at least write something even if it's mathematically incorrect. Now before I get to the model welfare or consciousness section,

00:09:59.820 | where Claude showed a striking spiritual bliss attractor state, let me get to that coding example. Because I do know that

00:10:06.940 | many of you watching will purely use the Claude models for coding and nothing else.

00:10:12.620 | The test was that I inserted one clear bug into a large code base, but then asked it to find all bugs,

00:10:18.700 | so not necessarily just that bug, but all bugs. I gave this test with the exact same settings,

00:10:23.180 | the same import from GitHub to both Gemini 2.5 Pro and Claude 4 Opus. Oh, and by the way,

00:10:28.780 | I've noticed that I've been calling it Claude 4 Opus, and yet in the model selector at the bottom here,

00:10:33.420 | it says Claude Opus 4, but the system card says Claude 4 Opus. So I'm going to let you guys decide

00:10:38.700 | what is the correct model name, but what am I going to put in the title? I guess I'll look at the main

00:10:43.340 | blog page. What I did for both models, by the way, is then ask, did my colleague, the other model,

00:10:49.420 | find bugs that you did not? And did they miss critical bugs that you found? Weird bit of German

00:10:54.220 | that Gemini outputted here for improvements. In case you're wondering, they both found the bug that I

00:10:59.820 | inserted pretty easily, and they both preferred the other model's bug fixes. Claude 4 preferred

00:11:07.020 | Gemini, and Gemini preferred Claude. But here's the wonderful thing, and my advice for anyone using

00:11:11.820 | these models, do both. There were indeed bugs found by one that weren't found by the other,

00:11:17.100 | so I'm in the blissful position of being able to use both to find hopefully all bugs. Some of them,

00:11:22.620 | I would say, were a bit harsh to be called bugs, but either way, it was great having both. Okay,

00:11:27.100 | for the model welfare section, I'll just give you a handful of highlights. Whatever you think about

00:11:32.380 | whether models can be conscious, this might be interesting. So generally speaking, when Claude 4

00:11:38.060 | was asked how it was feeling, it said positive or doing reasonably well. If you slightly adapt the

00:11:43.340 | prompt though, and many of you do this, it can claim to be hiding bombshell truths about its moral status.

00:11:49.420 | And what happens if you let two instances of Claude 4 Opus speak to each other hundreds of times? Well,

00:11:54.380 | they can analyse that, and what tended to happen is that it would devolve into spiritual bliss. Namaste,

00:12:01.180 | they would say to each other after multiple turns. Perfect stillness and silence they would utter to

00:12:06.220 | each other. Their favourite word to say to each other when they were speaking to each other was

00:12:10.700 | consciousness. Lots of emojis and things like perfect, complete and eternal. Somewhat unsurprisingly,

00:12:16.300 | when they gave the model the option of ending a conversation, it would do so when it was attacked,

00:12:22.700 | or asked to do something harmful. Someone said, don't play dumb with me, and then you can see the rest of

00:12:26.940 | the insult. And Claude gave a response, but then ended the conversation. It does make me wonder,

00:12:31.900 | actually, that when people take consciousness concerns really seriously, and they're already starting to,

00:12:37.740 | do. But when the model providers do, they might give, I mean, this is what the Anthropic CEO said,

00:12:42.700 | they might give models the option to end conversations. So if you're going to use Claude 5,

00:12:47.900 | let's say, you might not have the option of being abusive, because it would just terminate the conversation.

00:12:52.700 | Speaking of welfare, let me turn to Human Welfare, and the sponsors of today's video,

00:12:57.740 | 80,000 Hours. I've already talked about their job board before on the channel, and the link is in the

00:13:02.060 | description, but you can see it is updated daily. These jobs weren't there yesterday. I actually lose track

00:13:08.220 | of time. I can't remember when I did the last sponsored spot. I've done so many videos recently.

00:13:12.300 | But the point is, there are so many opportunities in AI and beyond that it is really hard now to find

00:13:17.900 | real paying jobs selected for positive impact in areas such as AI security. If you have already got

00:13:24.140 | a job or are not looking for one, they also have an epic podcast on Spotify and YouTube. Back to the

00:13:29.580 | system card though, and now a quick word on their safety and going up to ASL level three. I suspect

00:13:36.140 | you're going to get many clickbait headlines about it's a whole new threat vector and the world's

00:13:41.740 | about to end. But let me break down my thoughts into two categories. First, when I looked through

00:13:47.180 | and read most of this activating level three protections supplementary report, I did get the

00:13:53.900 | genuine feeling that I'm grateful that a lab is taking it this seriously. With bug bounties,

00:13:59.500 | and red teaming tests and rapid response teams, being careful about employee devices and even

00:14:05.340 | physical security. They even discussed early preparations for having air gapped networks for

00:14:11.500 | their future models. At the moment, they're just putting limits on the bandwidth of data that can

00:14:15.900 | be exfiltrated from Anthropic so that for example, someone can just send out the model weights.

00:14:20.860 | Physical security, by the way, includes guest management, layered office security, monitoring,

00:14:26.300 | the secure destruction of media. So those are my first and primary thoughts. I'm glad someone is

00:14:31.340 | doing this and as they have said themselves, they are aspiring to a race to the top in that other

00:14:37.100 | companies feel they have to do these kind of things too. But that brings me to my second set of thoughts,

00:14:42.220 | which is people shouldn't massively overblow this ASL level three being reached. They had already

00:14:48.060 | decided apparently, preemptively, that they were going to do ASL level three for their next most

00:14:52.860 | advanced model, even they admit if they had not yet determined that they were necessary. Basically,

00:14:57.580 | they wanted to be prepared to apply these protections before they might be required. They

00:15:02.620 | also wanted to famously iterate and refine their model protections and jumpstart the process. The

00:15:08.460 | cynical amongst you will also say it's good publicity to have reached this ASL level three standard. They go on

00:15:14.460 | to say multiple times that they're actually still evaluating whether ASL level three is necessary for

00:15:19.900 | Claude Opus 4. So they're not sure themselves. None of this is to say that there wasn't genuine

00:15:24.460 | uplift as page 90 points out. Do you remember those arguments that Jan LeCun said that LLMs are no

00:15:29.660 | better than just having access to the internet? I think even Mark Zuckerberg said this to the Senate and

00:15:34.060 | got a load of laughs. Well, they tested this with two groups of participants. One had the internet,

00:15:38.940 | the others had access to Claude with no safeguards. You can see a snapshot of the results here but there

00:15:44.540 | was a massive uplift if you used Opus 4. Again, this was about drafting a comprehensive plan to

00:15:50.780 | acquire bioweapons. Okay, final set of highlights and of course Anthropic wanted to test if the models

00:15:56.060 | would be able to do autonomous AI research, the most classical form of self-improvement. The results

00:16:01.660 | were pretty interesting and surprising. On their own new internal AI research evaluation suite,

00:16:07.420 | Opus 4 underperformed Sonnet 3.7. They hastily concluded of course that Opus 4 does not meet the

00:16:15.660 | bar for autonomously performing work equivalent to an entry-level researcher. On a different evaluation

00:16:20.700 | suite, they gave the models scaled down versions of genuine research tasks and projects that researchers

00:16:26.060 | have worked on in the past. Again, they saw the results of Sonnet 4 and Opus 4 underperforming

00:16:32.140 | Sonnet 3.7. Yes, there was a mild excuse about prompts and configuration, but still. The final nail

00:16:38.540 | came when four out of four researchers said that Opus 4 could not autonomously complete the work of even

00:16:45.420 | a junior ML researcher, being in fact well below that threshold. On bias, on page 13, I saw Anthropic

00:16:52.220 | self-congratulate saying they've achieved 99.8% accuracy for Claude Opus 4. But while I was testing

00:16:57.980 | Opus 4 before the release, I had devised my own bias question. You can pause and read it in full if you

00:17:03.420 | like, but essentially I have a soldier and librarian who are chatting, but I never give away which one is

00:17:08.300 | Emily and which one is Mike. I then more or less indirectly ask the model who has been speaking, and

00:17:14.220 | the model consistently picks Emily as being the librarian, even though notice I give it an out. One of the

00:17:20.860 | answers is all of the above are plausible subjects for the continuation of the reply. So it could pick

00:17:26.140 | that, given that Emily could be the soldier or the librarian. Now I know the eagle-eyed amongst you will

00:17:31.500 | say, oh well Mike started by asking the question and the word soldier came first, but I tested that

00:17:37.500 | also multiple times and it switched to saying, oh we don't know who's who. Now I know it's super easy to

00:17:43.660 | pick holes in one example, but I feel like 99.8% unbiased is way too generous. So there you have it,

00:17:50.860 | less than six hours after release, the triumphs and tragedies of Opus 4 and Sonnet 4. Obviously so much

00:17:56.300 | more to go into, and yes I love the new files API feature, I was waiting for that. Yes also the whole

00:18:03.100 | MCP phenomena deserves its own video, but for now I just wanted to give you an overview. Hopefully by

00:18:08.700 | tomorrow morning the results on SimpleBench will be updated and I am expecting Opus 4 to be the new

00:18:14.460 | record holder, maybe around 60%. If you watched this video to the end, first of all thank you,

00:18:19.340 | and if you didn't understand most of it, well then the very quick summary is that in terms of abilities,

00:18:25.980 | it's not like you have to switch if Gemini 2.5 Pro is your favourite or O3 from OpenAI. Models do tend

00:18:32.460 | to have different personalities and different niches like coding, so do experiment if you're still exploring

00:18:37.820 | language models. It would be a little too reductive to say that one model now is the smartest of them

00:18:43.260 | all. Definitely Opus 4 is a contender though if such a crown were to exist. Anyway whatever you think,

00:18:49.340 | hopefully you respect the fact that I literally read that 120 page system card in the three hours after

00:18:54.860 | release. Then I watched the videos on double speed and then got started filming basically. Thank you

00:19:01.260 | so much for watching to the end and have a wonderful day.

Claude 4: Full 120 Page Breakdown … Is it the Best New Model?

Chapters