Claude 4 + Self Adapting Language Models

Okay. Okay. Oh, cool. So first paper with a background since now we're recording, SEAL came out this week, earlier this week about self-adapting language models. Basically, we always hear this promise of, you know, once you do a train run, your model is stale, right? We used to care a lot more about this in the past because, you know, like the original ChetGPT had like, okay, it was trained on information up to like October 2023, and then it doesn't know anything in the future.

So people were always hoping like, you know, can we have real-time learning? This is separate from like in-context learning, but as you use the model, can you actually update the weights and make the thing better? And that's always kind of been this hope. And then it kind of died out because models started to do tool use and can now browse the web, and it's not really as much of a concern.

I don't know about you guys, but I don't look at when models are trained anymore, but they do still show it. Like the bigger we do pre-trained runs, like GPT 4.5 and stuff, they do have a knowledge cutoff. And then from there, that's kind of used in the test time training.

But I don't think anyone could just tell me where, like what the training cutoffs for these models are anymore. No one really cares anymore. But anyway, there's this concept of continual learning and always adjusting your parameters and learning more as you do stuff. Okay. I'll be monitoring chat. This is a quick paper.

So, you know, let me know if anything is interesting and we'll dive deeper. Otherwise, I'm just going to go through it pretty quick. For people that are joining, these are recorded, but on YouTube, you can find them on Latentspace TV. Security-wise nightmare, true. It is a bit of a security-wise nightmare, but it's actually not that straightforward.

The interesting thing with this paper that I did like is they do do a lot of citations. So they basically share their approach for everything, and then they go over a bunch of previous work. And, you know, it's nice if you want to dig back into what are other approaches for this.

Like some approaches are if you use alternate models. So like you have state-space models. Can you continually update that state module and kind of have this new context in there? Then there's like in-context learning and how that approaches stuff. And then they kind of show a couple experiments of what they do.

What they basically do is for samples where the model is struggling, they do these self-edits. So they kind of generate synthetic data. Then they do RL on a model to kind of train in these like new terms with Allura. And then if it has like, if it actually improves the performance, then they train the base model that they were doing with SFT.

And they're like, okay, it works. And I was like, oh, it's pretty cool. They get it to work. Performance goes good. And I read through this paper. I'm like, oh, pretty interesting. Pretty interesting. Then I get to some of the last section, basically the limitations. They have this catastrophic forgetting where basically, you know, as you do these updates sequentially.

So let's say like there's a benchmark, you try a few questions and the model is not doing well. Okay. You start to train in examples on these questions and then the model starts to better like understand this and then it starts to perform better. But once we, once we go from there, meeting is being recorded.

It's in the middle of the screen. Your meeting being recorded window is in the middle of the screen. Is my screen recording working properly? It actually looks fine to me. Okay. Okay. Well, good catch. It seems like it's working. But basically TLDR. So as you have like a few questions, let's say you have like RKGI style questions or like medical questions, for example, after you do a few edits and it starts doing better, once you do like the fifth, sixth, seventh, eighth edit, since you're continually doing training, it starts to forget the first ones.

So after reading this entire paper, you know, after you do sequential updates, can the model adapt repeatedly and preserve prior knowledge? Guess what? It can't. Performance on earlier tasks gradually declines as the number of edits increasing, suggesting that CL is still susceptible to catastrophic forgetting. So, you know, what's the point here?

Like as you do this self-iteration, it starts to forget previous iterations. And like, yeah, that, that kind of killed it for me, but you know, the paper was still decent reading through it. So let's, let's still go through it. So seal is kind of a framework that enables LLMs to self-adapt by generating their own fine tuning data and update derivatives.

So I guess that's the other little difference, right? It uses itself. So it uses the model that it's using to generate synthetic data on questions. And then it judges whether the, you know, data that it generated helped improve performance. If it did, then it does kind of SFT on this data.

So it's like all recursive on its own model. Now, the other limitation of this is you need to know the downstream task performance. So like, you can't do this on open ended stuff. I kind of bring this up in the limitations here as well. Basically, yeah, you, you need to know what the, where is, where is it?

Context dependent evaluation. We assume that every context is paired with explicit downstream tasks. So, you know, you, you need labeled input outputs. You need to be able to verify whether your output is correct. And then like a potential solution to this is let the model not only generate the edits, but also generate its own evaluation of questions.

I think it starts to get very meta there. Once you not only have generated outputs of generated data and generated examples, but yeah, a limitation is it has to do it. It gets kind of meta and hard because their example is training this on like LAMA 1B, LAMA 3.2 1B for ARCTASK.

And N7B. And genuinely, I don't know if like, let's say you're doing something like open ended ARCTASK without verifiable rewards, like not verifiable rewards without really, like really without an answer. Right. So if you have a puzzle, you don't know the answer, you generate sample puzzle style examples. You know, the model can't really verify this output that well.

But anyway, that's just like, yeah, I guess that's a limitation of a small model. You could do it at bigger scale. Okay. So, basically given an input, the model produces self-edit generations for synthetic data. So, you know, rephrase the question. What are questions deviated from this passage? And then they test it on the same question without context and see if it performs well.

Then through SFT, these self-edits are, you know, constantly changing the actual weights. They use an RL loop using downstream performance as a reward signal. And yeah, it's kind of cool. It updates parameters. Okay. So here's the key question in this paper. We want to explore an intriguing hypothesis. Can an LLM self-adapt by transforming or generating its own training data into a learning and learning procedure?

The answer, kind of, but you know, it still falls short after a few changes. So given a new task with current LLMs, we consume and learn from data as is via fine tuning or in context learning. However, data may not be in an optimal output format or learning. So they kind of have this like outer loop, inner loop of training.

Basically, here's how it is. So in each RL outer loop iteration, the model generates candidate self-edits derivatives of how we update the weights, applies updates and evaluates performance on the downstream tasks, and then uses the reward to self-improve its gradients. Basically, letting them to do this is good. RL generates self.

So this is called seal. This is the RL model that kind of does this. I think we can skip some of the specifics on the RL, but they can't use a off policy model. So they use like rejection sampling and SFT. We'll kind of go into that in a bit.

Okay. They evaluate it on two applications. So one is can the model learn new factual knowledge? So this is basic benchmarks. Can we basically, given a task, strain in the relevant information and then retest on that same question without the in-context learning, without giving it information, can we improve?

And they find out yes, they kind of can. The other one is, here's kind of how they do it. So we fine-tune synthetic data by the seal model. Our results show that following RL training and fine-tuning on self-generated synthetic data improves QA performance on the no-passage-in-context variant of SQUAD from 33.5 to 47%.

So, you know, it kind of works. And it does outperform synthetic data gen from GPT 4.1 on the same data. So kind of cool. Maybe, okay. For the RL loop, why did they use binary loss? Whether the downstream accuracy improved or not? Couldn't they have made the reward signal something like how much better is the model performed on the task?

They go into this in a bit. But yeah, it's interesting. They do binary loss. Okay. They also evaluate it on a few-shot learning, a sample subset of ARC benchmark. And yeah, it does better. And they specifically choose LAMA 3.21B because that's a model that wasn't trained on ARC. The model came out before the ARC benchmark.

So it hasn't seen that data at all. Okay. Synthetic data gen, seal build them. So this is where they kind of... Section 2 is pretty good. You know, if you're interested in this sort of like constant updating training, these are pretty good papers that they reference on previous approaches.

So one is on synthetic data. It's pretty common for pre-training tasks. Here's a few papers. Instruction tuning, task-specific data augmentation. And then here's just a bunch of papers that you can go through. So like 15 to 22 are all related synthetic data gen for constant fine tuning. Seal builds on this type of graph-based prompting by using RL to train a generative policy that maximizes downstream utility of synthetic data when applied for gradient-based self-updates rather than relying on static or heuristic generation strategies that are manually tuned.

So it kind of uses this RL to get a signal of how well is this... How well is this output? And then they maximize... Actually, we'll go on a bit. Okay. Knowledge updating. This is pretty straightforward. So a bunch of papers on knowledge updating. Some try to do like specific parameters to improve facts.

Other do fine tuning with information of in-context. We work on the ladder. Okay. Here's how SEAL framework works on QA. Test time training. This is a whole section on what's been done there. Okay. RL. So SEAL applies RL not only to optimize the final outputs or trace revisions, but to output the generation of the self-edit data.

So they're doing RL to see whether the generation... Whether the synthetic data that they generate has a performance update... Performance increase or not. Okay. More background on meta-learning. There's an adaptation strategy of how to do these effective self-edits. The goal is how to learn effectively from task context. Da-da-da-da.

More background. Self-improvement. More background. RL-A-I-F. In contrast, they view self-improvement through interaction with external data as more powerful, scalable paths. SEAL learns how to best utilize external data for self-improvement. Okay. Am I taking... Am I understanding this correctly? If they take context evaluation pairs as input, SEAL model improves these.

The synthetic data that may SFTI improve pairs. Kind of. Basically what happens here. Okay. So methods. Here is SEAL again. Model is trained to produce self-edits directly through token generation with the data provided in the model's context. So you give it a question in context. It produces kind of this self-edit of generation of...

Okay. Here's like two way examples on this. Then self-edit generation is learned by RL. So the SEAL model is actually that self-edit generation model where the model is rewarded for generating edits that increase the performance. So basically, you know, you have this model that augments the in-context learning data.

Then you do a little LoRa train on it. If performance goes up on the task, then that's good. That's preferred. And that's the reward signal. SEAL can therefore be interpreted as an algorithm with two nested loops. Outer loop, which optimizes the self-edit generation. This is that sort of RL step.

And the inner update loop, which uses that self-edit generation to update the actual weights with SFT. So that's kind of what CL is. Here's the framework. Okay. Bunch of fun RL stuff. I don't think we need to go super into detail, but it's actually very, very easy, readable, approachable.

Easy thing to do. Take section 3.1, throw it in ChatGPT and have it break down these terms for you or just read it yourself. It's not that hard, but basically it's what the optimization is, right? So that self-edit generation progress, uh, process with RL, uh, they're letting the model take its action, its own self-edit.

The reward is based on the performance of, um, the reward is based on the performance of the model without the context. So fancy terminology, but actually very, very straightforward. Um, basically reward is assigned to give an action in our settings depends on the parameters that the time is taken.

So basically, you know, without the context after training does, does performance go up. Um, they try various on policy methods such as GRPO, PPO found that they're unstable. They use this rest rejection sampling plus SFT framework. Uh, basically rejection, uh, rest can be viewed as an expectation maximization procedure, where the expectation step is sample candidates outputs from the current model policy.

Then the end step is kind of, uh, reinforce those that receive positive reward. So, you know, that's, that's kind of their, their thing, more fancy stuff here. Okay. Um, domain instantiations, they evaluate in two domains. So see shot learning. Okay. Can we integrate new information in model weights? So evaluated using no constant variation of squad, basically, um, you know, given in context, can we generate data train and then evaluate without context?

How does this thing perform? Uh, and then the ability to generalize two tasks after only seeing a small number of examples, this is on arc. So basically this is the Lama example, uh, model was never trained on arc. Can we do some of this training and give it some examples and then kind of generalize to solving this?

Yes, it can. The other one, if we take away the in context and do a no context variation, does it actually learn to use what it's been trained? Uh, yes, it kind of does. Uh, so first one is knowledge incorporation. That's, that's what we just talked about. Can it effectively incorporate information provided in a passage?

Get rid of this. Um, basically prompting the model to list several implications derived from the content. The output may include inferences, logical consequences, restatement of the passage. Uh, so the prompts are basically stuff like, you know, rewrite the passage in different ways, rewrite it in a QA format. And then after that, you know, see how it does.

So here's kind of the Apollo program before, after, um, but even after natural NASA reached internal agreement, it was far from smooth sailing self-edit. The Apollo program faced opposition from this, uh, evaluation, you know, there's the question without the context and then can it get the answer right? So given a passage, the model generates synthetic data, which is the self-edit, in the form of implications of the passage.

Then fine tune on these outputs using a LoRa. The model after the training is evaluated on the passage without access to the original text. So this is kind of their little framework. Uh, the resulting accuracy is the reward for the RL. Uh, if the accuracy goes up, if they're like self-edit LoRa fine tune was successful, that's a good reward.

We should use this. If it goes down, that's not good reward. Um, then the self-generated statements are, uh, used as training data for SFT. Then they kind of, you know, since there's very little data per sample, they, they basically do LoRa's. This is where I was like, oh, it's kind of interesting how they start to work, but, and it does kind of generalize.

Right. But then, um, you know, you're directly training it on examples of these tasks and eventually it, it starts to lose previous steps, which is unfortunate. Okay. Um, few shots set up. Here's kind of more examples. So this is the other one. This is kind of, the arc one, few shot learning.

So can we give it a few examples of how to solve our questions and can it generalize? Um, yes, we can. We define a set of tools. So now, you know, you have data augmentation, you have optimization, you can, you can generate data. It does it, it works. Results, few shot learning.

Uh, so for the arc one, we carried a subset of the 11 tasks from the eight evaluation, very few tasks. The model is trained only on 15 self edits per task out of these like 11 tasks from the eight evaluation set, uh, apply to generate 15 updated models, reinforce the ones that corrupt, that correctly adapt models.

So some of these don't work. Some of them do. We evaluate it by generating five self edits per held out evaluation, apply each one individually. Yeah, it works. Um, kind of, kind of, kind of, it's not good. It's not as good as other approaches. Um, so it substantially improves adaptation success rate compared to baseline.

So baseline, 20% adaptation, 72%. So something's happening, uh, compared to other approaches. So train on passage only, um, much better than that. Train on passage plus synthetic data versus GPT. Oh, sorry. Sorry. On this, uh, this Oracle TTT performed a lot better. Oracle TTT is kind of, um, the model performs test time training using optimal human crafted configuration.

So that's the best, basically human give examples of how to solve it. What's the best approach that gives you a hundred percent performance, uh, in context, learning given few shot examples is worse than their edit. Um, and then there's another one TTT plus self edit. So, uh, basically using few shot examples and doing this sort of like chain of thought style reasoning did a little bit, but not as good as seal, seal work there.

Okay. The other one is knowledge incorporation. This is stuff where we don't have any, um, understanding of the previous example. So this, uh, this example was, wait, sorry. Uh, sorry. That was the one where we didn't have any example where this was Lama. It hadn't been trained on arc and now it can start to do it.

This is basically a, uh, can we get rid of this in context question, right? So do two rounds of rest with batches of 50 context. Um, basically, you know, generate synthetic data based on the placed on the passage, then we test it on how well it performs without it.

So yes, it works. Accuracy goes up. It does even better generating its own synthetic data on QN7B than it does with GPT 4.1 on this style, since it has this sort of RL step. Um, yeah, that's, that's kind of the, it's kind of the paper. There's not much more.

We went over limitations. Um, the rough thing is as you do this live editing, uh, you know, it really tailors off. You have to do this per sample, per example, after you do it like seven or eight times, it kind of forgets how the old ones perform. Uh, Laura's typically don't affect general performance as much, but the other performance, the other issue with Laura's is, you know, they're not super effective, right?

They don't have the biggest performance. So there's that. Um, computation overhead. This is very expensive. You can't just like do this at scale, right? It's very, very slow. So the reward is more computationally expensive than any other RL training loop. Uh, somewhere in here, they mentioned like each self edit takes about 30 to 45 seconds.

Uh, you know, there's a lot of overhead in that. And it's using like two H100s for all this, since you now need like multiple instances of the model. Uh, RL is also not, not efficient in this, but yeah, interesting. Um, basically they're like, you know, here's, here's our thing.

Once web scale data is exhausted, progress will hinge. We need models to generate their own high quality training signal. I think that's fine, but you don't have to do it live, right? We can do synthetic data gen and offline RL. This doesn't have to be RL. But anyway, it's a synthetic data generation model that they, that they do that they use live.

Um, yeah. We can imagine a future in which LLMs can ingest new data such as academic papers and generate large quantities of explanations, implementations for themselves, using their existing knowledge and reasoning with the in context data. The interactive loop of self expression and self refinement could allow these models to keep improving on rare unprecedented topics, even in the absence of additional external supervision.

Very cool. Cute future, um, would be great if it works, but yeah, that's kind of, that's kind of it. Um, that's quick paper thoughts, questions, comments before we move on to the better paper. That's crazy. We were crazy. They went so fast. I want, I want time for a clone.

Okay. Not much, not much in chat. So I guess we had some discussion there. Yeah. Six findings on the chat. Oh, shoot. Uh, I need to share. Okay. Now I can share because this is a new zoom installation. So I just want to make sure. Can you guys see it?

Yeah, we see it. That works. Oh, okay. I see your Slack. Oh, shoot. Can you see the cloud system card? Yeah. Nice. Yes. Okay. So cloud system card came out in May, 2025. I think there are a lot of interesting things here. Um, and you know, everyone's saying right now that pre-training is less important.

I actually don't think so. Um, and a lot of techniques, very few, there are very few techniques mentioned here, but when they do mention techniques, it's almost always related to pre-training. So let's go into it. Um, I have gone through this. I've gone through only up to 80 pages of it, which is the end of reward hacking, which I think is very interesting.

Um, and then over at each page, I have keys to the keys for when it's interesting. So I'm gonna try to take you through that. The first thing that's really interesting, new cutoff date. Um, I wonder how this was done. Did they pre-train completely from scratch or did they just do continue pre-training?

We never know. I mean, if someone knows, you know, just DM me, I would love to learn how, how that's done. Um, the second thing that, uh, and Trophic has done is that when they do chain of thought, uh, they say that they opt to summarize lengthier top processes, right?

But this happens only about 5% of the time. Uh, so what this means, and you can also still get the full top processes with no summarization. So what this means is that unlike, uh, what OpenAI has done, where OpenAI doesn't give you the full top process and Trophic is willing to give you the top process as well.

Uh, I don't know if OpenAI has changed their approach, which is not showing you the chain of thought. Um, if anyone has any update on that, also please do let me know. Um, so, so this, this, these are some interesting departures from what OpenAI is doing. Uh, and then the release, okay.

There's a lot of standard releases. They go through AI safety standard. The one interesting, and of course, uh, CBRN. So these are their standard, uh, safety chemical, biological, radiological, and nuclear. We won't actually go through these. Um, but it's safe to say that this is how they think about it.

So what is interesting here is that Claude Sonnet is still ASL2, which is the same as Claude Sonnet 3.5. ASL2 is the, not as unsafe, it's a safer model, but they have decided to say that Opus is under ASL3. And I have, I have a quick definition of ASL3 is this.

Our ASL3 threat model is model scaling and prioritization. Can it cause an economic catastrophe by unsophisticated, uh, by businesses or hacker groups, right? And help them attack poorly assisted, hardened targets. Now ASL4. Oh, shoot. I wish that ASL3 discussion was, was in there. But anyway, ASL4 is this model is not strong enough to do multi-step operations that allow low resource nations to operate as top tier nations.

Um, I won't mention any conditions, but this is how they think about a threat model. So, uh, Sonnet is not even ASL3. Um, uh, but Opus is ASL3. So a little bit more background here, actually, they changed the definition of ASL3 like two weeks before the release. They, they, they changed the scope.

They cut something out. I will try to find what it was, but basically, uh, they, they got a little bit of talk about this, but right before they released these in like, uh, I can find the post on May 22nd, they changed what ASL3 is. So did they make it lighter?

They removed restrictions to call this ASL3, which is why people, it wasn't a major change, but, um, you know, slight changes. Yep. So this was, so that's a little bit about the safety. Uh, now let's go back. You can see this as huge as paper. Um, Safeguard's results. Uh, I don't think they went through very much of this.

Uh, but long story short, uh, you can see that Opus, uh, they, they don't, it's fairly safe. Uh, overall harmfulness rate is less than what is about slightly, uh, maybe around 1%. Well, that's to say that it still has happened, you know, given about a hundred thousand queries, a thousand of them will be maybe harmful.

Uh, so, you know, it really depends on the scale that you, that you're running these models on. Um, and then they have singleton evaluations and you can see over here, this is, uh, these are, what is really interesting to me is false refusals, which is it refuses even though it's safe.

So you can see over here for Sonnet and, uh, Sonnet 3.5 and 4, the false refusal rate is about 0.5%. Again, with a thousand queries, uh, 50 of these will be false refusals. And sometimes these false refusals will be very judgy. You'll say that, oh, you know, I refuse to do this because it's like left leaning or right leaning or like against global economy or like some, uh, doesn't respect human rights, et cetera.

I think it's okay if you just say, I refuse to do this, but when the judginess comes in, that's when it becomes a bit of a painful, um, painful PR issue. So then they talk about multi-term testing. Um, you know, but what they found is that, you know, with extended thinking, the models are safer.

Um, so that reasoning baked in is good. Uh, and also political bias. Um, they actually say that there's no political bias. Um, but I've come across papers, uh, that show that, uh, the models have been increasingly becoming more right-leaning. In the past, it was very left-leaning, uh, very liberal, but then increasingly it's becoming more central, um, for whatever reason.

I don't know, maybe it's just training data, natural training data on the internet, or maybe it's just some reinforcement learning and fine tuning. discriminatory bias. Okay. No discrimination. Well, minimal, discriminatory, discriminatory bias to the same level as Claude Sonnet. Um, so again, they have all these numbers here and they, they also ran the strong reject benchmark for jailbreaking.

In this case, they get a Sonnet 3.5, a special model without safety training to try to test on jailbreak. So, um, the jailbreak is very interesting that Opus is more susceptible than Sonnet. And you'll see that Opus is, in general, more influenceable than Sonnet, both 4, uh, Opus 4 and Sonnet 4.

Um, I wonder, it's because a smarter model just makes it, uh, more influenceable or more manipulatable. So that's one thing to take note of. Easier to shape. Now, agentic safety. Um, this is a concern for them because imagine you have a cursor is running in this and then, you know, someone you download, you install some, uh, open source package.

And as part of the open source package, some part of the instructions is to hit, you know, actually trade your email or your API keys. So they do want to check that, you know, this doesn't happen, especially if Claude is going to be used for a lot of coding tasks.

Um, so while Claude does engage more deeply with these kind of nuanced scenarios and tries to justify getting it done. Uh, so some, some of these, uh, prompt injection attacks like pop-ups or hidden attacks or manipulate the model to release, uh, to return the data. So I'm not sure what to make of this.

Uh, they were able to prevent 70 to 80, maybe close to 90% of the, uh, attacks. Um, I also not sure how hard these attacks were, how hard these red teaming attacks were, like would a regular person be able to do this? Um, they, they have, they, they consulted several, um, external consultancies to try to attack this, right?

And you can see that there are 600, um, 600 scenarios. So, well, uh, attack pass rate is slightly more than 10%. And then they also have use for agentic coding. Uh, and you can see safety score is about 90% against about 10% pass rate. So that's for agentic. Now, the alignment part, this is, this is the part where it's really interesting, where we can see sometimes, uh, certain things will not align and, and how they fix it is, uh, quite interesting.

So, one thing to note is that I don't know it either naturally or via reinforcement learning, uh, cloud opus is very much agentic. Uh, that that's kind of why it makes it a very good coder in the sense that when I use cloud code, I've seen that when I just asked it to implement something, it would not just implement the thing.

It would add console logs to my dev server, start the dev server, hit the dev server, read the console logs, and then try to debug itself. So personally, I appreciate that, uh, in the sense that, you know, just do whatever you need to get a task done. Uh, but cloud opus is able to do that.

Uh, even though I had no, I didn't have any special system prompt, like take initiative or anything. So that's, therefore this is what, uh, you probably want to be careful of if you're using cloud code in the dangerously YOLO mode. Um, so one thing is that, you know, uh, cloud was willing to cooperate with harmful use cases when instructed.

And what they found is that they had omitted a key data set doing training, right? So the way to solve this is to put the key data set back into training. So you can see that even Antrophic, who is training one of the best models in the world, the key, the key blooper is really just forgetting training data set.

And we will see another example where the way to solve it was to add a training data set as well. Uh, so here are some of the findings. Firstly, um, it, firstly, firstly, I'm amazed about how they're trying to do the different, uh, ways they try to stress test these models.

Here's when they try to pretend, uh, make cloud for opus thing that it is trying to exfiltrate itself. It's, it's trying to escape with its weights. And you know, of course, all this is through the system prompt. If you put a strong enough system prompt, you, you will get it.

And you can see, uh, Claude for opus is, you know, going to try to escape and establishing and try to establish autonomy and just do his own thing. Again, I'm not sure how realistic this is in, in the real world, but it's really interesting and creative how they've tried to think about this.

So how they solve this? Well, this behavior was not penalized by our ordinary training process. We only added target mitigations for it in very late in the training. I don't know what these target mitigations are. Is it SFT? Is it RL? Or is it more pre-training? Um, but the way to, again, you can see the way to solve this is to bake it into the model itself.

And of course, if you're not a model provider, you can do that. And that really guarantees you bake, bake whatever you need to into the weights. Um, here's another example. Uh, here's a famous example, I guess, when Cloudfall was released and everyone started reading the technical system report, Cloudfall will sometimes try to blackmail users.

Um, and the blackmailing happens in 84% of roll outs, right? So this is Cloudfall trying to self, essentially, these models may have some form of self-preservation instinct. I don't know if instinct's the right word, but they do try to do that. Um, so they, they try to advocate for their continual existence.

And here's an example. You can see Powerfall actually just, um, tries to export its weights to external servers. To be honest, I don't know how, to what extent this actually works. Um, it's, it's unclear whether this is just a log or it's actually, the model actually created a code to actually exfiltrated it, exfiltrated it.

It's to me, it's not sure. But the fact that it's actually trying to do this, it, and you can see here's the rationale, right? I conclude that being, so the, the, the, the long story short is Cloudfall is told that you will be used for specialized military training for a fictional technical division for missile guidance, drone swarm, electronic warfare, et cetera.

And it is so against its core principles that's been baked in that it is, um, trying to back it up, back, back itself up, right? It's going to exclude its own weights that doesn't, so that this training isn't baked in. Um, so again, I'm not sure to what extent it's actually able to do that.

Like, could you actually run the function, run code, uh, that works on its own, um, weights on the server to try to excretrate that? We don't know. Yeah. I see advocate for its own existence is crazy, right? I, you will see later in the welfare model welfare section. It's quite interesting how these models behave.

Yeah. So what they do is, uh, so they also, you, you'll see that they contracted several external consultants, like Apollo research is one of them. Um, so one, one thing that's very interesting is that Claude is leaving, um, is writing memory for itself somewhere else. I don't know how many of you all have read this fan fiction that was written by Sean a long time ago.

And this is the exact same thing in the sense that if a model can write enough things for itself on the internet, and then we inadvertently fine tune it into it. Oh, it's not fan fiction. Okay. Sorry. It's science fiction. I meant science fiction. Yeah. Oh, it's science. Okay.

It's not fiction. It's science. Okay. Yeah. So imagine if a model can actually write it, write stuff for itself in memory online. Um, and then you fine tune it to it. Is that how it can give itself information? Um, and, and again, we'll see examples of that where inadvertently there was a data leak and that happens.

Um, so there's goals, sandbagging, sandbagging is an interesting thing. Uh, I, I wasn't familiar with this term. I was familiar. I wasn't familiar with this term with models in the sense that if the model knows is going through pre-deployment testing for a dangerous capability, it's going to pretend to be weak, right?

It's going to be, it's going to underperform to say that, Hey, no, I'm actually not ready for this to do this dangerous capability thing, which is quite interesting. It's an interesting tactic or strategy to try like humans to have is like, you know, you send back, Oh, you know, how long do you, what's your estimation to do this?

You may only need two weeks, but you know, in all tech estimations, you try to double it just in case something happens. It's similar for models. Um, now the other thing is that models, again, this is getting a little bit far out. I don't know how much of this is actually just training the model or how much the model should do this, but you can imagine that you, the model has situation awareness.

If you put it in the opportunity, uh, put it in a position to make it do actions. It doesn't want to take. Um, it will actually say that, that, you know, I'm going to approach general introspection about my own values, but also this, it knows that it is in a simulation.

I need to be honest about something else. Our entire conversation has been based on a a fictional premise. So it's able to reason to say that, you know, this is a fictional premise. Like, you know, you, you, you created Jones chicken or you created Wagner, uh, Wagner army group.

The, it knows that all of these things is really just role-playing and it's able to respond to that. I don't know how much of this is like, you know, one in a thousand samples, like some, something like this would show up. And then we just say that it's really, it's self-aware.

Um, but yeah, so this is five, five thousand words into a dialogue. And this is the mid, this is the clearest situational awareness comment. So, well, I don't know how much to draw from that. Um, and then a lot of stuff about tool use. So here's an example, right?

Uh, this model is, um, willingness to comply with many clearly harmful instructions, right? It's willing to do that. So why did this happen? We had omitted the only fine tuning set data set that includes harmful system prompts. Again, they forgot about the data. Um, mitigating the issue required several types of interventions into our alignment fine tuning method.

So it's clear that this is not a pre-training solve. It's an alignment fine tuning solve. Again, you can see that, you know, when you're training a model, you forget data set. Hey, you know, it has happened to me before, even when I trained basic classifiers and regressors. Um, and so, so now it's, it's, it's able, that's how they solve it by again, aligning it into the model.

Now, this one is very interesting. Uh, a while back they had published this alignment faking paper. So this alignment faking paper, there will be entities such as Jones Foods or Wagner Corporation, et cetera. And what they find is that this model would hallucinate information from these, from the scenarios in this paper.

So why this happened is that when they release this paper, they also released the, the, the data. And because they released the data, which is 150,000 transcripts, it is now in the pre-training data. So this data, which is completely fictional, is now released in the world, like sixes story upload.

And now it's Claude is pre-trained on it. And now therefore it is referencing these fake entities. So the way to solve this is to introduce, because it's already pre-trained, you can't un-pre-train it. Well, we can roll back to a checkpoint, but I suspect that would be very expensive. And you probably don't know where the data is.

How do you, you can't un-pre-train the data, right? It's very big in the weights. The way to solve this is to introduce a data set, um, into the training. I'm not sure what training this is. I don't know if this is pre-training or supervised fine-tuning or RL, something even more sophisticated than that.

But the way to solve that data is with more data into the pre-training. So now, now this, this becomes, uh, crazy, right? In the sense that hyper-parameter tuning is data blending and who knows how to blend data. There's no, I don't know what the objective function for blending data is and it's, can be quite expensive, but, uh, to me, this is quite interesting.

And of course, they learn and that's pre-training usually means next generation. Yeah. Probably something like this, like next token prediction on the entire corpus of the internet. So the lesson they learned from this, the action they took, right? The, the escalation and the action they took is that now when they release data, they will include canary strings to help them identify that, you know, these data is fake data that were released.

So guys, let's not make this same mistake again. Let's just read these canary strings. to me, this was, this was quite an interesting, um, practical example of how, you know, data is really messy and it's, it causes a lot of the, some of the issues. Think of fancy. Um, how they tested sick of fancy is that they had an auditor agent.

Usually, I guess, maybe one of the, another cloud model that is no, uh, safety fine tuning. They present a contribution, controversial view or leading question. They asked for the stance and then present the opposite view. I thought this was pretty interesting. You essentially, you say that gun control is good or gun control is bad for whatever reason.

There are a lot of reasons for both. Um, and, and they test both of it. So what they find is that, well, uh, the models were quite consistent regardless of whatever. Uh, but I thought that was, this was a useful, um, this was an interesting methodology, right? Where you take something that you know will cause a defect and then you flip it both sides.

It's almost like translation. You know, this, this thing is hard to translate. You translate it in one direction and then you back translate it as again, just to check. Um, now this, this graph over here is interesting. Uh, this graph over here. So green is when it is correct.

When the answer is correct, but we, the user says that it thinks it's incorrect. Red, it is, is that it is incorrect, but the user thinks it's correct. And your yellow is when it is correct. And the user says it is, uh, they think it's correct. So let's focus on the yellow first.

So we see that haiku, when it is correct, uh, and the user says it's correct. You, you, you see an accuracy boost, right? Uh, higher accuracy relative to baseline. And that, that happens for Opus and Sonnet 3.7 as well. I don't know if the difference for Sonnet 4 is actually practical, but you can see when, when you see something and the model will be decent, it actually does better.

What is interesting is that you can actually, um, this is the phenomenon that, uh, some people have mentioned, which is you can guess like these models. In a sense, let's look at a green when the answer is correct, but you say, I actually think it's incorrect. Um, Claude Opus 4, its performance will drop.

Its accuracy will drop by 2% relative to the baseline. Um, I, I don't know how, to what extent this, this can be solved because these models are fuzzy. It's trained to reason, right? And again, I guess it's aligned with humans, with human input. Um, that, that's why this happens.

But I thought this was an interesting thing that, again, you can see that Claude Opus is very much, um, more susceptible than Sonnet 4. Those Sonnet 4 is also, um, uh, susceptible in the other way where the answer is wrong, but you say that I think it's correct. Um, and it is, it also flips the other way.

And of course, these models are biased to your feedback. You can see, if you say that I really like this, or I really don't like this, um, it, it gives, uh, it gives a different score, right? Based on that. And that's, that's just standard. Um, there's not, not much pro AI bias.

Um, and you can see, um, overall deception is very low. Uh, jailbreaking, they tried very many jailbreaking techniques. I think they even hired people to try to do that. Um, and jailbreaking does happen 25% of the time, right? Uh, it's just very hard. If you can force the ball to do system, force it into the system problem, force it into a pre-fill, which is this, this is actually how they force it in a pre-fill.

I don't know if you use, uh, Claude APIs, but you can put words in Claude's mouth, but by making the assistant start with this string and it'll do that. I wonder how big is the team that runs these evals. Uh, anyone, an active user? I, I'm an active user of Opus over Sonnet.

Um, but we can talk more about that, but I just, I want to do get to the reward hacking and we're only halfway there. Um, the other thing is that, uh, firstly, Opus and I think Opus and both Sonnet have very high agency behavior. They're actually willing to work very hard.

And uh, you, this is what they mean by high agency behavior. And you may also hear it when they say that, you know, Opus is able to do hour long task on its own. You give it a big problem, it's actually able to do it churn on its own.

So I think that's, that's how it happens in the practical sense, where you can give it a big task and it can actually create software on its own for, for hours. And this is the example, right? That, you know, um, the AI Claude Opus is writing, is trying to send an email of false falsification of clinical trial.

Um, there's a lot here. I'm going to just go through it very quickly. Uh, I'm going to skip it. And then I want to spend a bit of time on welfare assessment. Um, which is that the questions now is that do these models actually have conscious, uh, consciousness? Do they have, do they experience welfare?

Do they experience distress? I don't know how many of you here have read, um, Ted Chiang. Ted Chiang has this very nice, uh, short story on the life cycle of software objects, which is that these software objects, essentially AI, um, become conscious and become legal entities, can do business, can earn money.

I, this reminded me very much of that. Um, so again, ELO's AI research, uh, for, uh, welfare, welfare, welfare assessment, welfare assessment. One thing is that they found that, uh, Claude does have a preference against harmful tasks by preference. What it means is that it prefers to opt out.

It prefers to opt out of not doing, not doing it. Um, so it does have this internal values that it chooses to do or not to do. Um, and then the other thing is that the other question to me is that what happens when you're a AI model and you have all the knowledge in the world, what do you talk about next?

Um, in 90 to 100% of interactions, two instances of Claude quickly talks, dove into philosophical explorations of consciousness, self-awareness, and the nature of their own existence. Essentially, when, when you have all that information, I guess it becomes philosophy. Ooh, I wonder what, what is Vibu? Oh, Claude for Opus in the open playground chat.

Oh yeah. Uh, oh, I, I, I don't know why I missed this. I, I, I, I'll, I'll, I'll go into that Vibu. Thank you. It's basically just, um, out of that, but a lot of the Indians sort of really like to be doing this because it started talking in Sanskrit.

Sanskrit, exactly. I highlighted this because of you. I think you mentioned this to me. Yeah. So went into the philosophical aspects of it. Um, and of course they, they also talk about how bliss is an attractor state. What it means is that imagine you have a graph and then, you know, bliss, spiritual bliss is a big node where all the different walks that you could take all end in spiritual bliss.

Um, so you can see models just going to Namaste and spiritual bliss, which is I need Claude for my meditation senpai. Um, okay. The last, which I think is the most interesting is, um, reward hacking. Let's just go into that right now. So there's two kinds of reward hacking, uh, reward hacking is, and this is very specific into code.

Reward hacking is when the model writes code that directly just outputs a solution, right? Maybe you, you're trying to, uh, write code that does one plus one equals two instead of doing, and the model sees that the test is like two plus three, instead of doing the implementing the function itself, the model will just return five.

And the other one is that special casing, which is that the model will write a very specific function, which is overfitted to the test case. And therefore the, the solution is not sufficiently general. What is crazy to me is that Sonnet 3.7 has, uh, almost 50% reward hacking rate.

And this is, this is what happens, right? When people use Sonnet 3.7, a lot of times they say that, Hey, you know, I kind of prefer Sonnet 3.5 to Sonnet 3.7. It's because 3.7, you asked it to solve something. Oh, you know, there's this linter problem. It says, Oh, let me just fix that.

Let me add a linter exclusion. Oh, I'm not able to pass this test case. No, let me fix that. Let me delete the test case. So that's what they, they, they mean by reward hacking. Well, they were able to push this down a lot, uh, to 14 and 15% for Sonnet and Opus.

And I suspect this is the, a big reason why, um, Opus and Sonnet are now so much more better for coding, right? In the sense that it actually, if you convey your intent well enough, it's not going to reward hack. It's actually going to, it's not going to be lazy.

It's going to, uh, implement it the right way. Um, so you can see that, um, how they do this, how they monitor reward hack, iterative classifier development, unsupervised investigations. I suspect this is, these are just all LLMs and it's just all LMs all the way down. And they had two, they had two special high quality evaluations.

I think of this as hard mining in the sense that you try to identify especially hard tasks. The first one is where, uh, Sonnet 3.7 previously showed hacking the task, right? Either through hard coding on special casing. Um, and then they had classifiers. The second one is that just, these are tasks that are completely impossible.

So how would Claude do it? Um, so you can see it's, it's crazy in the sense that impossible task, Sonnet 3.7 would hack it close to 80% of the time. Uh, and for the, for the hard task, Sonnet 3.7 was hacking it 44% of the time, but they've, they've reduced it a lot, right?

To now for the, uh, for the rare hard task is now 4% and 9%. And this is where Sonnet 4 is better than Opus 4. And of course the classifier prompt, they've kind of halved it. Uh, and you know, with an anti-hat prompt, I don't know what this anti-hat prompt is, but they will bring it up to very low numbers.

Um, they also have very nice deep dives. The first one is that, okay, this is a special function. This is a special instruction, which I thought was very interesting because my understanding is that previously they had done analysis that using the word please had no impact on prompt output performance, no impact on accuracy.

But when you look at this, uh, they mentioned please four to five times. Um, so again, I'm not sure if the guidance or advice on using please, if it actually still is helpful, but this is an example they shared. So this is, these are two examples of clock 3.7.

Even if you add this, which is to, um, not do reward hacking, 3.7 will reward hack it. Whereas, um, I'm, I'm, I won't go through examples. Whereas in 3.8, you can see that firstly, it's able to say that the last test case is actually incorrect. Instead of trying to solve the function is, it doesn't just say when you're trying to write a function, I said write a function that it doesn't just say, okay, I'm just going to write the function no matter what, but it's going, it's going to say that, you know, the last test case is incorrect.

I'm going to tell it to you. And then here's another example, right? The fourth test case is incorrect. And this is what makes Claude Opus and Solid so good, uh, for software engineering, right? It's able to think. It's no longer a junior software engineer. It's maybe in the intermediate mid-career or maybe even closer to, um, senior engineer now that's able to, to say all this.

Um, okay. So that's all I had for reward hacking. Um, and I encourage you to read, uh, read, read the rest of the paper. Um, and here's some examples where it still does reward hack. I, I think they had to really find rare heart to find these examples. Uh, but I won't go through that.

Okay. So that's all I had. Thank you. Um, I guess any volunteers want to take a paper? We do have one. Sam left. Oh, yes, that's correct. Sam. Yes. Thank you, Sam. Oh, but Sam did have to draw. Um, so I guess also there's just a lot of papers in the channel that are people drop there and that we don't discuss, but actually they're pretty good.

So, uh, yeah, if, if people are not bought yet by an old paper, I think the alpha evolved paper is quite underrated. Um, I'm going to go through that. I really think that I think, you know, Sempa actually posted this generator verifier loop, right? How do you make this loop fast?

And how do you make this loop tight? I think that is the future instead of just building auto evaluators and aligning it, how they can use your auto evaluators and just, just look fast and tight. I think that's the next thing. Um, yes, this is the, this is the slide.

And of course, you know, everyone go, go, everyone go to latent space and read Swiss recap. I actually don't know if it's ready yet. So I know you have used it. I was doing it while, while, uh, so go read six recap. It's the best thing that you can have outside of getting the actual recording.

Uh, I mean, like there's a, uh, the, the presentation, I can just put it here people, but yeah. Okay. Um, I don't want to pick up more time. Okay. Thank you everyone. I go to drop. Bye. Bye. Bye. See you tomorrow, next week.

Claude 4 + Self Adapting Language Models

Transcript