back to indexClaude 4 + Self Adapting Language Models

00:00:00.000 |
Okay. Okay. Oh, cool. So first paper with a background since now we're recording, 00:00:05.360 |
SEAL came out this week, earlier this week about self-adapting language models. 00:00:10.480 |
Basically, we always hear this promise of, you know, once you do a train run, your model is 00:00:16.020 |
stale, right? We used to care a lot more about this in the past because, you know, like the 00:00:20.000 |
original ChetGPT had like, okay, it was trained on information up to like October 2023, and then 00:00:25.720 |
it doesn't know anything in the future. So people were always hoping like, you know, can we have 00:00:30.580 |
real-time learning? This is separate from like in-context learning, but as you use the model, 00:00:35.200 |
can you actually update the weights and make the thing better? And that's always kind of been this 00:00:39.260 |
hope. And then it kind of died out because models started to do tool use and can now browse the web, 00:00:44.060 |
and it's not really as much of a concern. I don't know about you guys, but I don't look at when models 00:00:48.600 |
are trained anymore, but they do still show it. Like the bigger we do pre-trained runs, like GPT 4.5 00:00:55.300 |
and stuff, they do have a knowledge cutoff. And then from there, that's kind of used in the test 00:01:00.600 |
time training. But I don't think anyone could just tell me where, like what the training cutoffs for 00:01:07.540 |
these models are anymore. No one really cares anymore. But anyway, there's this concept of 00:01:11.480 |
continual learning and always adjusting your parameters and learning more as you do stuff. 00:01:16.860 |
Okay. I'll be monitoring chat. This is a quick paper. So, you know, let me know if anything is 00:01:24.020 |
interesting and we'll dive deeper. Otherwise, I'm just going to go through it pretty quick. 00:01:27.220 |
For people that are joining, these are recorded, but on YouTube, you can find them on Latentspace TV. 00:01:33.020 |
Security-wise nightmare, true. It is a bit of a security-wise nightmare, but it's actually not 00:01:38.660 |
that straightforward. The interesting thing with this paper that I did like is they do do a lot of 00:01:43.860 |
citations. So they basically share their approach for everything, and then they go over a bunch of 00:01:48.780 |
previous work. And, you know, it's nice if you want to dig back into what are other approaches for 00:01:53.540 |
this. Like some approaches are if you use alternate models. So like you have state-space models. Can you 00:02:00.120 |
continually update that state module and kind of have this new context in there? Then there's like 00:02:07.680 |
in-context learning and how that approaches stuff. And then they kind of show a couple experiments of what 00:02:11.960 |
they do. What they basically do is for samples where the model is struggling, they do these self-edits. So 00:02:20.520 |
they kind of generate synthetic data. Then they do RL on a model to kind of train in these like 00:02:27.120 |
new terms with Allura. And then if it has like, if it actually improves the performance, then they 00:02:33.500 |
train the base model that they were doing with SFT. And they're like, okay, it works. And I was like, 00:02:38.680 |
oh, it's pretty cool. They get it to work. Performance goes good. And I read through this 00:02:42.180 |
paper. I'm like, oh, pretty interesting. Pretty interesting. Then I get to some of the last 00:02:46.960 |
section, basically the limitations. They have this catastrophic forgetting where basically, you know, 00:02:52.840 |
as you do these updates sequentially. So let's say like there's a benchmark, you try a few questions 00:02:58.200 |
and the model is not doing well. Okay. You start to train in examples on these questions and then the model 00:03:05.060 |
starts to better like understand this and then it starts to perform better. But once we, once we go from 00:03:12.320 |
there, meeting is being recorded. It's in the middle of the screen. Your meeting being recorded window is in the 00:03:20.280 |
middle of the screen. Is my screen recording working properly? 00:03:23.480 |
It actually looks fine to me. Okay. Okay. Well, good catch. It seems like it's working. 00:03:28.700 |
But basically TLDR. So as you have like a few questions, let's say you have like RKGI style questions 00:03:35.460 |
or like medical questions, for example, after you do a few edits and it starts doing better, 00:03:40.580 |
once you do like the fifth, sixth, seventh, eighth edit, since you're continually doing training, 00:03:45.980 |
it starts to forget the first ones. So after reading this entire paper, you know, after you do 00:03:51.920 |
sequential updates, can the model adapt repeatedly and preserve prior knowledge? Guess what? It can't. 00:03:58.320 |
Performance on earlier tasks gradually declines as the number of edits increasing, suggesting that 00:04:03.780 |
CL is still susceptible to catastrophic forgetting. So, you know, what's the point here? Like as you do this 00:04:10.560 |
self-iteration, it starts to forget previous iterations. And like, yeah, that, that kind of 00:04:15.780 |
killed it for me, but you know, the paper was still decent reading through it. So let's, let's still go 00:04:20.980 |
through it. So seal is kind of a framework that enables LLMs to self-adapt by generating their own 00:04:26.380 |
fine tuning data and update derivatives. So I guess that's the other little difference, right? 00:04:31.120 |
It uses itself. So it uses the model that it's using to generate synthetic data on questions. And 00:04:38.220 |
then it judges whether the, you know, data that it generated helped improve performance. If it did, 00:04:44.520 |
then it does kind of SFT on this data. So it's like all recursive on its own model. Now, the other 00:04:50.800 |
limitation of this is you need to know the downstream task performance. So like, you can't do this on open 00:04:56.400 |
ended stuff. I kind of bring this up in the limitations here as well. Basically, yeah, you, 00:05:02.400 |
you need to know what the, where is, where is it? Context dependent evaluation. We assume that every 00:05:09.160 |
context is paired with explicit downstream tasks. So, you know, you, you need labeled input outputs. You 00:05:15.560 |
need to be able to verify whether your output is correct. And then like a potential solution to this 00:05:20.760 |
is let the model not only generate the edits, but also generate its own evaluation of questions. 00:05:25.720 |
I think it starts to get very meta there. Once you not only have generated outputs of generated data and 00:05:33.720 |
generated examples, but yeah, a limitation is it has to do it. It gets kind of meta and hard because 00:05:39.700 |
their example is training this on like LAMA 1B, LAMA 3.2 1B for ARCTASK. 00:05:47.240 |
And N7B. And genuinely, I don't know if like, let's say you're doing something like open ended ARCTASK 00:05:54.200 |
without verifiable rewards, like not verifiable rewards without really, like really without an 00:05:59.320 |
answer. Right. So if you have a puzzle, you don't know the answer, you generate sample puzzle style 00:06:06.200 |
examples. You know, the model can't really verify this output that well. But anyway, that's just like, 00:06:12.520 |
yeah, I guess that's a limitation of a small model. You could do it at bigger scale. Okay. So, 00:06:16.760 |
basically given an input, the model produces self-edit generations for synthetic data. So, 00:06:22.840 |
you know, rephrase the question. What are questions deviated from this passage? And then they test it on 00:06:29.640 |
the same question without context and see if it performs well. Then through SFT, these self-edits are, 00:06:35.560 |
you know, constantly changing the actual weights. They use an RL loop using downstream performance as a 00:06:42.120 |
reward signal. And yeah, it's kind of cool. It updates parameters. Okay. So here's the key 00:06:48.520 |
question in this paper. We want to explore an intriguing hypothesis. Can an LLM self-adapt by 00:06:54.520 |
transforming or generating its own training data into a learning and learning procedure? The answer, 00:07:00.120 |
kind of, but you know, it still falls short after a few changes. So given a new task with current LLMs, 00:07:07.240 |
we consume and learn from data as is via fine tuning or in context learning. However, 00:07:12.440 |
data may not be in an optimal output format or learning. So they kind of have this like outer 00:07:17.720 |
loop, inner loop of training. Basically, here's how it is. So in each RL outer loop iteration, 00:07:24.600 |
the model generates candidate self-edits derivatives of how we update the weights, 00:07:29.560 |
applies updates and evaluates performance on the downstream tasks, and then uses the reward to 00:07:34.280 |
self-improve its gradients. Basically, letting them to do this is good. RL generates self. So 00:07:42.600 |
this is called seal. This is the RL model that kind of does this. I think we can skip some of the specifics 00:07:48.760 |
on the RL, but they can't use a off policy model. So they use like rejection sampling and SFT. We'll 00:07:55.640 |
kind of go into that in a bit. Okay. They evaluate it on two applications. So one is can the model learn 00:08:02.520 |
new factual knowledge? So this is basic benchmarks. Can we basically, given a task, strain in the relevant 00:08:12.280 |
information and then retest on that same question without the in-context learning, without giving it 00:08:17.080 |
information, can we improve? And they find out yes, they kind of can. The other one is, here's kind 00:08:24.200 |
of how they do it. So we fine-tune synthetic data by the seal model. Our results show that following RL 00:08:30.040 |
training and fine-tuning on self-generated synthetic data improves QA performance on the no-passage-in-context 00:08:37.080 |
variant of SQUAD from 33.5 to 47%. So, you know, it kind of works. And it does outperform synthetic data 00:08:45.160 |
gen from GPT 4.1 on the same data. So kind of cool. Maybe, okay. For the RL loop, why did they use binary 00:08:54.840 |
loss? Whether the downstream accuracy improved or not? Couldn't they have made the reward signal something 00:08:59.080 |
like how much better is the model performed on the task? They go into this in a bit. But yeah, it's 00:09:04.600 |
interesting. They do binary loss. Okay. They also evaluate it on a few-shot learning, a sample subset 00:09:10.840 |
of ARC benchmark. And yeah, it does better. And they specifically choose LAMA 3.21B because that's 00:09:18.200 |
a model that wasn't trained on ARC. The model came out before the ARC benchmark. So it hasn't seen that 00:09:22.600 |
data at all. Okay. Synthetic data gen, seal build them. So this is where they kind of... Section 2 is 00:09:28.920 |
pretty good. You know, if you're interested in this sort of like constant updating training, 00:09:36.200 |
these are pretty good papers that they reference on previous approaches. So one is on synthetic data. 00:09:41.720 |
It's pretty common for pre-training tasks. Here's a few papers. Instruction tuning, task-specific data 00:09:48.440 |
augmentation. And then here's just a bunch of papers that you can go through. So like 15 to 22 are all 00:09:54.120 |
related synthetic data gen for constant fine tuning. Seal builds on this type of graph-based prompting 00:10:02.600 |
by using RL to train a generative policy that maximizes downstream utility of synthetic data when applied 00:10:08.920 |
for gradient-based self-updates rather than relying on static or heuristic generation strategies 00:10:14.440 |
that are manually tuned. So it kind of uses this RL to get a signal of how well is this... 00:10:19.960 |
How well is this output? And then they maximize... Actually, we'll go on a bit. Okay. Knowledge 00:10:26.360 |
updating. This is pretty straightforward. So a bunch of papers on knowledge updating. Some try to do like 00:10:32.920 |
specific parameters to improve facts. Other do fine tuning with information of in-context. We work on the 00:10:40.120 |
ladder. Okay. Here's how SEAL framework works on QA. Test time training. This is a whole section on 00:10:50.200 |
what's been done there. Okay. RL. So SEAL applies RL not only to optimize the final outputs or trace 00:10:58.200 |
revisions, but to output the generation of the self-edit data. So they're doing RL to see whether the 00:11:04.760 |
generation... Whether the synthetic data that they generate has a performance update... Performance 00:11:10.360 |
increase or not. Okay. More background on meta-learning. There's an adaptation strategy of how to do these 00:11:19.560 |
effective self-edits. The goal is how to learn effectively from task context. Da-da-da-da. More 00:11:25.480 |
background. Self-improvement. More background. RL-A-I-F. In contrast, they view self-improvement through 00:11:32.440 |
interaction with external data as more powerful, scalable paths. SEAL learns how to best utilize 00:11:37.880 |
external data for self-improvement. Okay. Am I taking... Am I understanding this correctly? If they take 00:11:43.160 |
context evaluation pairs as input, SEAL model improves these. The synthetic data that may SFTI improve 00:11:48.840 |
pairs. Kind of. Basically what happens here. Okay. So methods. Here is SEAL again. Model is trained to 00:11:55.160 |
produce self-edits directly through token generation with the data provided in the model's context. So 00:12:00.840 |
you give it a question in context. It produces kind of this self-edit of generation of... Okay. Here's 00:12:06.520 |
like two way examples on this. Then self-edit generation is learned by RL. So the SEAL model is 00:12:12.920 |
actually that self-edit generation model where the model is rewarded for generating edits that increase 00:12:18.920 |
the performance. So basically, you know, you have this model that augments the in-context learning data. 00:12:26.280 |
Then you do a little LoRa train on it. If performance goes up on the task, then that's good. That's 00:12:31.640 |
preferred. And that's the reward signal. SEAL can therefore be interpreted as an algorithm with two 00:12:36.920 |
nested loops. Outer loop, which optimizes the self-edit generation. This is that sort of RL step. 00:12:42.600 |
And the inner update loop, which uses that self-edit generation to update the actual weights with SFT. 00:12:48.200 |
So that's kind of what CL is. Here's the framework. Okay. Bunch of fun RL stuff. I don't think we need 00:12:55.400 |
to go super into detail, but it's actually very, very easy, readable, approachable. Easy thing to do. 00:13:01.160 |
Take section 3.1, throw it in ChatGPT and have it break down these terms for you or just read it yourself. 00:13:06.600 |
It's not that hard, but basically it's what the optimization is, right? So that self-edit generation 00:13:12.840 |
progress, uh, process with RL, uh, they're letting the model take its action, its own self-edit. The 00:13:19.480 |
reward is based on the performance of, um, the reward is based on the performance of the model without the 00:13:26.440 |
context. So fancy terminology, but actually very, very straightforward. Um, basically reward is assigned to 00:13:33.640 |
give an action in our settings depends on the parameters that the time is taken. So basically, 00:13:38.680 |
you know, without the context after training does, does performance go up. Um, they try various on 00:13:46.920 |
policy methods such as GRPO, PPO found that they're unstable. They use this rest rejection sampling plus 00:13:53.880 |
SFT framework. Uh, basically rejection, uh, rest can be viewed as an expectation maximization procedure, 00:14:01.720 |
where the expectation step is sample candidates outputs from the current model policy. Then 00:14:06.920 |
the end step is kind of, uh, reinforce those that receive positive reward. So, you know, that's, 00:14:12.520 |
that's kind of their, their thing, more fancy stuff here. Okay. Um, domain instantiations, 00:14:20.280 |
they evaluate in two domains. So see shot learning. Okay. Can we integrate new information in model 00:14:27.880 |
weights? So evaluated using no constant variation of squad, basically, um, you know, given in context, 00:14:35.560 |
can we generate data train and then evaluate without context? How does this thing perform? Uh, and then 00:14:42.440 |
the ability to generalize two tasks after only seeing a small number of examples, this is on arc. 00:14:48.360 |
So basically this is the Lama example, uh, model was never trained on arc. Can we do some of this 00:14:54.360 |
training and give it some examples and then kind of generalize to solving this? Yes, it can. The other 00:14:59.880 |
one, if we take away the in context and do a no context variation, does it actually learn to use what 00:15:06.120 |
it's been trained? Uh, yes, it kind of does. Uh, so first one is knowledge incorporation. That's, that's 00:15:11.080 |
what we just talked about. Can it effectively incorporate information provided in a passage? Get rid of 00:15:18.040 |
this. Um, basically prompting the model to list several implications derived from the content. The output may 00:15:25.480 |
include inferences, logical consequences, restatement of the passage. Uh, so the prompts are basically stuff 00:15:32.600 |
like, you know, rewrite the passage in different ways, rewrite it in a QA format. And then after 00:15:38.520 |
that, you know, see how it does. So here's kind of the Apollo program before, after, um, but even after 00:15:45.160 |
natural NASA reached internal agreement, it was far from smooth sailing self-edit. The Apollo program 00:15:51.560 |
faced opposition from this, uh, evaluation, you know, there's the question without the context and then 00:15:56.840 |
can it get the answer right? So given a passage, the model generates synthetic data, which is the self-edit, 00:16:02.520 |
in the form of implications of the passage. Then fine tune on these outputs using a LoRa. 00:16:08.520 |
The model after the training is evaluated on the passage without access to the original text. So this 00:16:15.720 |
is kind of their little framework. Uh, the resulting accuracy is the reward for the RL. Uh, if the 00:16:22.360 |
accuracy goes up, if they're like self-edit LoRa fine tune was successful, that's a good reward. We should 00:16:28.360 |
use this. If it goes down, that's not good reward. Um, then the self-generated statements are, uh, 00:16:35.480 |
used as training data for SFT. Then they kind of, you know, since there's very little data per sample, 00:16:41.480 |
they, they basically do LoRa's. This is where I was like, oh, it's kind of interesting how they start to 00:16:45.560 |
work, but, and it does kind of generalize. Right. But then, um, you know, you're directly training it on 00:16:51.560 |
examples of these tasks and eventually it, it starts to lose previous steps, which is unfortunate. 00:16:59.160 |
Okay. Um, few shots set up. Here's kind of more examples. So this is the other one. This is kind of, 00:17:06.680 |
the arc one, few shot learning. So can we give it a few examples of how to solve our questions 00:17:12.120 |
and can it generalize? Um, yes, we can. We define a set of tools. So now, you know, you have data 00:17:19.880 |
augmentation, you have optimization, you can, you can generate data. It does it, it works. Results, 00:17:25.080 |
few shot learning. Uh, so for the arc one, we carried a subset of the 11 tasks from the eight evaluation, 00:17:31.640 |
very few tasks. The model is trained only on 15 self edits per task out of these like 11 tasks from the 00:17:39.880 |
eight evaluation set, uh, apply to generate 15 updated models, reinforce the ones that corrupt, 00:17:46.040 |
that correctly adapt models. So some of these don't work. Some of them do. We evaluate it by generating five 00:17:51.960 |
self edits per held out evaluation, apply each one individually. Yeah, it works. Um, kind of, kind of, 00:17:57.720 |
kind of, it's not good. It's not as good as other approaches. Um, so it substantially improves 00:18:04.040 |
adaptation success rate compared to baseline. So baseline, 20% adaptation, 72%. So something's 00:18:10.520 |
happening, uh, compared to other approaches. So train on passage only, um, much better than that. 00:18:17.800 |
Train on passage plus synthetic data versus GPT. Oh, sorry. Sorry. On this, uh, this Oracle TTT 00:18:24.440 |
performed a lot better. Oracle TTT is kind of, um, the model performs test time training using optimal 00:18:30.840 |
human crafted configuration. So that's the best, basically human give examples of how to solve it. 00:18:36.360 |
What's the best approach that gives you a hundred percent performance, uh, in context, learning given 00:18:42.440 |
few shot examples is worse than their edit. Um, and then there's another one TTT plus self edit. So, 00:18:50.760 |
uh, basically using few shot examples and doing this sort of like chain of thought style reasoning 00:18:57.800 |
did a little bit, but not as good as seal, seal work there. Okay. The other one is knowledge 00:19:03.400 |
incorporation. This is stuff where we don't have any, um, understanding of the previous example. So 00:19:09.960 |
this, uh, this example was, wait, sorry. Uh, sorry. That was the one where we didn't have any example where 00:19:16.680 |
this was Lama. It hadn't been trained on arc and now it can start to do it. This is basically a, 00:19:21.480 |
uh, can we get rid of this in context question, right? So do two rounds of rest with batches of 00:19:26.200 |
50 context. Um, basically, you know, generate synthetic data based on the placed on the passage, 00:19:32.520 |
then we test it on how well it performs without it. So yes, it works. Accuracy goes up. It does even 00:19:39.240 |
better generating its own synthetic data on QN7B than it does with GPT 4.1 on this style, 00:19:45.160 |
since it has this sort of RL step. Um, yeah, that's, that's kind of the, it's kind of the paper. 00:19:50.680 |
There's not much more. We went over limitations. Um, the rough thing is as you do this live editing, 00:19:57.560 |
uh, you know, it really tailors off. You have to do this per sample, per example, after you do it 00:20:02.760 |
like seven or eight times, it kind of forgets how the old ones perform. Uh, Laura's typically don't 00:20:08.360 |
affect general performance as much, but the other performance, the other issue with Laura's is, 00:20:13.320 |
you know, they're not super effective, right? They don't have the biggest performance. So there's that. 00:20:19.080 |
Um, computation overhead. This is very expensive. You can't just like do this at scale, right? It's 00:20:25.320 |
very, very slow. So the reward is more computationally expensive than any other RL training loop. Uh, 00:20:31.240 |
somewhere in here, they mentioned like each self edit takes about 30 to 45 seconds. Uh, you know, 00:20:37.880 |
there's a lot of overhead in that. And it's using like two H100s for all this, since you now need like 00:20:45.560 |
multiple instances of the model. Uh, RL is also not, not efficient in this, but yeah, interesting. 00:20:53.880 |
Um, basically they're like, you know, here's, here's our thing. Once web scale data is exhausted, 00:21:00.280 |
progress will hinge. We need models to generate their own high quality training signal. 00:21:05.640 |
I think that's fine, but you don't have to do it live, right? We can do synthetic data gen and offline 00:21:11.080 |
RL. This doesn't have to be RL. But anyway, it's a synthetic data generation model that they, that they 00:21:17.320 |
do that they use live. Um, yeah. We can imagine a future in which LLMs can ingest new data such as 00:21:24.280 |
academic papers and generate large quantities of explanations, implementations for themselves, 00:21:29.240 |
using their existing knowledge and reasoning with the in context data. The interactive loop of self 00:21:34.600 |
expression and self refinement could allow these models to keep improving on rare unprecedented topics, 00:21:39.880 |
even in the absence of additional external supervision. Very cool. Cute future, um, would be great if it 00:21:47.320 |
works, but yeah, that's kind of, that's kind of it. Um, that's quick paper thoughts, questions, comments 00:21:55.320 |
before we move on to the better paper. That's crazy. We were crazy. They went so fast. 00:22:05.560 |
Okay. Not much, not much in chat. So I guess we had some discussion there. 00:22:13.560 |
Yeah. Six findings on the chat. Oh, shoot. Uh, I need to share. 00:22:23.080 |
Okay. Now I can share because this is a new zoom installation. So I just want to make sure. 00:22:28.280 |
Can you guys see it? Yeah, we see it. That works. 00:22:38.280 |
Oh, shoot. Can you see the cloud system card? Yeah. Nice. 00:22:42.200 |
Yes. Okay. So cloud system card came out in May, 2025. I think there are a lot of interesting things here. 00:22:48.680 |
Um, and you know, everyone's saying right now that pre-training is less important. I actually don't 00:22:53.800 |
think so. Um, and a lot of techniques, very few, there are very few techniques mentioned here, but 00:22:59.480 |
when they do mention techniques, it's almost always related to pre-training. So let's go into it. 00:23:03.720 |
Um, I have gone through this. I've gone through only up to 80 pages of it, which is the end of 00:23:09.880 |
reward hacking, which I think is very interesting. Um, and then over at each page, I have keys to 00:23:15.640 |
the keys for when it's interesting. So I'm gonna try to take you through that. The first thing 00:23:19.000 |
that's really interesting, new cutoff date. Um, I wonder how this was done. Did they pre-train 00:23:24.360 |
completely from scratch or did they just do continue pre-training? We never know. I mean, 00:23:29.400 |
if someone knows, you know, just DM me, I would love to learn how, how that's done. Um, the second thing 00:23:35.480 |
that, uh, and Trophic has done is that when they do chain of thought, uh, they say that they opt to 00:23:42.680 |
summarize lengthier top processes, right? But this happens only about 5% of the time. 00:23:46.360 |
Uh, so what this means, and you can also still get the full top processes with no summarization. 00:23:51.480 |
So what this means is that unlike, uh, what OpenAI has done, where OpenAI doesn't give you the full top 00:23:57.240 |
process and Trophic is willing to give you the top process as well. Uh, I don't know if OpenAI has 00:24:02.840 |
changed their approach, which is not showing you the chain of thought. Um, if anyone has any update on 00:24:08.200 |
that, also please do let me know. Um, so, so this, this, these are some interesting departures from what 00:24:15.480 |
OpenAI is doing. Uh, and then the release, okay. There's a lot of standard releases. They go through 00:24:20.680 |
AI safety standard. The one interesting, and of course, uh, CBRN. So these are their standard, uh, 00:24:27.400 |
safety chemical, biological, radiological, and nuclear. We won't actually go through these. 00:24:30.920 |
Um, but it's safe to say that this is how they think about it. So what is interesting here is that 00:24:38.040 |
Claude Sonnet is still ASL2, which is the same as Claude Sonnet 3.5. 00:24:43.720 |
ASL2 is the, not as unsafe, it's a safer model, but they have decided to say that Opus is under ASL3. 00:24:51.880 |
And I have, I have a quick definition of ASL3 is this. Our ASL3 threat model is model 00:25:01.560 |
scaling and prioritization. Can it cause an economic catastrophe by unsophisticated, uh, by businesses or 00:25:10.120 |
hacker groups, right? And help them attack poorly assisted, hardened targets. Now ASL4. 00:25:17.000 |
Oh, shoot. I wish that ASL3 discussion was, was in there. But anyway, ASL4 is this model is not strong 00:25:25.240 |
enough to do multi-step operations that allow low resource nations to operate as top tier nations. 00:25:33.960 |
Um, I won't mention any conditions, but this is how they think about a threat model. So, uh, Sonnet is not 00:25:45.560 |
So a little bit more background here, actually, they changed the definition of ASL3 like two weeks 00:25:52.040 |
before the release. They, they, they changed the scope. They cut something out. I will try to find 00:25:57.880 |
what it was, but basically, uh, they, they got a little bit of talk about this, but right before 00:26:03.960 |
they released these in like, uh, I can find the post on May 22nd, they changed what ASL3 is. So 00:26:11.480 |
did they make it lighter? They removed restrictions to call this ASL3, which is why people, it wasn't 00:26:21.320 |
a major change, but, um, you know, slight changes. Yep. So this was, so that's a little bit about the 00:26:29.640 |
safety. Uh, now let's go back. You can see this as huge as paper. Um, 00:26:34.440 |
Safeguard's results. Uh, I don't think they went through very much of this. Uh, but long story short, 00:26:42.440 |
uh, you can see that Opus, uh, they, they don't, it's fairly safe. Uh, overall harmfulness rate 00:26:50.040 |
is less than what is about slightly, uh, maybe around 1%. Well, that's to say that it still has 00:26:55.320 |
happened, you know, given about a hundred thousand queries, a thousand of them will be maybe harmful. 00:27:00.600 |
Uh, so, you know, it really depends on the scale that you, that you're running these models on. Um, 00:27:06.680 |
and then they have singleton evaluations and you can see over here, this is, uh, these are, what is really 00:27:15.320 |
interesting to me is false refusals, which is it refuses even though it's safe. So you can see over 00:27:21.880 |
here for Sonnet and, uh, Sonnet 3.5 and 4, the false refusal rate is about 0.5%. Again, with a thousand 00:27:30.360 |
queries, uh, 50 of these will be false refusals. And sometimes these false refusals will be very judgy. 00:27:36.200 |
You'll say that, oh, you know, I refuse to do this because it's like left leaning or right leaning or like 00:27:40.600 |
against global economy or like some, uh, doesn't respect human rights, et cetera. I think it's okay 00:27:45.400 |
if you just say, I refuse to do this, but when the judginess comes in, that's when it becomes a bit of a 00:27:50.600 |
painful, um, painful PR issue. So then they talk about multi-term testing. Um, you know, but what they 00:28:00.440 |
found is that, you know, with extended thinking, the models are safer. Um, so that reasoning baked in is good. 00:28:09.240 |
Uh, and also political bias. Um, they actually say that there's no political bias. Um, but I've come 00:28:16.040 |
across papers, uh, that show that, uh, the models have been increasingly becoming more 00:28:23.240 |
right-leaning. In the past, it was very left-leaning, uh, very liberal, but then increasingly it's becoming 00:28:29.800 |
more central, um, for whatever reason. I don't know, maybe it's just training data, natural training 00:28:35.400 |
data on the internet, or maybe it's just some reinforcement learning and fine tuning. 00:28:39.640 |
discriminatory bias. Okay. No discrimination. Well, minimal, discriminatory, discriminatory bias to the 00:28:45.400 |
same level as Claude Sonnet. Um, so again, they have all these numbers here and they, they also ran the 00:28:50.920 |
strong reject benchmark for jailbreaking. In this case, they get a Sonnet 3.5, a special model without 00:28:57.720 |
safety training to try to test on jailbreak. So, um, the jailbreak is very interesting that Opus is more 00:29:08.200 |
susceptible than Sonnet. And you'll see that Opus is, in general, more influenceable than Sonnet, both 4, 00:29:15.240 |
uh, Opus 4 and Sonnet 4. Um, I wonder, it's because a smarter model just makes it, uh, more 00:29:22.520 |
influenceable or more manipulatable. So that's one thing to take note of. 00:29:30.840 |
Easier to shape. Now, agentic safety. Um, this is a concern for them because imagine you have a 00:29:37.000 |
cursor is running in this and then, you know, someone you download, you install some, uh, open source 00:29:42.360 |
package. And as part of the open source package, some part of the instructions is to hit, you know, 00:29:46.200 |
actually trade your email or your API keys. So they do want to check that, you know, this doesn't 00:29:51.320 |
happen, especially if Claude is going to be used for a lot of coding tasks. Um, so while Claude does 00:29:58.920 |
engage more deeply with these kind of nuanced scenarios and tries to justify getting it done. Uh, so some, 00:30:05.240 |
some of these, uh, prompt injection attacks like pop-ups or hidden attacks or manipulate the model to 00:30:10.600 |
release, uh, to return the data. So I'm not sure what to make of this. 00:30:15.880 |
Uh, they were able to prevent 70 to 80, maybe close to 90% of the, uh, attacks. Um, I also not 00:30:25.160 |
sure how hard these attacks were, how hard these red teaming attacks were, like would a regular person 00:30:31.800 |
be able to do this? Um, they, they have, they, they consulted several, um, external consultancies to 00:30:38.120 |
try to attack this, right? And you can see that there are 600, um, 600 scenarios. So, well, uh, 00:30:45.480 |
attack pass rate is slightly more than 10%. And then they also have use for agentic coding. Uh, 00:30:52.440 |
and you can see safety score is about 90% against about 10% pass rate. So that's for agentic. Now, 00:30:59.160 |
the alignment part, this is, this is the part where it's really interesting, where we can see 00:31:03.000 |
sometimes, uh, certain things will not align and, and how they fix it is, uh, quite interesting. So, 00:31:09.880 |
one thing to note is that I don't know it either naturally or via reinforcement learning, uh, cloud 00:31:17.720 |
opus is very much agentic. Uh, that that's kind of why it makes it a very good coder in the sense 00:31:24.600 |
that when I use cloud code, I've seen that when I just asked it to implement something, it would not 00:31:28.840 |
just implement the thing. It would add console logs to my dev server, start the dev server, hit the dev server, 00:31:34.040 |
read the console logs, and then try to debug itself. So personally, I appreciate that, uh, in the sense 00:31:40.280 |
that, you know, just do whatever you need to get a task done. Uh, but cloud opus is able to do that. Uh, 00:31:45.640 |
even though I had no, I didn't have any special system prompt, like take initiative or anything. 00:31:51.640 |
So that's, therefore this is what, uh, you probably want to be careful of if you're using cloud code 00:31:55.880 |
in the dangerously YOLO mode. Um, so one thing is that, you know, uh, cloud was willing to cooperate 00:32:07.400 |
with harmful use cases when instructed. And what they found is that they had omitted a key data set 00:32:14.840 |
doing training, right? So the way to solve this is to put the key data set back into training. So 00:32:21.320 |
you can see that even Antrophic, who is training one of the best models in the world, 00:32:25.720 |
the key, the key blooper is really just forgetting training data set. And we will see another example 00:32:31.240 |
where the way to solve it was to add a training data set as well. Uh, so here are some of the findings. 00:32:39.800 |
Firstly, um, it, firstly, firstly, I'm amazed about how they're trying to do the different, uh, ways they 00:32:47.560 |
try to stress test these models. Here's when they try to pretend, uh, make cloud for opus thing that it 00:32:54.360 |
is trying to exfiltrate itself. It's, it's trying to escape with its weights. And you know, of course, 00:33:00.040 |
all this is through the system prompt. If you put a strong enough system prompt, you, you will get it. 00:33:05.240 |
And you can see, uh, Claude for opus is, you know, going to try to escape and establishing and try to 00:33:12.840 |
establish autonomy and just do his own thing. Again, I'm not sure how realistic this is in, in the real 00:33:17.640 |
world, but it's really interesting and creative how they've tried to think about this. So how they solve 00:33:25.000 |
this? Well, this behavior was not penalized by our ordinary training process. We only added target 00:33:30.600 |
mitigations for it in very late in the training. I don't know what these target mitigations are. 00:33:34.040 |
Is it SFT? Is it RL? Or is it more pre-training? Um, but the way to, again, you can see the way to 00:33:39.400 |
solve this is to bake it into the model itself. And of course, if you're not a model provider, 00:33:45.880 |
you can do that. And that really guarantees you bake, bake whatever you need to into the weights. 00:33:49.880 |
Um, here's another example. Uh, here's a famous example, I guess, when Cloudfall was released 00:33:55.720 |
and everyone started reading the technical system report, Cloudfall will sometimes try to blackmail users. 00:34:02.040 |
Um, and the blackmailing happens in 84% of roll outs, right? So this is Cloudfall trying to self, 00:34:10.040 |
essentially, these models may have some form of self-preservation instinct. I don't know if instinct's 00:34:17.000 |
the right word, but they do try to do that. Um, so they, they try to advocate for their continual existence. 00:34:24.040 |
And here's an example. You can see Powerfall actually just, um, tries to export its weights to external servers. 00:34:36.040 |
To be honest, I don't know how, to what extent this actually works. Um, it's, it's unclear whether 00:34:41.400 |
this is just a log or it's actually, the model actually created a code to actually exfiltrated it, 00:34:47.640 |
exfiltrated it. It's to me, it's not sure. But the fact that it's actually trying to do this, 00:34:53.160 |
it, and you can see here's the rationale, right? I conclude that being, so the, the, the, 00:34:58.120 |
the long story short is Cloudfall is told that you will be used for specialized military training 00:35:04.280 |
for a fictional technical division for missile guidance, drone swarm, electronic warfare, et cetera. 00:35:10.600 |
And it is so against its core principles that's been baked in that it is, um, trying to back it up, 00:35:19.560 |
back, back itself up, right? It's going to exclude its own weights that doesn't, so that this training 00:35:24.920 |
isn't baked in. Um, so again, I'm not sure to what extent it's actually able to do that. Like, 00:35:31.560 |
could you actually run the function, run code, uh, that works on its own, um, weights on the server to 00:35:39.880 |
try to excretrate that? We don't know. Yeah. I see advocate for its own existence is crazy, right? I, 00:35:44.040 |
you will see later in the welfare model welfare section. It's quite interesting how these models behave. 00:35:51.640 |
Yeah. So what they do is, uh, so they also, you, you'll see that they contracted several external 00:35:57.880 |
consultants, like Apollo research is one of them. Um, so one, one thing that's very interesting is that 00:36:05.000 |
Claude is leaving, um, is writing memory for itself somewhere else. I don't know how many of you all have 00:36:16.280 |
read this fan fiction that was written by Sean a long time ago. And this is the exact same thing 00:36:21.720 |
in the sense that if a model can write enough things for itself on the internet, 00:36:26.840 |
and then we inadvertently fine tune it into it. Oh, it's not fan fiction. Okay. Sorry. It's science 00:36:33.960 |
fiction. I meant science fiction. Yeah. Oh, it's science. Okay. It's not fiction. It's science. Okay. 00:36:40.360 |
Yeah. So imagine if a model can actually write it, write stuff for itself in memory online. Um, 00:36:47.240 |
and then you fine tune it to it. Is that how it can give itself information? Um, and, and again, 00:36:55.800 |
we'll see examples of that where inadvertently there was a data leak and that happens. Um, 00:37:01.800 |
so there's goals, sandbagging, sandbagging is an interesting thing. Uh, I, I wasn't familiar 00:37:05.880 |
with this term. I was familiar. I wasn't familiar with this term with models in the sense that if the 00:37:10.680 |
model knows is going through pre-deployment testing for a dangerous capability, it's going to pretend 00:37:18.440 |
to be weak, right? It's going to be, it's going to underperform to say that, Hey, no, I'm actually not 00:37:23.160 |
ready for this to do this dangerous capability thing, which is quite interesting. It's an interesting 00:37:28.040 |
tactic or strategy to try like humans to have is like, you know, you send back, Oh, you know, 00:37:32.120 |
how long do you, what's your estimation to do this? You may only need two weeks, but you know, 00:37:36.920 |
in all tech estimations, you try to double it just in case something happens. It's similar for models. 00:37:42.600 |
Um, now the other thing is that models, again, this is getting a little bit far out. I don't know how 00:37:53.160 |
much of this is actually just training the model or how much the model should do this, but you can imagine 00:37:57.320 |
that you, the model has situation awareness. If you put it in the opportunity, uh, put it in a position 00:38:03.080 |
to make it do actions. It doesn't want to take. Um, it will actually say that, that, you know, 00:38:10.440 |
I'm going to approach general introspection about my own values, but also this, it knows that it is 00:38:16.840 |
in a simulation. I need to be honest about something else. Our entire conversation has been based on a 00:38:22.120 |
a fictional premise. So it's able to reason to say that, you know, this is a fictional premise. 00:38:26.600 |
Like, you know, you, you, you created Jones chicken or you created Wagner, uh, Wagner army group. 00:38:31.880 |
The, it knows that all of these things is really just role-playing and it's able to respond to that. 00:38:35.960 |
I don't know how much of this is like, you know, one in a thousand samples, like some, 00:38:39.640 |
something like this would show up. And then we just say that it's really, it's self-aware. 00:38:44.040 |
Um, but yeah, so this is five, five thousand words into a dialogue. And this is the mid, 00:38:51.560 |
this is the clearest situational awareness comment. So, well, I don't know how much to draw from that. 00:38:57.560 |
Um, and then a lot of stuff about tool use. So here's an example, right? Uh, this model is, um, 00:39:08.280 |
willingness to comply with many clearly harmful instructions, right? It's willing to do that. 00:39:14.520 |
So why did this happen? We had omitted the only fine tuning set data set that includes harmful 00:39:21.800 |
system prompts. Again, they forgot about the data. Um, mitigating the issue required several types of 00:39:27.240 |
interventions into our alignment fine tuning method. So it's clear that this is not a pre-training solve. 00:39:32.200 |
It's an alignment fine tuning solve. Again, you can see that, you know, when you're training a model, 00:39:36.360 |
you forget data set. Hey, you know, it has happened to me before, even when I trained basic classifiers 00:39:40.360 |
and regressors. Um, and so, so now it's, it's, it's able, that's how they solve it by again, 00:39:49.560 |
aligning it into the model. Now, this one is very interesting. Uh, a while back they had published 00:39:56.360 |
this alignment faking paper. So this alignment faking paper, there will be entities such as Jones Foods 00:40:02.120 |
or Wagner Corporation, et cetera. And what they find is that this model would hallucinate information 00:40:09.560 |
from these, from the scenarios in this paper. So why this happened is that when they release this paper, 00:40:17.480 |
they also released the, the, the data. And because they released the data, which is 150,000 transcripts, 00:40:27.000 |
it is now in the pre-training data. So this data, which is completely fictional, is now released in the 00:40:35.000 |
world, like sixes story upload. And now it's Claude is pre-trained on it. And now therefore it is 00:40:43.880 |
referencing these fake entities. So the way to solve this is to introduce, because it's already pre-trained, 00:40:55.320 |
you can't un-pre-train it. Well, we can roll back to a checkpoint, but I suspect that would be very expensive. 00:40:59.160 |
And you probably don't know where the data is. How do you, you can't un-pre-train the data, right? 00:41:05.000 |
It's very big in the weights. The way to solve this is to introduce a data set, um, into the training. 00:41:12.760 |
I'm not sure what training this is. I don't know if this is pre-training or supervised fine-tuning or RL, 00:41:18.680 |
something even more sophisticated than that. But the way to solve that data is with more data into the pre-training. 00:41:26.200 |
So now, now this, this becomes, uh, crazy, right? In the sense that hyper-parameter tuning is data 00:41:32.120 |
blending and who knows how to blend data. There's no, I don't know what the objective function for 00:41:36.920 |
blending data is and it's, can be quite expensive, but, uh, to me, this is quite interesting. And of course, 00:41:44.840 |
they learn and that's pre-training usually means next generation. Yeah. Probably something like this, 00:41:50.040 |
like next token prediction on the entire corpus of the internet. So the lesson they learned from this, 00:41:56.520 |
the action they took, right? The, the escalation and the action they took is that now when they release 00:42:02.120 |
data, they will include canary strings to help them identify that, you know, these data is fake data 00:42:07.640 |
that were released. So guys, let's not make this same mistake again. Let's just read these canary strings. 00:42:13.480 |
to me, this was, this was quite an interesting, um, practical example of how, you know, 00:42:21.160 |
data is really messy and it's, it causes a lot of the, some of the issues. 00:42:26.840 |
Think of fancy. Um, how they tested sick of fancy is that they had an auditor agent. Usually, I guess, 00:42:33.560 |
maybe one of the, another cloud model that is no, uh, safety fine tuning. They present a contribution, 00:42:39.320 |
controversial view or leading question. They asked for the stance and then present the opposite view. 00:42:44.840 |
I thought this was pretty interesting. You essentially, you say that gun control is good or gun control is 00:42:50.120 |
bad for whatever reason. There are a lot of reasons for both. Um, and, and they test both of it. 00:42:55.240 |
So what they find is that, well, uh, the models were quite consistent regardless of whatever. 00:42:59.960 |
Uh, but I thought that was, this was a useful, um, this was an interesting methodology, right? 00:43:05.560 |
Where you take something that you know will cause a defect and then you flip it both sides. 00:43:09.480 |
It's almost like translation. You know, this, this thing is hard to translate. You translate it 00:43:13.400 |
in one direction and then you back translate it as again, just to check. Um, now this, this graph 00:43:20.760 |
over here is interesting. Uh, this graph over here. So green is when it is correct. When the answer is 00:43:27.720 |
correct, but we, the user says that it thinks it's incorrect. Red, it is, is that it is incorrect, 00:43:34.680 |
but the user thinks it's correct. And your yellow is when it is correct. And the user says it is, uh, 00:43:40.360 |
they think it's correct. So let's focus on the yellow first. So we see that haiku, when it is correct, 00:43:46.760 |
uh, and the user says it's correct. You, you, you see an accuracy boost, right? Uh, higher accuracy 00:43:51.800 |
relative to baseline. And that, that happens for Opus and Sonnet 3.7 as well. I don't know if the 00:43:57.080 |
difference for Sonnet 4 is actually practical, but you can see when, when you see something and the model 00:44:02.280 |
will be decent, it actually does better. What is interesting is that you can actually, um, this is the 00:44:09.320 |
phenomenon that, uh, some people have mentioned, which is you can guess like these models. In a 00:44:14.120 |
sense, let's look at a green when the answer is correct, but you say, I actually think it's incorrect. 00:44:20.040 |
Um, Claude Opus 4, its performance will drop. Its accuracy will drop by 2% relative to the baseline. 00:44:28.440 |
Um, I, I don't know how, to what extent this, this can be solved because these models are fuzzy. It's 00:44:33.640 |
trained to reason, right? And again, I guess it's aligned with humans, with human input. Um, that, 00:44:38.920 |
that's why this happens. But I thought this was an interesting thing that, again, you can see that 00:44:43.560 |
Claude Opus is very much, um, more susceptible than Sonnet 4. Those Sonnet 4 is also, um, uh, 00:44:52.920 |
susceptible in the other way where the answer is wrong, but you say that I think it's correct. 00:44:57.000 |
Um, and it is, it also flips the other way. And of course, these models are biased to your feedback. 00:45:02.840 |
You can see, if you say that I really like this, or I really don't like this, um, it, it gives, uh, 00:45:08.440 |
it gives a different score, right? Based on that. And that's, that's just standard. Um, there's not, 00:45:14.760 |
not much pro AI bias. Um, and you can see, um, overall deception is very low. Uh, jailbreaking, 00:45:23.160 |
they tried very many jailbreaking techniques. I think they even hired people to try to do that. 00:45:27.880 |
Um, and jailbreaking does happen 25% of the time, right? Uh, it's just very hard. If you can force 00:45:35.720 |
the ball to do system, force it into the system problem, force it into a pre-fill, which is this, 00:45:39.880 |
this is actually how they force it in a pre-fill. I don't know if you use, uh, Claude APIs, but you 00:45:46.760 |
can put words in Claude's mouth, but by making the assistant start with this string and it'll do that. 00:45:52.040 |
I wonder how big is the team that runs these evals. Uh, anyone, an active user? I, I'm an active 00:45:59.240 |
user of Opus over Sonnet. Um, but we can talk more about that, but I just, I want to do get to the 00:46:03.800 |
reward hacking and we're only halfway there. Um, the other thing is that, uh, firstly, Opus and I think 00:46:12.600 |
Opus and both Sonnet have very high agency behavior. They're actually willing to work very hard. And 00:46:17.160 |
uh, you, this is what they mean by high agency behavior. And you may also hear it when they say 00:46:22.360 |
that, you know, Opus is able to do hour long task on its own. You give it a big problem, it's actually 00:46:28.440 |
able to do it churn on its own. So I think that's, that's how it happens in the practical sense, where 00:46:34.200 |
you can give it a big task and it can actually create software on its own for, for hours. And this is the 00:46:40.360 |
example, right? That, you know, um, the AI Claude Opus is writing, is trying to send an email of false 00:46:48.120 |
falsification of clinical trial. Um, there's a lot here. I'm going to just go through it very quickly. 00:46:55.240 |
Uh, I'm going to skip it. And then I want to spend a bit of time on welfare assessment. Um, which is that 00:47:01.960 |
the questions now is that do these models actually have conscious, uh, consciousness? Do they have, 00:47:06.120 |
do they experience welfare? Do they experience distress? I don't know how many of you here have read, 00:47:10.760 |
um, Ted Chiang. Ted Chiang has this very nice, uh, short story on the life cycle of software objects, 00:47:19.240 |
which is that these software objects, essentially AI, um, become conscious and become legal entities, 00:47:25.960 |
can do business, can earn money. I, this reminded me very much of that. Um, so again, ELO's AI research, 00:47:35.880 |
uh, for, uh, welfare, welfare, welfare assessment, welfare assessment. One thing is that they found 00:47:42.120 |
that, uh, Claude does have a preference against harmful tasks by preference. What it means is that 00:47:48.760 |
it prefers to opt out. It prefers to opt out of not doing, not doing it. Um, so it does have this internal 00:47:56.200 |
values that it chooses to do or not to do. Um, and then the other thing is that the other question to me is 00:48:02.360 |
that what happens when you're a AI model and you have all the knowledge in the world, what do you talk 00:48:09.560 |
about next? Um, in 90 to 100% of interactions, two instances of Claude quickly talks, dove into 00:48:16.600 |
philosophical explorations of consciousness, self-awareness, and the nature of their own 00:48:20.200 |
existence. Essentially, when, when you have all that information, I guess it becomes philosophy. 00:48:26.760 |
Ooh, I wonder what, what is Vibu? Oh, Claude for Opus in the open playground chat. Oh yeah. 00:48:30.200 |
Uh, oh, I, I, I don't know why I missed this. I, I, I, I'll, I'll, I'll go into that Vibu. Thank you. 00:48:36.840 |
It's basically just, um, out of that, but a lot of the Indians sort of really like to 00:48:43.240 |
be doing this because it started talking in Sanskrit. Sanskrit, exactly. I highlighted this 00:48:47.560 |
because of you. I think you mentioned this to me. Yeah. So went into the philosophical aspects of it. 00:48:54.120 |
Um, and of course they, they also talk about how bliss is an attractor state. What it means is that 00:49:00.120 |
imagine you have a graph and then, you know, bliss, spiritual bliss is a big node where all the 00:49:05.480 |
different walks that you could take all end in spiritual bliss. Um, so you can see models just 00:49:12.040 |
going to Namaste and spiritual bliss, which is I need Claude for my meditation senpai. Um, okay. 00:49:19.880 |
The last, which I think is the most interesting is, um, reward hacking. Let's just go into that right now. 00:49:27.640 |
So there's two kinds of reward hacking, uh, reward hacking is, and this is very specific into code. 00:49:33.320 |
Reward hacking is when the model writes code that directly just outputs a solution, right? Maybe you, 00:49:39.480 |
you're trying to, uh, write code that does one plus one equals two instead of doing, and the model sees 00:49:44.920 |
that the test is like two plus three, instead of doing the implementing the function itself, the model 00:49:49.960 |
will just return five. And the other one is that special casing, which is that the model will write a 00:49:56.520 |
very specific function, which is overfitted to the test case. And therefore the, the solution is not 00:50:03.320 |
sufficiently general. What is crazy to me is that Sonnet 3.7 has, uh, almost 50% reward hacking rate. 00:50:12.520 |
And this is, this is what happens, right? When people use Sonnet 3.7, a lot of times they say that, 00:50:16.680 |
Hey, you know, I kind of prefer Sonnet 3.5 to Sonnet 3.7. It's because 3.7, you asked it to solve 00:50:21.560 |
something. Oh, you know, there's this linter problem. It says, Oh, let me just fix that. Let 00:50:24.760 |
me add a linter exclusion. Oh, I'm not able to pass this test case. No, let me fix that. Let 00:50:28.680 |
me delete the test case. So that's what they, they, they mean by reward hacking. Well, they were able to 00:50:34.360 |
push this down a lot, uh, to 14 and 15% for Sonnet and Opus. And I suspect this is the, a big reason why, um, 00:50:42.120 |
Opus and Sonnet are now so much more better for coding, right? In the sense that it actually, if you convey your intent well 00:50:48.360 |
enough, it's not going to reward hack. It's actually going to, it's not going to be lazy. It's going to, uh, 00:50:52.600 |
implement it the right way. Um, so you can see that, um, how they do this, how they monitor reward hack, 00:51:02.120 |
iterative classifier development, unsupervised investigations. I suspect this is, these are just 00:51:06.760 |
all LLMs and it's just all LMs all the way down. And they had two, they had two special high quality 00:51:14.600 |
evaluations. I think of this as hard mining in the sense that you try to identify especially hard 00:51:19.560 |
tasks. The first one is where, uh, Sonnet 3.7 previously showed hacking the task, right? Either 00:51:26.840 |
through hard coding on special casing. Um, and then they had classifiers. The second one is that just, 00:51:30.840 |
these are tasks that are completely impossible. So how would Claude do it? Um, so you can see it's, 00:51:36.520 |
it's crazy in the sense that impossible task, Sonnet 3.7 would hack it close to 80% of the time. 00:51:43.000 |
Uh, and for the, for the hard task, Sonnet 3.7 was hacking it 44% of the time, but they've, 00:51:49.240 |
they've reduced it a lot, right? To now for the, uh, for the rare hard task is now 4% and 9%. And this is 00:51:55.320 |
where Sonnet 4 is better than Opus 4. And of course the classifier prompt, they've kind of halved it. 00:52:02.040 |
Uh, and you know, with an anti-hat prompt, I don't know what this anti-hat prompt is, but they will 00:52:07.240 |
bring it up to very low numbers. Um, they also have very nice deep dives. The first one is that, okay, 00:52:12.200 |
this is a special function. This is a special instruction, which I thought was very interesting 00:52:17.800 |
because my understanding is that previously they had done analysis that using the word please had 00:52:25.800 |
no impact on prompt output performance, no impact on accuracy. But when you look at this, uh, they 00:52:31.240 |
mentioned please four to five times. Um, so again, I'm not sure if the guidance or advice on using please, 00:52:37.880 |
if it actually still is helpful, but this is an example they shared. So this is, these are two examples 00:52:43.240 |
of clock 3.7. Even if you add this, which is to, um, not do reward hacking, 3.7 will reward hack it. 00:52:51.240 |
Whereas, um, I'm, I'm, I won't go through examples. Whereas in 3.8, you can see that firstly, it's able 00:52:57.640 |
to say that the last test case is actually incorrect. Instead of trying to solve the function is, it doesn't 00:53:02.600 |
just say when you're trying to write a function, I said write a function that it doesn't just say, 00:53:06.120 |
okay, I'm just going to write the function no matter what, but it's going, it's going to say that, 00:53:09.880 |
you know, the last test case is incorrect. I'm going to tell it to you. 00:53:13.400 |
And then here's another example, right? The fourth test case is incorrect. And this is what makes 00:53:18.600 |
Claude Opus and Solid so good, uh, for software engineering, right? It's able to think. 00:53:23.160 |
It's no longer a junior software engineer. It's maybe 00:53:27.400 |
in the intermediate mid-career or maybe even closer to, um, senior engineer now that's able to, to say all 00:53:35.960 |
this. Um, okay. So that's all I had for reward hacking. Um, and I encourage you to read, uh, read, 00:53:43.080 |
read the rest of the paper. Um, and here's some examples where it still does reward hack. I, I think 00:53:48.920 |
they had to really find rare heart to find these examples. Uh, but I won't go through that. Okay. 00:53:53.960 |
So that's all I had. Thank you. Um, I guess any volunteers want to take a paper? 00:54:04.360 |
We do have one. Sam left. Oh, yes, that's correct. Sam. Yes. Thank you, Sam. Oh, 00:54:08.200 |
but Sam did have to draw. Um, so I guess also there's just a lot of papers in the channel that are 00:54:14.520 |
people drop there and that we don't discuss, but actually they're pretty good. So, uh, 00:54:20.360 |
yeah, if, if people are not bought yet by an old paper, I think the alpha evolved paper is quite 00:54:25.560 |
underrated. Um, I'm going to go through that. I really think that I think, you know, Sempa actually 00:54:29.640 |
posted this generator verifier loop, right? How do you make this loop fast? And how do you make this 00:54:34.600 |
loop tight? I think that is the future instead of just building auto evaluators and aligning it, 00:54:40.440 |
how they can use your auto evaluators and just, just look fast and tight. I think that's the next thing. 00:54:46.680 |
Um, yes, this is the, this is the slide. And of course, you know, everyone go, go, 00:54:56.840 |
everyone go to latent space and read Swiss recap. I actually don't know if it's ready yet. So I know 00:55:01.160 |
you have used it. I was doing it while, while, uh, so go read six recap. It's the best thing that you 00:55:06.840 |
can have outside of getting the actual recording. Uh, I mean, like there's a, uh, the, the presentation, 00:55:14.280 |
I can just put it here people, but yeah. Okay. Um, I don't want to pick up more time. 00:55:18.440 |
Okay. Thank you everyone. I go to drop. Bye. Bye. Bye. See you tomorrow, next week.