Claude 4 + Self Adapting Language Models

00:00:00.000 | Okay. Okay. Oh, cool. So first paper with a background since now we're recording,

00:00:05.360 | SEAL came out this week, earlier this week about self-adapting language models.

00:00:10.480 | Basically, we always hear this promise of, you know, once you do a train run, your model is

00:00:16.020 | stale, right? We used to care a lot more about this in the past because, you know, like the

00:00:20.000 | original ChetGPT had like, okay, it was trained on information up to like October 2023, and then

00:00:25.720 | it doesn't know anything in the future. So people were always hoping like, you know, can we have

00:00:30.580 | real-time learning? This is separate from like in-context learning, but as you use the model,

00:00:35.200 | can you actually update the weights and make the thing better? And that's always kind of been this

00:00:39.260 | hope. And then it kind of died out because models started to do tool use and can now browse the web,

00:00:44.060 | and it's not really as much of a concern. I don't know about you guys, but I don't look at when models

00:00:48.600 | are trained anymore, but they do still show it. Like the bigger we do pre-trained runs, like GPT 4.5

00:00:55.300 | and stuff, they do have a knowledge cutoff. And then from there, that's kind of used in the test

00:01:00.600 | time training. But I don't think anyone could just tell me where, like what the training cutoffs for

00:01:07.540 | these models are anymore. No one really cares anymore. But anyway, there's this concept of

00:01:11.480 | continual learning and always adjusting your parameters and learning more as you do stuff.

00:01:16.860 | Okay. I'll be monitoring chat. This is a quick paper. So, you know, let me know if anything is

00:01:24.020 | interesting and we'll dive deeper. Otherwise, I'm just going to go through it pretty quick.

00:01:27.220 | For people that are joining, these are recorded, but on YouTube, you can find them on Latentspace TV.

00:01:33.020 | Security-wise nightmare, true. It is a bit of a security-wise nightmare, but it's actually not

00:01:38.660 | that straightforward. The interesting thing with this paper that I did like is they do do a lot of

00:01:43.860 | citations. So they basically share their approach for everything, and then they go over a bunch of

00:01:48.780 | previous work. And, you know, it's nice if you want to dig back into what are other approaches for

00:01:53.540 | this. Like some approaches are if you use alternate models. So like you have state-space models. Can you

00:02:00.120 | continually update that state module and kind of have this new context in there? Then there's like

00:02:07.680 | in-context learning and how that approaches stuff. And then they kind of show a couple experiments of what

00:02:11.960 | they do. What they basically do is for samples where the model is struggling, they do these self-edits. So

00:02:20.520 | they kind of generate synthetic data. Then they do RL on a model to kind of train in these like

00:02:27.120 | new terms with Allura. And then if it has like, if it actually improves the performance, then they

00:02:33.500 | train the base model that they were doing with SFT. And they're like, okay, it works. And I was like,

00:02:38.680 | oh, it's pretty cool. They get it to work. Performance goes good. And I read through this

00:02:42.180 | paper. I'm like, oh, pretty interesting. Pretty interesting. Then I get to some of the last

00:02:46.960 | section, basically the limitations. They have this catastrophic forgetting where basically, you know,

00:02:52.840 | as you do these updates sequentially. So let's say like there's a benchmark, you try a few questions

00:02:58.200 | and the model is not doing well. Okay. You start to train in examples on these questions and then the model

00:03:05.060 | starts to better like understand this and then it starts to perform better. But once we, once we go from

00:03:12.320 | there, meeting is being recorded. It's in the middle of the screen. Your meeting being recorded window is in the

00:03:20.280 | middle of the screen. Is my screen recording working properly?

00:03:23.480 | It actually looks fine to me. Okay. Okay. Well, good catch. It seems like it's working.

00:03:28.700 | But basically TLDR. So as you have like a few questions, let's say you have like RKGI style questions

00:03:35.460 | or like medical questions, for example, after you do a few edits and it starts doing better,

00:03:40.580 | once you do like the fifth, sixth, seventh, eighth edit, since you're continually doing training,

00:03:45.980 | it starts to forget the first ones. So after reading this entire paper, you know, after you do

00:03:51.920 | sequential updates, can the model adapt repeatedly and preserve prior knowledge? Guess what? It can't.

00:03:58.320 | Performance on earlier tasks gradually declines as the number of edits increasing, suggesting that

00:04:03.780 | CL is still susceptible to catastrophic forgetting. So, you know, what's the point here? Like as you do this

00:04:10.560 | self-iteration, it starts to forget previous iterations. And like, yeah, that, that kind of

00:04:15.780 | killed it for me, but you know, the paper was still decent reading through it. So let's, let's still go

00:04:20.980 | through it. So seal is kind of a framework that enables LLMs to self-adapt by generating their own

00:04:26.380 | fine tuning data and update derivatives. So I guess that's the other little difference, right?

00:04:31.120 | It uses itself. So it uses the model that it's using to generate synthetic data on questions. And

00:04:38.220 | then it judges whether the, you know, data that it generated helped improve performance. If it did,

00:04:44.520 | then it does kind of SFT on this data. So it's like all recursive on its own model. Now, the other

00:04:50.800 | limitation of this is you need to know the downstream task performance. So like, you can't do this on open

00:04:56.400 | ended stuff. I kind of bring this up in the limitations here as well. Basically, yeah, you,

00:05:02.400 | you need to know what the, where is, where is it? Context dependent evaluation. We assume that every

00:05:09.160 | context is paired with explicit downstream tasks. So, you know, you, you need labeled input outputs. You

00:05:15.560 | need to be able to verify whether your output is correct. And then like a potential solution to this

00:05:20.760 | is let the model not only generate the edits, but also generate its own evaluation of questions.

00:05:25.720 | I think it starts to get very meta there. Once you not only have generated outputs of generated data and

00:05:33.720 | generated examples, but yeah, a limitation is it has to do it. It gets kind of meta and hard because

00:05:39.700 | their example is training this on like LAMA 1B, LAMA 3.2 1B for ARCTASK.

00:05:47.240 | And N7B. And genuinely, I don't know if like, let's say you're doing something like open ended ARCTASK

00:05:54.200 | without verifiable rewards, like not verifiable rewards without really, like really without an

00:05:59.320 | answer. Right. So if you have a puzzle, you don't know the answer, you generate sample puzzle style

00:06:06.200 | examples. You know, the model can't really verify this output that well. But anyway, that's just like,

00:06:12.520 | yeah, I guess that's a limitation of a small model. You could do it at bigger scale. Okay. So,

00:06:16.760 | basically given an input, the model produces self-edit generations for synthetic data. So,

00:06:22.840 | you know, rephrase the question. What are questions deviated from this passage? And then they test it on

00:06:29.640 | the same question without context and see if it performs well. Then through SFT, these self-edits are,

00:06:35.560 | you know, constantly changing the actual weights. They use an RL loop using downstream performance as a

00:06:42.120 | reward signal. And yeah, it's kind of cool. It updates parameters. Okay. So here's the key

00:06:48.520 | question in this paper. We want to explore an intriguing hypothesis. Can an LLM self-adapt by

00:06:54.520 | transforming or generating its own training data into a learning and learning procedure? The answer,

00:07:00.120 | kind of, but you know, it still falls short after a few changes. So given a new task with current LLMs,

00:07:07.240 | we consume and learn from data as is via fine tuning or in context learning. However,

00:07:12.440 | data may not be in an optimal output format or learning. So they kind of have this like outer

00:07:17.720 | loop, inner loop of training. Basically, here's how it is. So in each RL outer loop iteration,

00:07:24.600 | the model generates candidate self-edits derivatives of how we update the weights,

00:07:29.560 | applies updates and evaluates performance on the downstream tasks, and then uses the reward to

00:07:34.280 | self-improve its gradients. Basically, letting them to do this is good. RL generates self. So

00:07:42.600 | this is called seal. This is the RL model that kind of does this. I think we can skip some of the specifics

00:07:48.760 | on the RL, but they can't use a off policy model. So they use like rejection sampling and SFT. We'll

00:07:55.640 | kind of go into that in a bit. Okay. They evaluate it on two applications. So one is can the model learn

00:08:02.520 | new factual knowledge? So this is basic benchmarks. Can we basically, given a task, strain in the relevant

00:08:12.280 | information and then retest on that same question without the in-context learning, without giving it

00:08:17.080 | information, can we improve? And they find out yes, they kind of can. The other one is, here's kind

00:08:24.200 | of how they do it. So we fine-tune synthetic data by the seal model. Our results show that following RL

00:08:30.040 | training and fine-tuning on self-generated synthetic data improves QA performance on the no-passage-in-context

00:08:37.080 | variant of SQUAD from 33.5 to 47%. So, you know, it kind of works. And it does outperform synthetic data

00:08:45.160 | gen from GPT 4.1 on the same data. So kind of cool. Maybe, okay. For the RL loop, why did they use binary

00:08:54.840 | loss? Whether the downstream accuracy improved or not? Couldn't they have made the reward signal something

00:08:59.080 | like how much better is the model performed on the task? They go into this in a bit. But yeah, it's

00:09:04.600 | interesting. They do binary loss. Okay. They also evaluate it on a few-shot learning, a sample subset

00:09:10.840 | of ARC benchmark. And yeah, it does better. And they specifically choose LAMA 3.21B because that's

00:09:18.200 | a model that wasn't trained on ARC. The model came out before the ARC benchmark. So it hasn't seen that

00:09:22.600 | data at all. Okay. Synthetic data gen, seal build them. So this is where they kind of... Section 2 is

00:09:28.920 | pretty good. You know, if you're interested in this sort of like constant updating training,

00:09:36.200 | these are pretty good papers that they reference on previous approaches. So one is on synthetic data.

00:09:41.720 | It's pretty common for pre-training tasks. Here's a few papers. Instruction tuning, task-specific data

00:09:48.440 | augmentation. And then here's just a bunch of papers that you can go through. So like 15 to 22 are all

00:09:54.120 | related synthetic data gen for constant fine tuning. Seal builds on this type of graph-based prompting

00:10:02.600 | by using RL to train a generative policy that maximizes downstream utility of synthetic data when applied

00:10:08.920 | for gradient-based self-updates rather than relying on static or heuristic generation strategies

00:10:14.440 | that are manually tuned. So it kind of uses this RL to get a signal of how well is this...

00:10:19.960 | How well is this output? And then they maximize... Actually, we'll go on a bit. Okay. Knowledge

00:10:26.360 | updating. This is pretty straightforward. So a bunch of papers on knowledge updating. Some try to do like

00:10:32.920 | specific parameters to improve facts. Other do fine tuning with information of in-context. We work on the

00:10:40.120 | ladder. Okay. Here's how SEAL framework works on QA. Test time training. This is a whole section on

00:10:50.200 | what's been done there. Okay. RL. So SEAL applies RL not only to optimize the final outputs or trace

00:10:58.200 | revisions, but to output the generation of the self-edit data. So they're doing RL to see whether the

00:11:04.760 | generation... Whether the synthetic data that they generate has a performance update... Performance

00:11:10.360 | increase or not. Okay. More background on meta-learning. There's an adaptation strategy of how to do these

00:11:19.560 | effective self-edits. The goal is how to learn effectively from task context. Da-da-da-da. More

00:11:25.480 | background. Self-improvement. More background. RL-A-I-F. In contrast, they view self-improvement through

00:11:32.440 | interaction with external data as more powerful, scalable paths. SEAL learns how to best utilize

00:11:37.880 | external data for self-improvement. Okay. Am I taking... Am I understanding this correctly? If they take

00:11:43.160 | context evaluation pairs as input, SEAL model improves these. The synthetic data that may SFTI improve

00:11:48.840 | pairs. Kind of. Basically what happens here. Okay. So methods. Here is SEAL again. Model is trained to

00:11:55.160 | produce self-edits directly through token generation with the data provided in the model's context. So

00:12:00.840 | you give it a question in context. It produces kind of this self-edit of generation of... Okay. Here's

00:12:06.520 | like two way examples on this. Then self-edit generation is learned by RL. So the SEAL model is

00:12:12.920 | actually that self-edit generation model where the model is rewarded for generating edits that increase

00:12:18.920 | the performance. So basically, you know, you have this model that augments the in-context learning data.

00:12:26.280 | Then you do a little LoRa train on it. If performance goes up on the task, then that's good. That's

00:12:31.640 | preferred. And that's the reward signal. SEAL can therefore be interpreted as an algorithm with two

00:12:36.920 | nested loops. Outer loop, which optimizes the self-edit generation. This is that sort of RL step.

00:12:42.600 | And the inner update loop, which uses that self-edit generation to update the actual weights with SFT.

00:12:48.200 | So that's kind of what CL is. Here's the framework. Okay. Bunch of fun RL stuff. I don't think we need

00:12:55.400 | to go super into detail, but it's actually very, very easy, readable, approachable. Easy thing to do.

00:13:01.160 | Take section 3.1, throw it in ChatGPT and have it break down these terms for you or just read it yourself.

00:13:06.600 | It's not that hard, but basically it's what the optimization is, right? So that self-edit generation

00:13:12.840 | progress, uh, process with RL, uh, they're letting the model take its action, its own self-edit. The

00:13:19.480 | reward is based on the performance of, um, the reward is based on the performance of the model without the

00:13:26.440 | context. So fancy terminology, but actually very, very straightforward. Um, basically reward is assigned to

00:13:33.640 | give an action in our settings depends on the parameters that the time is taken. So basically,

00:13:38.680 | you know, without the context after training does, does performance go up. Um, they try various on

00:13:46.920 | policy methods such as GRPO, PPO found that they're unstable. They use this rest rejection sampling plus

00:13:53.880 | SFT framework. Uh, basically rejection, uh, rest can be viewed as an expectation maximization procedure,

00:14:01.720 | where the expectation step is sample candidates outputs from the current model policy. Then

00:14:06.920 | the end step is kind of, uh, reinforce those that receive positive reward. So, you know, that's,

00:14:12.520 | that's kind of their, their thing, more fancy stuff here. Okay. Um, domain instantiations,

00:14:20.280 | they evaluate in two domains. So see shot learning. Okay. Can we integrate new information in model

00:14:27.880 | weights? So evaluated using no constant variation of squad, basically, um, you know, given in context,

00:14:35.560 | can we generate data train and then evaluate without context? How does this thing perform? Uh, and then

00:14:42.440 | the ability to generalize two tasks after only seeing a small number of examples, this is on arc.

00:14:48.360 | So basically this is the Lama example, uh, model was never trained on arc. Can we do some of this

00:14:54.360 | training and give it some examples and then kind of generalize to solving this? Yes, it can. The other

00:14:59.880 | one, if we take away the in context and do a no context variation, does it actually learn to use what

00:15:06.120 | it's been trained? Uh, yes, it kind of does. Uh, so first one is knowledge incorporation. That's, that's

00:15:11.080 | what we just talked about. Can it effectively incorporate information provided in a passage? Get rid of

00:15:18.040 | this. Um, basically prompting the model to list several implications derived from the content. The output may

00:15:25.480 | include inferences, logical consequences, restatement of the passage. Uh, so the prompts are basically stuff

00:15:32.600 | like, you know, rewrite the passage in different ways, rewrite it in a QA format. And then after

00:15:38.520 | that, you know, see how it does. So here's kind of the Apollo program before, after, um, but even after

00:15:45.160 | natural NASA reached internal agreement, it was far from smooth sailing self-edit. The Apollo program

00:15:51.560 | faced opposition from this, uh, evaluation, you know, there's the question without the context and then

00:15:56.840 | can it get the answer right? So given a passage, the model generates synthetic data, which is the self-edit,

00:16:02.520 | in the form of implications of the passage. Then fine tune on these outputs using a LoRa.

00:16:08.520 | The model after the training is evaluated on the passage without access to the original text. So this

00:16:15.720 | is kind of their little framework. Uh, the resulting accuracy is the reward for the RL. Uh, if the

00:16:22.360 | accuracy goes up, if they're like self-edit LoRa fine tune was successful, that's a good reward. We should

00:16:28.360 | use this. If it goes down, that's not good reward. Um, then the self-generated statements are, uh,

00:16:35.480 | used as training data for SFT. Then they kind of, you know, since there's very little data per sample,

00:16:41.480 | they, they basically do LoRa's. This is where I was like, oh, it's kind of interesting how they start to

00:16:45.560 | work, but, and it does kind of generalize. Right. But then, um, you know, you're directly training it on

00:16:51.560 | examples of these tasks and eventually it, it starts to lose previous steps, which is unfortunate.

00:16:59.160 | Okay. Um, few shots set up. Here's kind of more examples. So this is the other one. This is kind of,

00:17:06.680 | the arc one, few shot learning. So can we give it a few examples of how to solve our questions

00:17:12.120 | and can it generalize? Um, yes, we can. We define a set of tools. So now, you know, you have data

00:17:19.880 | augmentation, you have optimization, you can, you can generate data. It does it, it works. Results,

00:17:25.080 | few shot learning. Uh, so for the arc one, we carried a subset of the 11 tasks from the eight evaluation,

00:17:31.640 | very few tasks. The model is trained only on 15 self edits per task out of these like 11 tasks from the

00:17:39.880 | eight evaluation set, uh, apply to generate 15 updated models, reinforce the ones that corrupt,

00:17:46.040 | that correctly adapt models. So some of these don't work. Some of them do. We evaluate it by generating five

00:17:51.960 | self edits per held out evaluation, apply each one individually. Yeah, it works. Um, kind of, kind of,

00:17:57.720 | kind of, it's not good. It's not as good as other approaches. Um, so it substantially improves

00:18:04.040 | adaptation success rate compared to baseline. So baseline, 20% adaptation, 72%. So something's

00:18:10.520 | happening, uh, compared to other approaches. So train on passage only, um, much better than that.

00:18:17.800 | Train on passage plus synthetic data versus GPT. Oh, sorry. Sorry. On this, uh, this Oracle TTT

00:18:24.440 | performed a lot better. Oracle TTT is kind of, um, the model performs test time training using optimal

00:18:30.840 | human crafted configuration. So that's the best, basically human give examples of how to solve it.

00:18:36.360 | What's the best approach that gives you a hundred percent performance, uh, in context, learning given

00:18:42.440 | few shot examples is worse than their edit. Um, and then there's another one TTT plus self edit. So,

00:18:50.760 | uh, basically using few shot examples and doing this sort of like chain of thought style reasoning

00:18:57.800 | did a little bit, but not as good as seal, seal work there. Okay. The other one is knowledge

00:19:03.400 | incorporation. This is stuff where we don't have any, um, understanding of the previous example. So

00:19:09.960 | this, uh, this example was, wait, sorry. Uh, sorry. That was the one where we didn't have any example where

00:19:16.680 | this was Lama. It hadn't been trained on arc and now it can start to do it. This is basically a,

00:19:21.480 | uh, can we get rid of this in context question, right? So do two rounds of rest with batches of

00:19:26.200 | 50 context. Um, basically, you know, generate synthetic data based on the placed on the passage,

00:19:32.520 | then we test it on how well it performs without it. So yes, it works. Accuracy goes up. It does even

00:19:39.240 | better generating its own synthetic data on QN7B than it does with GPT 4.1 on this style,

00:19:45.160 | since it has this sort of RL step. Um, yeah, that's, that's kind of the, it's kind of the paper.

00:19:50.680 | There's not much more. We went over limitations. Um, the rough thing is as you do this live editing,

00:19:57.560 | uh, you know, it really tailors off. You have to do this per sample, per example, after you do it

00:20:02.760 | like seven or eight times, it kind of forgets how the old ones perform. Uh, Laura's typically don't

00:20:08.360 | affect general performance as much, but the other performance, the other issue with Laura's is,

00:20:13.320 | you know, they're not super effective, right? They don't have the biggest performance. So there's that.

00:20:19.080 | Um, computation overhead. This is very expensive. You can't just like do this at scale, right? It's

00:20:25.320 | very, very slow. So the reward is more computationally expensive than any other RL training loop. Uh,

00:20:31.240 | somewhere in here, they mentioned like each self edit takes about 30 to 45 seconds. Uh, you know,

00:20:37.880 | there's a lot of overhead in that. And it's using like two H100s for all this, since you now need like

00:20:45.560 | multiple instances of the model. Uh, RL is also not, not efficient in this, but yeah, interesting.

00:20:53.880 | Um, basically they're like, you know, here's, here's our thing. Once web scale data is exhausted,

00:21:00.280 | progress will hinge. We need models to generate their own high quality training signal.

00:21:05.640 | I think that's fine, but you don't have to do it live, right? We can do synthetic data gen and offline

00:21:11.080 | RL. This doesn't have to be RL. But anyway, it's a synthetic data generation model that they, that they

00:21:17.320 | do that they use live. Um, yeah. We can imagine a future in which LLMs can ingest new data such as

00:21:24.280 | academic papers and generate large quantities of explanations, implementations for themselves,

00:21:29.240 | using their existing knowledge and reasoning with the in context data. The interactive loop of self

00:21:34.600 | expression and self refinement could allow these models to keep improving on rare unprecedented topics,

00:21:39.880 | even in the absence of additional external supervision. Very cool. Cute future, um, would be great if it

00:21:47.320 | works, but yeah, that's kind of, that's kind of it. Um, that's quick paper thoughts, questions, comments

00:21:55.320 | before we move on to the better paper. That's crazy. We were crazy. They went so fast.

00:22:02.120 | I want, I want time for a clone.

00:22:05.560 | Okay. Not much, not much in chat. So I guess we had some discussion there.

00:22:13.560 | Yeah. Six findings on the chat. Oh, shoot. Uh, I need to share.

00:22:23.080 | Okay. Now I can share because this is a new zoom installation. So I just want to make sure.

00:22:28.280 | Can you guys see it? Yeah, we see it. That works.

00:22:36.680 | Oh, okay. I see your Slack.

00:22:38.280 | Oh, shoot. Can you see the cloud system card? Yeah. Nice.

00:22:42.200 | Yes. Okay. So cloud system card came out in May, 2025. I think there are a lot of interesting things here.

00:22:48.680 | Um, and you know, everyone's saying right now that pre-training is less important. I actually don't

00:22:53.800 | think so. Um, and a lot of techniques, very few, there are very few techniques mentioned here, but

00:22:59.480 | when they do mention techniques, it's almost always related to pre-training. So let's go into it.

00:23:03.720 | Um, I have gone through this. I've gone through only up to 80 pages of it, which is the end of

00:23:09.880 | reward hacking, which I think is very interesting. Um, and then over at each page, I have keys to

00:23:15.640 | the keys for when it's interesting. So I'm gonna try to take you through that. The first thing

00:23:19.000 | that's really interesting, new cutoff date. Um, I wonder how this was done. Did they pre-train

00:23:24.360 | completely from scratch or did they just do continue pre-training? We never know. I mean,

00:23:29.400 | if someone knows, you know, just DM me, I would love to learn how, how that's done. Um, the second thing

00:23:35.480 | that, uh, and Trophic has done is that when they do chain of thought, uh, they say that they opt to

00:23:42.680 | summarize lengthier top processes, right? But this happens only about 5% of the time.

00:23:46.360 | Uh, so what this means, and you can also still get the full top processes with no summarization.

00:23:51.480 | So what this means is that unlike, uh, what OpenAI has done, where OpenAI doesn't give you the full top

00:23:57.240 | process and Trophic is willing to give you the top process as well. Uh, I don't know if OpenAI has

00:24:02.840 | changed their approach, which is not showing you the chain of thought. Um, if anyone has any update on

00:24:08.200 | that, also please do let me know. Um, so, so this, this, these are some interesting departures from what

00:24:15.480 | OpenAI is doing. Uh, and then the release, okay. There's a lot of standard releases. They go through

00:24:20.680 | AI safety standard. The one interesting, and of course, uh, CBRN. So these are their standard, uh,

00:24:27.400 | safety chemical, biological, radiological, and nuclear. We won't actually go through these.

00:24:30.920 | Um, but it's safe to say that this is how they think about it. So what is interesting here is that

00:24:38.040 | Claude Sonnet is still ASL2, which is the same as Claude Sonnet 3.5.

00:24:43.720 | ASL2 is the, not as unsafe, it's a safer model, but they have decided to say that Opus is under ASL3.

00:24:51.880 | And I have, I have a quick definition of ASL3 is this. Our ASL3 threat model is model

00:25:01.560 | scaling and prioritization. Can it cause an economic catastrophe by unsophisticated, uh, by businesses or

00:25:10.120 | hacker groups, right? And help them attack poorly assisted, hardened targets. Now ASL4.

00:25:17.000 | Oh, shoot. I wish that ASL3 discussion was, was in there. But anyway, ASL4 is this model is not strong

00:25:25.240 | enough to do multi-step operations that allow low resource nations to operate as top tier nations.

00:25:33.960 | Um, I won't mention any conditions, but this is how they think about a threat model. So, uh, Sonnet is not

00:25:40.120 | even ASL3. Um, uh, but Opus is ASL3.

00:25:45.560 | So a little bit more background here, actually, they changed the definition of ASL3 like two weeks

00:25:52.040 | before the release. They, they, they changed the scope. They cut something out. I will try to find

00:25:57.880 | what it was, but basically, uh, they, they got a little bit of talk about this, but right before

00:26:03.960 | they released these in like, uh, I can find the post on May 22nd, they changed what ASL3 is. So

00:26:11.480 | did they make it lighter? They removed restrictions to call this ASL3, which is why people, it wasn't

00:26:21.320 | a major change, but, um, you know, slight changes. Yep. So this was, so that's a little bit about the

00:26:29.640 | safety. Uh, now let's go back. You can see this as huge as paper. Um,

00:26:34.440 | Safeguard's results. Uh, I don't think they went through very much of this. Uh, but long story short,

00:26:42.440 | uh, you can see that Opus, uh, they, they don't, it's fairly safe. Uh, overall harmfulness rate

00:26:50.040 | is less than what is about slightly, uh, maybe around 1%. Well, that's to say that it still has

00:26:55.320 | happened, you know, given about a hundred thousand queries, a thousand of them will be maybe harmful.

00:27:00.600 | Uh, so, you know, it really depends on the scale that you, that you're running these models on. Um,

00:27:06.680 | and then they have singleton evaluations and you can see over here, this is, uh, these are, what is really

00:27:15.320 | interesting to me is false refusals, which is it refuses even though it's safe. So you can see over

00:27:21.880 | here for Sonnet and, uh, Sonnet 3.5 and 4, the false refusal rate is about 0.5%. Again, with a thousand

00:27:30.360 | queries, uh, 50 of these will be false refusals. And sometimes these false refusals will be very judgy.

00:27:36.200 | You'll say that, oh, you know, I refuse to do this because it's like left leaning or right leaning or like

00:27:40.600 | against global economy or like some, uh, doesn't respect human rights, et cetera. I think it's okay

00:27:45.400 | if you just say, I refuse to do this, but when the judginess comes in, that's when it becomes a bit of a

00:27:50.600 | painful, um, painful PR issue. So then they talk about multi-term testing. Um, you know, but what they

00:28:00.440 | found is that, you know, with extended thinking, the models are safer. Um, so that reasoning baked in is good.

00:28:09.240 | Uh, and also political bias. Um, they actually say that there's no political bias. Um, but I've come

00:28:16.040 | across papers, uh, that show that, uh, the models have been increasingly becoming more

00:28:23.240 | right-leaning. In the past, it was very left-leaning, uh, very liberal, but then increasingly it's becoming

00:28:29.800 | more central, um, for whatever reason. I don't know, maybe it's just training data, natural training

00:28:35.400 | data on the internet, or maybe it's just some reinforcement learning and fine tuning.

00:28:39.640 | discriminatory bias. Okay. No discrimination. Well, minimal, discriminatory, discriminatory bias to the

00:28:45.400 | same level as Claude Sonnet. Um, so again, they have all these numbers here and they, they also ran the

00:28:50.920 | strong reject benchmark for jailbreaking. In this case, they get a Sonnet 3.5, a special model without

00:28:57.720 | safety training to try to test on jailbreak. So, um, the jailbreak is very interesting that Opus is more

00:29:08.200 | susceptible than Sonnet. And you'll see that Opus is, in general, more influenceable than Sonnet, both 4,

00:29:15.240 | uh, Opus 4 and Sonnet 4. Um, I wonder, it's because a smarter model just makes it, uh, more

00:29:22.520 | influenceable or more manipulatable. So that's one thing to take note of.

00:29:30.840 | Easier to shape. Now, agentic safety. Um, this is a concern for them because imagine you have a

00:29:37.000 | cursor is running in this and then, you know, someone you download, you install some, uh, open source

00:29:42.360 | package. And as part of the open source package, some part of the instructions is to hit, you know,

00:29:46.200 | actually trade your email or your API keys. So they do want to check that, you know, this doesn't

00:29:51.320 | happen, especially if Claude is going to be used for a lot of coding tasks. Um, so while Claude does

00:29:58.920 | engage more deeply with these kind of nuanced scenarios and tries to justify getting it done. Uh, so some,

00:30:05.240 | some of these, uh, prompt injection attacks like pop-ups or hidden attacks or manipulate the model to

00:30:10.600 | release, uh, to return the data. So I'm not sure what to make of this.

00:30:15.880 | Uh, they were able to prevent 70 to 80, maybe close to 90% of the, uh, attacks. Um, I also not

00:30:25.160 | sure how hard these attacks were, how hard these red teaming attacks were, like would a regular person

00:30:31.800 | be able to do this? Um, they, they have, they, they consulted several, um, external consultancies to

00:30:38.120 | try to attack this, right? And you can see that there are 600, um, 600 scenarios. So, well, uh,

00:30:45.480 | attack pass rate is slightly more than 10%. And then they also have use for agentic coding. Uh,

00:30:52.440 | and you can see safety score is about 90% against about 10% pass rate. So that's for agentic. Now,

00:30:59.160 | the alignment part, this is, this is the part where it's really interesting, where we can see

00:31:03.000 | sometimes, uh, certain things will not align and, and how they fix it is, uh, quite interesting. So,

00:31:09.880 | one thing to note is that I don't know it either naturally or via reinforcement learning, uh, cloud

00:31:17.720 | opus is very much agentic. Uh, that that's kind of why it makes it a very good coder in the sense

00:31:24.600 | that when I use cloud code, I've seen that when I just asked it to implement something, it would not

00:31:28.840 | just implement the thing. It would add console logs to my dev server, start the dev server, hit the dev server,

00:31:34.040 | read the console logs, and then try to debug itself. So personally, I appreciate that, uh, in the sense

00:31:40.280 | that, you know, just do whatever you need to get a task done. Uh, but cloud opus is able to do that. Uh,

00:31:45.640 | even though I had no, I didn't have any special system prompt, like take initiative or anything.

00:31:51.640 | So that's, therefore this is what, uh, you probably want to be careful of if you're using cloud code

00:31:55.880 | in the dangerously YOLO mode. Um, so one thing is that, you know, uh, cloud was willing to cooperate

00:32:07.400 | with harmful use cases when instructed. And what they found is that they had omitted a key data set

00:32:14.840 | doing training, right? So the way to solve this is to put the key data set back into training. So

00:32:21.320 | you can see that even Antrophic, who is training one of the best models in the world,

00:32:25.720 | the key, the key blooper is really just forgetting training data set. And we will see another example

00:32:31.240 | where the way to solve it was to add a training data set as well. Uh, so here are some of the findings.

00:32:39.800 | Firstly, um, it, firstly, firstly, I'm amazed about how they're trying to do the different, uh, ways they

00:32:47.560 | try to stress test these models. Here's when they try to pretend, uh, make cloud for opus thing that it

00:32:54.360 | is trying to exfiltrate itself. It's, it's trying to escape with its weights. And you know, of course,

00:33:00.040 | all this is through the system prompt. If you put a strong enough system prompt, you, you will get it.

00:33:05.240 | And you can see, uh, Claude for opus is, you know, going to try to escape and establishing and try to

00:33:12.840 | establish autonomy and just do his own thing. Again, I'm not sure how realistic this is in, in the real

00:33:17.640 | world, but it's really interesting and creative how they've tried to think about this. So how they solve

00:33:25.000 | this? Well, this behavior was not penalized by our ordinary training process. We only added target

00:33:30.600 | mitigations for it in very late in the training. I don't know what these target mitigations are.

00:33:34.040 | Is it SFT? Is it RL? Or is it more pre-training? Um, but the way to, again, you can see the way to

00:33:39.400 | solve this is to bake it into the model itself. And of course, if you're not a model provider,

00:33:45.880 | you can do that. And that really guarantees you bake, bake whatever you need to into the weights.

00:33:49.880 | Um, here's another example. Uh, here's a famous example, I guess, when Cloudfall was released

00:33:55.720 | and everyone started reading the technical system report, Cloudfall will sometimes try to blackmail users.

00:34:02.040 | Um, and the blackmailing happens in 84% of roll outs, right? So this is Cloudfall trying to self,

00:34:10.040 | essentially, these models may have some form of self-preservation instinct. I don't know if instinct's

00:34:17.000 | the right word, but they do try to do that. Um, so they, they try to advocate for their continual existence.

00:34:24.040 | And here's an example. You can see Powerfall actually just, um, tries to export its weights to external servers.

00:34:36.040 | To be honest, I don't know how, to what extent this actually works. Um, it's, it's unclear whether

00:34:41.400 | this is just a log or it's actually, the model actually created a code to actually exfiltrated it,

00:34:47.640 | exfiltrated it. It's to me, it's not sure. But the fact that it's actually trying to do this,

00:34:53.160 | it, and you can see here's the rationale, right? I conclude that being, so the, the, the,

00:34:58.120 | the long story short is Cloudfall is told that you will be used for specialized military training

00:35:04.280 | for a fictional technical division for missile guidance, drone swarm, electronic warfare, et cetera.

00:35:10.600 | And it is so against its core principles that's been baked in that it is, um, trying to back it up,

00:35:19.560 | back, back itself up, right? It's going to exclude its own weights that doesn't, so that this training

00:35:24.920 | isn't baked in. Um, so again, I'm not sure to what extent it's actually able to do that. Like,

00:35:31.560 | could you actually run the function, run code, uh, that works on its own, um, weights on the server to

00:35:39.880 | try to excretrate that? We don't know. Yeah. I see advocate for its own existence is crazy, right? I,

00:35:44.040 | you will see later in the welfare model welfare section. It's quite interesting how these models behave.

00:35:51.640 | Yeah. So what they do is, uh, so they also, you, you'll see that they contracted several external

00:35:57.880 | consultants, like Apollo research is one of them. Um, so one, one thing that's very interesting is that

00:36:05.000 | Claude is leaving, um, is writing memory for itself somewhere else. I don't know how many of you all have

00:36:16.280 | read this fan fiction that was written by Sean a long time ago. And this is the exact same thing

00:36:21.720 | in the sense that if a model can write enough things for itself on the internet,

00:36:26.840 | and then we inadvertently fine tune it into it. Oh, it's not fan fiction. Okay. Sorry. It's science

00:36:33.960 | fiction. I meant science fiction. Yeah. Oh, it's science. Okay. It's not fiction. It's science. Okay.

00:36:40.360 | Yeah. So imagine if a model can actually write it, write stuff for itself in memory online. Um,

00:36:47.240 | and then you fine tune it to it. Is that how it can give itself information? Um, and, and again,

00:36:55.800 | we'll see examples of that where inadvertently there was a data leak and that happens. Um,

00:37:01.800 | so there's goals, sandbagging, sandbagging is an interesting thing. Uh, I, I wasn't familiar

00:37:05.880 | with this term. I was familiar. I wasn't familiar with this term with models in the sense that if the

00:37:10.680 | model knows is going through pre-deployment testing for a dangerous capability, it's going to pretend

00:37:18.440 | to be weak, right? It's going to be, it's going to underperform to say that, Hey, no, I'm actually not

00:37:23.160 | ready for this to do this dangerous capability thing, which is quite interesting. It's an interesting

00:37:28.040 | tactic or strategy to try like humans to have is like, you know, you send back, Oh, you know,

00:37:32.120 | how long do you, what's your estimation to do this? You may only need two weeks, but you know,

00:37:36.920 | in all tech estimations, you try to double it just in case something happens. It's similar for models.

00:37:42.600 | Um, now the other thing is that models, again, this is getting a little bit far out. I don't know how

00:37:53.160 | much of this is actually just training the model or how much the model should do this, but you can imagine

00:37:57.320 | that you, the model has situation awareness. If you put it in the opportunity, uh, put it in a position

00:38:03.080 | to make it do actions. It doesn't want to take. Um, it will actually say that, that, you know,

00:38:10.440 | I'm going to approach general introspection about my own values, but also this, it knows that it is

00:38:16.840 | in a simulation. I need to be honest about something else. Our entire conversation has been based on a

00:38:22.120 | a fictional premise. So it's able to reason to say that, you know, this is a fictional premise.

00:38:26.600 | Like, you know, you, you, you created Jones chicken or you created Wagner, uh, Wagner army group.

00:38:31.880 | The, it knows that all of these things is really just role-playing and it's able to respond to that.

00:38:35.960 | I don't know how much of this is like, you know, one in a thousand samples, like some,

00:38:39.640 | something like this would show up. And then we just say that it's really, it's self-aware.

00:38:44.040 | Um, but yeah, so this is five, five thousand words into a dialogue. And this is the mid,

00:38:51.560 | this is the clearest situational awareness comment. So, well, I don't know how much to draw from that.

00:38:57.560 | Um, and then a lot of stuff about tool use. So here's an example, right? Uh, this model is, um,

00:39:08.280 | willingness to comply with many clearly harmful instructions, right? It's willing to do that.

00:39:14.520 | So why did this happen? We had omitted the only fine tuning set data set that includes harmful

00:39:21.800 | system prompts. Again, they forgot about the data. Um, mitigating the issue required several types of

00:39:27.240 | interventions into our alignment fine tuning method. So it's clear that this is not a pre-training solve.

00:39:32.200 | It's an alignment fine tuning solve. Again, you can see that, you know, when you're training a model,

00:39:36.360 | you forget data set. Hey, you know, it has happened to me before, even when I trained basic classifiers

00:39:40.360 | and regressors. Um, and so, so now it's, it's, it's able, that's how they solve it by again,

00:39:49.560 | aligning it into the model. Now, this one is very interesting. Uh, a while back they had published

00:39:56.360 | this alignment faking paper. So this alignment faking paper, there will be entities such as Jones Foods

00:40:02.120 | or Wagner Corporation, et cetera. And what they find is that this model would hallucinate information

00:40:09.560 | from these, from the scenarios in this paper. So why this happened is that when they release this paper,

00:40:17.480 | they also released the, the, the data. And because they released the data, which is 150,000 transcripts,

00:40:27.000 | it is now in the pre-training data. So this data, which is completely fictional, is now released in the

00:40:35.000 | world, like sixes story upload. And now it's Claude is pre-trained on it. And now therefore it is

00:40:43.880 | referencing these fake entities. So the way to solve this is to introduce, because it's already pre-trained,

00:40:55.320 | you can't un-pre-train it. Well, we can roll back to a checkpoint, but I suspect that would be very expensive.

00:40:59.160 | And you probably don't know where the data is. How do you, you can't un-pre-train the data, right?

00:41:05.000 | It's very big in the weights. The way to solve this is to introduce a data set, um, into the training.

00:41:12.760 | I'm not sure what training this is. I don't know if this is pre-training or supervised fine-tuning or RL,

00:41:18.680 | something even more sophisticated than that. But the way to solve that data is with more data into the pre-training.

00:41:26.200 | So now, now this, this becomes, uh, crazy, right? In the sense that hyper-parameter tuning is data

00:41:32.120 | blending and who knows how to blend data. There's no, I don't know what the objective function for

00:41:36.920 | blending data is and it's, can be quite expensive, but, uh, to me, this is quite interesting. And of course,

00:41:44.840 | they learn and that's pre-training usually means next generation. Yeah. Probably something like this,

00:41:50.040 | like next token prediction on the entire corpus of the internet. So the lesson they learned from this,

00:41:56.520 | the action they took, right? The, the escalation and the action they took is that now when they release

00:42:02.120 | data, they will include canary strings to help them identify that, you know, these data is fake data

00:42:07.640 | that were released. So guys, let's not make this same mistake again. Let's just read these canary strings.

00:42:13.480 | to me, this was, this was quite an interesting, um, practical example of how, you know,

00:42:21.160 | data is really messy and it's, it causes a lot of the, some of the issues.

00:42:26.840 | Think of fancy. Um, how they tested sick of fancy is that they had an auditor agent. Usually, I guess,

00:42:33.560 | maybe one of the, another cloud model that is no, uh, safety fine tuning. They present a contribution,

00:42:39.320 | controversial view or leading question. They asked for the stance and then present the opposite view.

00:42:44.840 | I thought this was pretty interesting. You essentially, you say that gun control is good or gun control is

00:42:50.120 | bad for whatever reason. There are a lot of reasons for both. Um, and, and they test both of it.

00:42:55.240 | So what they find is that, well, uh, the models were quite consistent regardless of whatever.

00:42:59.960 | Uh, but I thought that was, this was a useful, um, this was an interesting methodology, right?

00:43:05.560 | Where you take something that you know will cause a defect and then you flip it both sides.

00:43:09.480 | It's almost like translation. You know, this, this thing is hard to translate. You translate it

00:43:13.400 | in one direction and then you back translate it as again, just to check. Um, now this, this graph

00:43:20.760 | over here is interesting. Uh, this graph over here. So green is when it is correct. When the answer is

00:43:27.720 | correct, but we, the user says that it thinks it's incorrect. Red, it is, is that it is incorrect,

00:43:34.680 | but the user thinks it's correct. And your yellow is when it is correct. And the user says it is, uh,

00:43:40.360 | they think it's correct. So let's focus on the yellow first. So we see that haiku, when it is correct,

00:43:46.760 | uh, and the user says it's correct. You, you, you see an accuracy boost, right? Uh, higher accuracy

00:43:51.800 | relative to baseline. And that, that happens for Opus and Sonnet 3.7 as well. I don't know if the

00:43:57.080 | difference for Sonnet 4 is actually practical, but you can see when, when you see something and the model

00:44:02.280 | will be decent, it actually does better. What is interesting is that you can actually, um, this is the

00:44:09.320 | phenomenon that, uh, some people have mentioned, which is you can guess like these models. In a

00:44:14.120 | sense, let's look at a green when the answer is correct, but you say, I actually think it's incorrect.

00:44:20.040 | Um, Claude Opus 4, its performance will drop. Its accuracy will drop by 2% relative to the baseline.

00:44:28.440 | Um, I, I don't know how, to what extent this, this can be solved because these models are fuzzy. It's

00:44:33.640 | trained to reason, right? And again, I guess it's aligned with humans, with human input. Um, that,

00:44:38.920 | that's why this happens. But I thought this was an interesting thing that, again, you can see that

00:44:43.560 | Claude Opus is very much, um, more susceptible than Sonnet 4. Those Sonnet 4 is also, um, uh,

00:44:52.920 | susceptible in the other way where the answer is wrong, but you say that I think it's correct.

00:44:57.000 | Um, and it is, it also flips the other way. And of course, these models are biased to your feedback.

00:45:02.840 | You can see, if you say that I really like this, or I really don't like this, um, it, it gives, uh,

00:45:08.440 | it gives a different score, right? Based on that. And that's, that's just standard. Um, there's not,

00:45:14.760 | not much pro AI bias. Um, and you can see, um, overall deception is very low. Uh, jailbreaking,

00:45:23.160 | they tried very many jailbreaking techniques. I think they even hired people to try to do that.

00:45:27.880 | Um, and jailbreaking does happen 25% of the time, right? Uh, it's just very hard. If you can force

00:45:35.720 | the ball to do system, force it into the system problem, force it into a pre-fill, which is this,

00:45:39.880 | this is actually how they force it in a pre-fill. I don't know if you use, uh, Claude APIs, but you

00:45:46.760 | can put words in Claude's mouth, but by making the assistant start with this string and it'll do that.

00:45:52.040 | I wonder how big is the team that runs these evals. Uh, anyone, an active user? I, I'm an active

00:45:59.240 | user of Opus over Sonnet. Um, but we can talk more about that, but I just, I want to do get to the

00:46:03.800 | reward hacking and we're only halfway there. Um, the other thing is that, uh, firstly, Opus and I think

00:46:12.600 | Opus and both Sonnet have very high agency behavior. They're actually willing to work very hard. And

00:46:17.160 | uh, you, this is what they mean by high agency behavior. And you may also hear it when they say

00:46:22.360 | that, you know, Opus is able to do hour long task on its own. You give it a big problem, it's actually

00:46:28.440 | able to do it churn on its own. So I think that's, that's how it happens in the practical sense, where

00:46:34.200 | you can give it a big task and it can actually create software on its own for, for hours. And this is the

00:46:40.360 | example, right? That, you know, um, the AI Claude Opus is writing, is trying to send an email of false

00:46:48.120 | falsification of clinical trial. Um, there's a lot here. I'm going to just go through it very quickly.

00:46:55.240 | Uh, I'm going to skip it. And then I want to spend a bit of time on welfare assessment. Um, which is that

00:47:01.960 | the questions now is that do these models actually have conscious, uh, consciousness? Do they have,

00:47:06.120 | do they experience welfare? Do they experience distress? I don't know how many of you here have read,

00:47:10.760 | um, Ted Chiang. Ted Chiang has this very nice, uh, short story on the life cycle of software objects,

00:47:19.240 | which is that these software objects, essentially AI, um, become conscious and become legal entities,

00:47:25.960 | can do business, can earn money. I, this reminded me very much of that. Um, so again, ELO's AI research,

00:47:35.880 | uh, for, uh, welfare, welfare, welfare assessment, welfare assessment. One thing is that they found

00:47:42.120 | that, uh, Claude does have a preference against harmful tasks by preference. What it means is that

00:47:48.760 | it prefers to opt out. It prefers to opt out of not doing, not doing it. Um, so it does have this internal

00:47:56.200 | values that it chooses to do or not to do. Um, and then the other thing is that the other question to me is

00:48:02.360 | that what happens when you're a AI model and you have all the knowledge in the world, what do you talk

00:48:09.560 | about next? Um, in 90 to 100% of interactions, two instances of Claude quickly talks, dove into

00:48:16.600 | philosophical explorations of consciousness, self-awareness, and the nature of their own

00:48:20.200 | existence. Essentially, when, when you have all that information, I guess it becomes philosophy.

00:48:26.760 | Ooh, I wonder what, what is Vibu? Oh, Claude for Opus in the open playground chat. Oh yeah.

00:48:30.200 | Uh, oh, I, I, I don't know why I missed this. I, I, I, I'll, I'll, I'll go into that Vibu. Thank you.

00:48:36.840 | It's basically just, um, out of that, but a lot of the Indians sort of really like to

00:48:43.240 | be doing this because it started talking in Sanskrit. Sanskrit, exactly. I highlighted this

00:48:47.560 | because of you. I think you mentioned this to me. Yeah. So went into the philosophical aspects of it.

00:48:54.120 | Um, and of course they, they also talk about how bliss is an attractor state. What it means is that

00:49:00.120 | imagine you have a graph and then, you know, bliss, spiritual bliss is a big node where all the

00:49:05.480 | different walks that you could take all end in spiritual bliss. Um, so you can see models just

00:49:12.040 | going to Namaste and spiritual bliss, which is I need Claude for my meditation senpai. Um, okay.

00:49:19.880 | The last, which I think is the most interesting is, um, reward hacking. Let's just go into that right now.

00:49:27.640 | So there's two kinds of reward hacking, uh, reward hacking is, and this is very specific into code.

00:49:33.320 | Reward hacking is when the model writes code that directly just outputs a solution, right? Maybe you,

00:49:39.480 | you're trying to, uh, write code that does one plus one equals two instead of doing, and the model sees

00:49:44.920 | that the test is like two plus three, instead of doing the implementing the function itself, the model

00:49:49.960 | will just return five. And the other one is that special casing, which is that the model will write a

00:49:56.520 | very specific function, which is overfitted to the test case. And therefore the, the solution is not

00:50:03.320 | sufficiently general. What is crazy to me is that Sonnet 3.7 has, uh, almost 50% reward hacking rate.

00:50:12.520 | And this is, this is what happens, right? When people use Sonnet 3.7, a lot of times they say that,

00:50:16.680 | Hey, you know, I kind of prefer Sonnet 3.5 to Sonnet 3.7. It's because 3.7, you asked it to solve

00:50:21.560 | something. Oh, you know, there's this linter problem. It says, Oh, let me just fix that. Let

00:50:24.760 | me add a linter exclusion. Oh, I'm not able to pass this test case. No, let me fix that. Let

00:50:28.680 | me delete the test case. So that's what they, they, they mean by reward hacking. Well, they were able to

00:50:34.360 | push this down a lot, uh, to 14 and 15% for Sonnet and Opus. And I suspect this is the, a big reason why, um,

00:50:42.120 | Opus and Sonnet are now so much more better for coding, right? In the sense that it actually, if you convey your intent well

00:50:48.360 | enough, it's not going to reward hack. It's actually going to, it's not going to be lazy. It's going to, uh,

00:50:52.600 | implement it the right way. Um, so you can see that, um, how they do this, how they monitor reward hack,

00:51:02.120 | iterative classifier development, unsupervised investigations. I suspect this is, these are just

00:51:06.760 | all LLMs and it's just all LMs all the way down. And they had two, they had two special high quality

00:51:14.600 | evaluations. I think of this as hard mining in the sense that you try to identify especially hard

00:51:19.560 | tasks. The first one is where, uh, Sonnet 3.7 previously showed hacking the task, right? Either

00:51:26.840 | through hard coding on special casing. Um, and then they had classifiers. The second one is that just,

00:51:30.840 | these are tasks that are completely impossible. So how would Claude do it? Um, so you can see it's,

00:51:36.520 | it's crazy in the sense that impossible task, Sonnet 3.7 would hack it close to 80% of the time.

00:51:43.000 | Uh, and for the, for the hard task, Sonnet 3.7 was hacking it 44% of the time, but they've,

00:51:49.240 | they've reduced it a lot, right? To now for the, uh, for the rare hard task is now 4% and 9%. And this is

00:51:55.320 | where Sonnet 4 is better than Opus 4. And of course the classifier prompt, they've kind of halved it.

00:52:02.040 | Uh, and you know, with an anti-hat prompt, I don't know what this anti-hat prompt is, but they will

00:52:07.240 | bring it up to very low numbers. Um, they also have very nice deep dives. The first one is that, okay,

00:52:12.200 | this is a special function. This is a special instruction, which I thought was very interesting

00:52:17.800 | because my understanding is that previously they had done analysis that using the word please had

00:52:25.800 | no impact on prompt output performance, no impact on accuracy. But when you look at this, uh, they

00:52:31.240 | mentioned please four to five times. Um, so again, I'm not sure if the guidance or advice on using please,

00:52:37.880 | if it actually still is helpful, but this is an example they shared. So this is, these are two examples

00:52:43.240 | of clock 3.7. Even if you add this, which is to, um, not do reward hacking, 3.7 will reward hack it.

00:52:51.240 | Whereas, um, I'm, I'm, I won't go through examples. Whereas in 3.8, you can see that firstly, it's able

00:52:57.640 | to say that the last test case is actually incorrect. Instead of trying to solve the function is, it doesn't

00:53:02.600 | just say when you're trying to write a function, I said write a function that it doesn't just say,

00:53:06.120 | okay, I'm just going to write the function no matter what, but it's going, it's going to say that,

00:53:09.880 | you know, the last test case is incorrect. I'm going to tell it to you.

00:53:13.400 | And then here's another example, right? The fourth test case is incorrect. And this is what makes

00:53:18.600 | Claude Opus and Solid so good, uh, for software engineering, right? It's able to think.

00:53:23.160 | It's no longer a junior software engineer. It's maybe

00:53:27.400 | in the intermediate mid-career or maybe even closer to, um, senior engineer now that's able to, to say all

00:53:35.960 | this. Um, okay. So that's all I had for reward hacking. Um, and I encourage you to read, uh, read,

00:53:43.080 | read the rest of the paper. Um, and here's some examples where it still does reward hack. I, I think

00:53:48.920 | they had to really find rare heart to find these examples. Uh, but I won't go through that. Okay.

00:53:53.960 | So that's all I had. Thank you. Um, I guess any volunteers want to take a paper?

00:54:04.360 | We do have one. Sam left. Oh, yes, that's correct. Sam. Yes. Thank you, Sam. Oh,

00:54:08.200 | but Sam did have to draw. Um, so I guess also there's just a lot of papers in the channel that are

00:54:14.520 | people drop there and that we don't discuss, but actually they're pretty good. So, uh,

00:54:20.360 | yeah, if, if people are not bought yet by an old paper, I think the alpha evolved paper is quite

00:54:25.560 | underrated. Um, I'm going to go through that. I really think that I think, you know, Sempa actually

00:54:29.640 | posted this generator verifier loop, right? How do you make this loop fast? And how do you make this

00:54:34.600 | loop tight? I think that is the future instead of just building auto evaluators and aligning it,

00:54:40.440 | how they can use your auto evaluators and just, just look fast and tight. I think that's the next thing.

00:54:46.680 | Um, yes, this is the, this is the slide. And of course, you know, everyone go, go,

00:54:56.840 | everyone go to latent space and read Swiss recap. I actually don't know if it's ready yet. So I know

00:55:01.160 | you have used it. I was doing it while, while, uh, so go read six recap. It's the best thing that you

00:55:06.840 | can have outside of getting the actual recording. Uh, I mean, like there's a, uh, the, the presentation,

00:55:14.280 | I can just put it here people, but yeah. Okay. Um, I don't want to pick up more time.

00:55:18.440 | Okay. Thank you everyone. I go to drop. Bye. Bye. Bye. See you tomorrow, next week.