Anthropic: Circuit Tracing + On the Biology of a Large Language Model

00:00:00.000 |
So I presented the other three Anthropic SAE kind of Macinturb papers. I think I'll just share 00:00:08.460 |
slides after this. So if people want to go through them, go through them. But basically the very, 00:00:13.160 |
very first one that they did was like, okay, forget LLMs. Let's just see if we can interpret 00:00:18.420 |
basic stuff in a transformer. So let's just pull it up. Anthropic SAE. So the first one was basically 00:00:28.860 |
let's take a toy model. I think it was just like, you know, three layers. Like they had a basic 00:00:33.800 |
encoder and then, you know, the sparse auto encoder. And they just trained a toy model, 00:00:39.280 |
like a couple hundred million parameters that input output, and they have this middle layer, 00:00:43.920 |
which was just an encoder. Can we start to interpret what's going on in that encoder? 00:00:48.180 |
Turns out, yeah, they can find some features. After that, they had like, or that was just 00:00:54.260 |
encoders. Then they started to make it sparse. This was kind of the big one that became pretty 00:00:58.680 |
popular. I think this came out in May. We covered it shortly after. They applied this SAE work to 00:01:05.600 |
Claude Sonnet. They found out, oh shit, we can find out features that happen in the model. So they 00:01:11.620 |
basically train a sparse auto encoder to match these inputs and outputs. And then they start to 00:01:17.340 |
interpret them and map out features. High level TLDR is we can now map features that activate in this 00:01:26.560 |
sparse auto encoder. So the whole thing is you train an encoder to stay sparse. You only want 00:01:31.460 |
a very few number of features to activate for outputs. In this case, they found out stuff like 00:01:36.780 |
Golden Gate Claude. There's specific features that they trained in their little auto encoder 00:01:41.760 |
that, you know, activate when specific topics come up. So they had a feature that would always fire up for 00:01:48.480 |
words like Golden Gate. They had stuff for like tourism, for infrastructure. There were features 00:01:53.960 |
that extended throughout like multiple concepts. So, you know, it's not just one feature to one thing. 00:01:59.920 |
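To make the SAE idea concrete, here is a minimal sketch (my own illustrative PyTorch, not Anthropic's code) of a sparse autoencoder over model activations: a wide encoder/decoder pair trained with a reconstruction term plus an L1 sparsity penalty so that only a handful of features fire for any given input.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: map activations into a wide, sparse feature space and back."""
    def __init__(self, d_model=4096, n_features=65536):  # the real runs used 1M-34M features
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # mostly-zero feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    recon = (reconstruction - activations).pow(2).mean()  # rebuild the original activations
    sparsity = features.abs().sum(dim=-1).mean()          # push most features to zero
    return recon + l1_coeff * sparsity
```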
But yeah, they have a pretty good long blog post on this. They started grouping them. They had different 00:02:05.840 |
sizes. So they had a one mil, four mil, 34 mil size auto encoder. From there, you know, it's been a few 00:02:15.140 |
months and now they're like, okay, no more SAEs. Let's do circuit tracing. So basically SAEs were 00:02:21.120 |
good, but we, we kind of had a holistic understanding, right? You can apply an SAE for every layer and 00:02:26.420 |
try to understand what happens in layers, or you can apply it for like, you know, just specific parts of 00:02:32.280 |
the model. You can do it on attention blocks and you can try to interpret what parts of them are firing up, 00:02:37.180 |
but this is where they started to come in with circuit tracing. So circuit tracing is where you 00:02:42.580 |
actually train a transcoder model to mimic the input model and you can do this across layers. So 00:02:47.900 |
this model actually, it matches the input model layer by layer and it, you know, it maps out what's going 00:02:57.040 |
on. I'll be right back in a second. I was stuck. Sorry. Doggo is crying. Okay. So circuit tracing, 00:03:07.840 |
this came out a few weeks ago, and this is kind of the high level overview of what they're doing here. 00:03:11.620 |
Basically, they, they train this cross-layer transcoder and they start poking around in Claude 3.5 Haiku, 00:03:18.940 |
and they start to find features that are consistent throughout layers. Hey Siri, stop so many 00:03:25.580 |
notifications. Um, so some of the interesting stuff here is they try to see like, can models internally 00:03:31.580 |
think like when they answer questions, like things that take two different steps, does the model start 00:03:37.740 |
to think through its response in advance, or is it just, you know, token predicting? And they, they find 00:03:42.300 |
interesting little case studies where actually the model is doing some thinking. So the first main example 00:03:48.060 |
that they show here is like, um, there's this prompt of what is the capital of the state that contains, like, 00:03:58.460 |
Dallas, and you're supposed to say Austin, right? So this is kind of a question that has two steps of thinking, 00:04:03.740 |
right? There's two levels of reasoning. First step is you have to think, um, what is the state that the city is 00:04:10.300 |
in and then what's the capital of that state? So they kind of go through how they do all this, but let's, 00:04:17.260 |
let's, let's start off by talking about this previous, um, previous paper that came out like a week ago about circuit 00:04:23.980 |
tracing. So circuit tracing is where they, they train this, um, transcoder model to replicate the exact input 00:04:31.900 |
model. And then they start to do these attribution graphs to figure out what happened. So, um, high level, 00:04:39.020 |
here's kind of an overview of what, oops, of what people have done in previous mech interp work. Um, 00:04:45.740 |
um, we had transcoders, transcoders were, you know, alternatives to SAE that let us do replacement models. 00:04:53.740 |
Then we have this cross layer transcoder, which is let's do transcoders that go throughout different 00:04:58.860 |
model layers. Then we have attribution graphs and linear attribution between features. Um, they prune out the 00:05:06.380 |
the ones that are not that relevant. They have a little fill in, we'll go a little quick through 00:05:10.860 |
this since it's kind of a second, uh, paper, but they did have a little overview that I thought was 00:05:16.380 |
interesting here. Okay. Uh, to begin: building an interpretable replacement model. So this is kind 00:05:24.380 |
of the architecture of what this model is. So once again, they're going to create an entire model, 00:05:30.860 |
call it a local replacement model that matches the exact, um, that matches the number of layers for 00:05:37.180 |
the original transformer. So they, they train two of these. And so they start to give some statistics 00:05:42.060 |
of what it would be like to train another one. And I think they talk about how, how much compute 00:05:46.940 |
this requires on like Gemma 2B and like a 9B model. But essentially what they're doing here is they take 00:05:52.540 |
a model, they look at the architecture and they freeze the attention. And basically they replace 00:05:58.540 |
this MLP. So the feed forward layers, they replace the MLP with this cross layer transcoder, 00:06:04.220 |
and then they can start to make this sparse and have features that we can interpret from it. 00:06:08.940 |
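Here is a rough sketch of what that replacement looks like structurally, under my own simplified assumptions (class and method names invented for illustration; the paper's actual formulation has more detail, including error nodes): each layer gets an encoder into a sparse feature space, and features written at layer i are allowed to decode into the MLP output of every later layer, which is what makes it a cross-layer transcoder.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch of a cross-layer transcoder standing in for the frozen model's MLPs."""
    def __init__(self, n_layers, d_model, n_features):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d_model, n_features) for _ in range(n_layers))
        # decoders[i][k] lets features encoded at layer i write into layer (i + k)'s MLP output
        self.decoders = nn.ModuleList(
            nn.ModuleList(nn.Linear(n_features, d_model) for _ in range(n_layers - i))
            for i in range(n_layers)
        )

    def encode(self, layer_idx, residual):
        # Sparse feature activations for this layer's residual-stream input.
        return torch.relu(self.encoders[layer_idx](residual))

    def decode(self, target_layer, features_per_layer):
        # MLP-output stand-in: sum contributions from this and every earlier layer's features.
        out = 0.0
        for i in range(target_layer + 1):
            out = out + self.decoders[i][target_layer - i](features_per_layer[i])
        return out
```

In the replacement model, the original attention blocks run frozen, and `decode(layer, ...)` is substituted wherever the original MLP output would have been added to the residual stream.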
So a bunch of math that if you're interested is pretty straightforward, actually, it's just, 00:06:13.900 |
it's just a straight replacement. It's trained to match the exact input output. Um, so here's a cool 00:06:19.580 |
little diagram. Basically you have different layers in a transformer, right? This is an original transformer 00:06:24.620 |
model. You have attention blocks and you have the MLP, right? So throughout different layers, 00:06:29.660 |
there's attention, then there's feed forward, attention and feed forward. And then eventually 00:06:33.180 |
you have output, you pick the most probable token and you know, that's your output. So in the 00:06:38.060 |
replacement model, instead of these MLP feed forward networks, they're replacing them with these cross 00:06:44.540 |
layer transcoders. These cross layer transcoders speak to each other and we start to interpret, 00:06:49.820 |
you know, we want to keep them sparse. So there's a sparsity factor. So only a few features activate, 00:06:54.460 |
and then we map those to something interpretable. Um, this blog post is actually very long, but 00:07:00.220 |
that's how they make this local replacement model. Um, Ted, you have a question? 00:07:04.460 |
Not a question, but can, is it okay if I add a little bit of color here? Yeah. Um, 00:07:11.100 |
so, so one of the things is the early research along this very same direction on CNNs didn't require any of 00:07:19.180 |
this stuff. And the reason is because, um, uh, the conventional wisdom now is that the number of 00:07:25.180 |
things that, that people wanted to represent in a CNN was approximately equal to the number of filters, 00:07:32.300 |
the number of neurons, uh, uh, layers and such that you have in a CNN. So CNN wants to find vertical 00:07:38.860 |
lines, horizontal lines, diagonal lines, and then in the higher layers, triangles, circles, squares, 00:07:43.660 |
and then eventually faces, arms, that kind of stuff. And you have approximately as many things in your network 00:07:49.900 |
as you do concepts that you're trying to represent. So if all the data lives in, in essentially a vector 00:07:58.380 |
space, if you guys remember your linear algebra, then everything can be represented as an orthogonal direction. 00:08:04.540 |
And there's this linear representation hypothesis that says that information is encoded in a direction, 00:08:11.260 |
not in a magnitude, just in a direction. And if you have a small number of concepts, they can all be completely 00:08:17.660 |
orthogonal. And if you take the dot product of a vector with any of your concepts, there will be no interference 00:08:25.100 |
between concepts because they're all orthogonal to each other. So if one is due east and one is due north, 00:08:31.180 |
and you, you dot something with a canonical north vector to see how, whether or not north is present, 00:08:37.500 |
whether you add more east-west or not changes nothing about the dot product in your north direction. 00:08:43.580 |
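A tiny numerical version of Ted's point, with toy numbers of my own: when concepts get orthogonal directions, reading one out with a dot product ignores the others; once you have more concepts than dimensions, some directions must be shared and the readouts interfere unless the concepts rarely co-occur.

```python
import numpy as np

north = np.array([0.0, 1.0])   # "basketball"
east = np.array([1.0, 0.0])    # "Eiffel Tower"

x = 2.0 * east + 3.0 * north   # an activation containing both concepts
print(x @ north)               # 3.0: adding more east/west never changes the north readout

# A third concept has to reuse directions in a 2-D space:
northeast = (north + east) / np.sqrt(2)   # "ethylene glycol"
print(east @ northeast)        # ~0.707: the Eiffel Tower now bleeds into that readout
```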
The problem is when we get to LLMs, um, all the operations are additions. There's no rotations in the 00:08:51.260 |
linear representation hypothesis. So what you have to do is you have to sort of, if you have something east and 00:08:56.540 |
you want to add north, you have to sort of like add a lot of north to make sure that you get north-northeast 00:09:02.860 |
enough that your dot product with north is, is not close to zero anymore. So the problem with LLMs is that we 00:09:10.780 |
think that there are hundreds of millions, if not billions of concepts that an LLM needs to understand. 00:09:17.420 |
And there are not enough neurons in the LLM to uniquely, or sorry, there's not enough space in the 00:09:25.020 |
residual stream to uniquely represent all of these concepts. So you might have, um, a model dimension 00:09:30.700 |
that's what, 16,000, 30,000, some, somewhere around there, right? In a big model. That's not nearly enough 00:09:37.660 |
to represent hundreds of millions or billions of concepts, each with orthogonal directions. So then 00:09:43.900 |
ultimately what ends up happening is the model takes advantage of sparsity and it says, well, if I 00:09:49.740 |
represented basketball as north and the Eiffel Tower as east, and I represented ethylene glycol as 00:09:58.220 |
northeast, the odds that we're going to have the Eiffel Tower and ethylene glycol in the same sentence 00:10:04.460 |
are pretty small, same paragraph, same sentence, whatever. Uh, so that if I take the dot product 00:10:11.260 |
against northeast, if either the Eiffel Tower or basketball shows up, I'm screwed, but the odds 00:10:16.860 |
of them actually showing up at the same time are really small. Okay. So then the, so then that's the 00:10:23.180 |
reason why you need an SAE or, um, in this case, a transcoder, uh, because you have more concepts than you 00:10:31.900 |
have, uh, dimensions, uh, that you can just straight up analyze. And so the, the, the cross coder has, 00:10:39.180 |
uh, a sparsity penalty, uh, akin to an L1 loss if you're familiar with lasso 00:10:48.060 |
regression. Uh, and that's what encourages it to represent each of these different concepts as a unique 00:10:58.380 |
column as a unique neuron in the matrix, as it were, um, instead of the current representation, 00:11:04.620 |
they're all just sort of jammed in there. Yeah. Um, basically when they train this, 00:11:09.260 |
there's, there's two things that they train on. They use a sparsity penalty, which is, you know, 00:11:13.260 |
if you've seen the other SAE work, uh, that enforces it to stay sparse. So, you know, single activations 00:11:18.940 |
for concepts, and then a reconstruction loss. The reconstruction loss is so that at inference 00:11:24.540 |
time, instead of actually running, like, inference through Haiku, we run inference of a prompt through 00:11:31.980 |
our CLT model. So our local replacement model, it has the exact same output as Haiku or whatever 00:11:38.780 |
you're training it on. So this toy model that we've trained exactly kind of one-to-one matches. 00:11:44.300 |
Of course there's some degradation, but you know, it's trained with reconstruction loss. So it's trained 00:11:48.620 |
to match the exact output of the big model that you trained on. So technically, you know, you should be 00:11:53.980 |
able to swap it in directly. And a lot of this works because, you know, you're freezing the attention 00:11:58.380 |
layers and you're specifically training it on a loss to recreate the inputs. And from there, 00:12:04.140 |
that's where we have this model that now has these sparse features. But, um, yeah, thanks for that 00:12:09.260 |
overview, Ted. It's, it's a little bit better for the math explanation of what's going on here, 00:12:14.220 |
but, um, continuing through this, here's kind of what happened. So they have this, uh, reconstruction 00:12:20.220 |
error. These are error nodes that happened between the original output and the replacement model. 00:12:25.340 |
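Putting the two objectives together, here is a hedged sketch of a training step for the cross-layer transcoder sketched earlier (the `record_mlp_io` helper and the coefficient are made up for illustration; the real setup also learns the error nodes just mentioned, which absorb whatever the transcoder fails to reconstruct):

```python
import torch

def clt_training_loss(clt, frozen_model, tokens, sparsity_coeff=1e-3):
    """Sketch: reconstruct each layer's MLP output while keeping feature activations sparse."""
    with torch.no_grad():
        # Hypothetical helper: run the original model once and record, per layer,
        # the MLP inputs (residual stream) and the MLP outputs it actually produced.
        residuals, mlp_outputs = frozen_model.record_mlp_io(tokens)

    features = [clt.encode(i, residuals[i]) for i in range(len(residuals))]

    recon_loss, sparsity_loss = 0.0, 0.0
    for layer in range(len(residuals)):
        pred = clt.decode(layer, features)                         # CLT's stand-in for this MLP
        recon_loss = recon_loss + (pred - mlp_outputs[layer]).pow(2).mean()
        sparsity_loss = sparsity_loss + features[layer].abs().sum(dim=-1).mean()

    return recon_loss + sparsity_coeff * sparsity_loss
```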
Then they start to prune features that aren't relevant. So since the model is sparse, right, there's only a few 00:12:30.380 |
features per token that actually activate at a given layer. So this is layer wise activation, 00:12:35.820 |
right? This is our local replacement model. So for example, for the first layer here, uh, these three 00:12:42.140 |
features activated, and for this one, these three, and for this one, these two. For these, they look through 00:12:47.500 |
the traversal of what activated and what influenced the final output. And then they start to prune, 00:12:52.780 |
I think 95% of the ones that didn't have an effect on the output. And now we can see, okay, 00:12:58.220 |
what neurons, what kind of activation features impact the output. And there, from there, we can start to, 00:13:05.020 |
you know, generate these attribution graphs. Attribution graphs kind of combine these concepts. 00:13:10.300 |
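A hedged sketch of that pruning step: score each node by how much influence it has on the chosen output node through the graph's linear attribution edges, keep roughly the top 5 percent, and throw the rest away. The graph library and the scoring shortcut are my own choices here, not the paper's exact math.

```python
import networkx as nx

def prune_attribution_graph(graph: nx.DiGraph, output_node, keep_fraction=0.05):
    """Keep only nodes with meaningful (indirect) influence on the output node."""
    influence = {output_node: 1.0}
    # Walk nodes from outputs back toward inputs, accumulating influence along edges.
    for node in reversed(list(nx.topological_sort(graph))):
        if node == output_node:
            continue
        influence[node] = sum(
            abs(graph.edges[node, succ]["weight"]) * influence.get(succ, 0.0)
            for succ in graph.successors(node)
        )
    ranked = sorted(influence, key=influence.get, reverse=True)
    keep = set(ranked[: max(1, int(len(ranked) * keep_fraction))]) | {output_node}
    return graph.subgraph(keep).copy()
```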
So for these two, for these hierarchical, um, categories, once we cluster them and, you 00:13:15.740 |
know, add them on top of each other, what do they represent? So we can see what different features make 00:13:21.740 |
up, um, these different tokens. So I didn't find this one to be the most, um, you know, interpretable 00:13:27.980 |
because it's on a token split, but they have a lot of these features for different, um, different 00:13:33.020 |
concepts, right? So for example, for the word digital here, if we look at it, it's starting to activate 00:13:38.060 |
once there's words like smartphones, television companies, there's another feature that takes 00:13:42.700 |
it in a different representation, right? So, um, in this one, there's digital suicide, 00:13:48.300 |
there's color image, you know, this is like a bit of a different understanding of the word digital. 00:13:52.460 |
In this one, there's tech director, right? There's a DVD, which is digital. 00:13:56.620 |
In this case, there's, um, mobile devices, same thing for analytics. So web analytics, commercial analytics, 00:14:05.580 |
this feature talks about data, quantitative assessments, all, all different features that, 00:14:11.100 |
you know, all different features that represent analytics in different, in different, um, domains. 00:14:17.660 |
So in this case, there's, um, let's see which other ones make sense. So performance metrics are a way to 00:14:22.860 |
analyze, to represent analytics, routines or analytics. Um, but yeah, they kind of start to group these 00:14:29.420 |
features into these different things. Then it comes to, uh, how they construct it. Basically, 00:14:35.980 |
they have output nodes that are output tokens, and then they prune the ones that don't, um, 00:14:41.420 |
really contribute anything. There's input and output nodes as well. And then we kind of have this whole 00:14:46.700 |
interactive chart where you can play around with it. Um, they make it very interactive. Um, 00:14:53.260 |
um, they kind of explain what this chart is like. So, uh, for labeling features, you know, 00:14:59.980 |
they, they say how there's different understandings for different, for the same concept. Um, I think 00:15:06.700 |
that's enough on circuit tracing. If there's questions, we can dig a little deeper and we can 00:15:12.060 |
always come back to it. But at a high level, what we've done so far is with a sparsity loss and a 00:15:18.780 |
recreation loss, we've kind of created a new local model, which is not small, by the way, the model 00:15:24.060 |
has to have the same layers as the original model, and you kind of have to retrain it to match output. 00:15:29.820 |
So this is not like cheap per se. It's pretty computationally expensive, but now we've been 00:15:36.060 |
able to kind of peel back through different layers, what features kind of activate upon, uh, output. 00:15:42.860 |
There's an interesting little section here that talks about how expensive this really is. So estimated 00:15:49.180 |
compute requirements for CLT training to give a rough sense of compute requirements to train one. 00:15:54.380 |
We share estimated costs for CLTs based on the Gemma 2 series. So on a 2B model, to, uh, train 2 million 00:16:01.420 |
features on a billion tokens, it takes about 210 H100 hours. On a 9B model, it takes almost 00:16:07.740 |
4,000 H100 hours, and that's for 5 million features on 3 billion tokens. Now that's not cheap, right? Like 00:16:15.100 |
this is 4,000 H100 hours. Most people don't have access to that. Um, but you know, they're able to do this 00:16:21.420 |
on Haiku and then we go back into our main blog post of what features they found and what different 00:16:26.700 |
little, um, interesting niches. I'll take a little pause here and see if we have any questions on 00:16:32.780 |
circuit tracing, what this CLT transcoder model is, um, any questions, any thoughts, any additions, 00:16:39.660 |
any comments, just very high level. What we've done so far is we've retrained a model. It matches the layers. 00:16:46.780 |
We call it the local replacement model. It matches the layers of the original transformer. 00:16:51.740 |
It freezes attention. It replaces the MLP or the feed forward network with this transcoder. 00:16:58.300 |
And basically this transcoder, this model, is trained to reproduce the exact same outputs 00:17:04.060 |
for inputs. And then we start to dig deeper at these little, um, sparse features and start to map them. 00:17:09.900 |
Uh, they do this, they show the cost of how much it would be then for the big one. So in this paper, 00:17:15.820 |
they, they train it on two, two models: an 18 layer language model, and then also on Claude Haiku. 00:17:22.540 |
The Haiku one is a local replacement model that has 30 million features and you know, you can kind of extrapolate 00:17:28.300 |
how expensive that would be. But quick pause, any, any thoughts on circuit tracing, any questions, 00:17:34.140 |
or otherwise we can start to continue. The next section is let's start to look at some of these 00:17:39.020 |
features. Let's see what happened. Can we, uh, they, they have a few different examples here. So 00:17:43.420 |
multi-step reasoning, planning in poems, features that are multilingual, features that kind of 00:17:50.620 |
show up in, uh, medical diagnosis, refusals; they start to do some stuff like different 00:17:56.620 |
clamping. So they clamp in different features. So, for example, in this capital-of-the-state-containing-Dallas prompt, 00:18:02.620 |
if we take out Texas, well, you know, let's say we sub in, uh, let's say we throw in the feature for 00:18:08.380 |
California. The model will now output, um, Sacramento. Okay. Questions. Why can't we just directly train circuits? 00:18:16.620 |
So you kind of are training the circuit. So the circuit tracing is this transcoder. What you are 00:18:22.460 |
training is this transcoder network, right? You keep attention flat, you replace it with the MLP, 00:18:27.500 |
but you're training this circuit. Um, in terms of directly training on circuits, you're, you're kind 00:18:34.780 |
of messing with that feed forward network, right? Like technically this is the exact same thing as our MLP 00:18:42.060 |
layer. It's just now you're forcing it to be sparse. Like we've trained a model to do the same thing, 00:18:47.980 |
but if you train it with a sparse, uh, with a sparsity penalty like that from scratch, 00:18:55.420 |
you probably won't get very far, right? This is like, in my mind, it's similar to distillation where 00:18:59.980 |
you can take a big model. You use a teacher forcing distillation loss to get a small model to mimic it. 00:19:05.420 |
But that doesn't mean that you can just train a small model to be just as good. 00:19:08.780 |
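For reference, the distillation loss being alluded to is usually something like the following (a standard KL-on-softened-logits sketch, not anything specific to the circuit tracing paper):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Student matches the teacher's softened output distribution."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```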
Um, okay. If we predict a SMILES string, I wonder what concept we can see. So there's like a very, 00:19:15.420 |
very deep interactive bunch of demos here of different, uh, input output prompts, 00:19:19.980 |
and you can see what features activate. So I found, um, global weights. 00:19:29.020 |
Okay. Well, we'll find it cause it shows up again in the other, in the other one, but okay. We'll, 00:19:35.260 |
we'll start to go through the actual biology of an LLM. So going through this, um, 00:19:39.820 |
okay. In this paper, we focus on applying attribution graphs to Claude 3.5 Haiku, which 00:19:47.900 |
is Anthropics lightweight model. So they have this introductory example of multi-step reasoning. 00:19:53.260 |
Uh, introductory example of multi-step reasoning, planning and poems, multilingual circuits, 00:20:05.180 |
addition, where it shows how it does math, medical diagnosis. Uh, we'll start to go through like the 00:20:11.340 |
first three of these. And then I think we'll just open it up for people's thoughts and we can dig 00:20:15.100 |
through the rest as needed. So brief overview is kind of that circuit tracing case study walking through 00:20:22.220 |
this. Okay. Um, they do talk a lot about limitations. If anyone's interested in mech interp, uh, they have 00:20:29.260 |
a whole like limitations section. They have a future works questions. They have open questions that they 00:20:35.420 |
would expect people to work on. But remember, unlike SAEs, which you can do on one layer, this stuff is 00:20:41.260 |
pretty compute intensive. So pretty big models you're training, but, um, you know, it's always interesting 00:20:47.820 |
stuff for people to work on. Okay. Method overview. This is just high level again of what we just 00:20:52.780 |
talked about. You freeze MLP, uh, sorry, you freeze attention. You change MLP to the CLT model. Then we 00:21:00.860 |
have feature visualization. They have this error nodes that they have to add in. This is the local replacement 00:21:05.820 |
model. So Texas capital is Austin. It goes through these different features. Okay. Um, they group these 00:21:13.500 |
related nodes on a single layer into super nodes. So we have one, we have, um, graphs, right? So 00:21:20.460 |
basically graph networks are kind of useful in this sense because each node is kind of a concept, 00:21:25.580 |
but then the edges between them can go throughout layers, right? So on a layer wise, they call these 00:21:31.340 |
super nodes and they kind of stack them together. So in this case, let's look at the features that activate 00:21:36.700 |
for the word capital. So, um, obviously terms like city, uh, buildings, uh, there's another feature 00:21:45.100 |
for, I guess this is a multilingual one. There's one for businesses, you know, capital, uh, cyber attacks 00:21:52.620 |
that happen, venture capital. What else have we got? We've got states, we've got the concepts of the United 00:21:58.860 |
States, France. So countries, um, now we've got another feature that, you know, it actually fires 00:22:05.260 |
up when we talk about specifics. So Connecticut, um, I think there's one here for languages as well, 00:22:10.700 |
which was pretty interesting. So like capital letters, you know, um, of course a bunch more cities. 00:22:16.140 |
Um, that's kind of the basic graph, right? So for Texas, we've got stuff like income tax, big, 00:22:23.180 |
far, um, Austin, different things that Texas is like. So these are kind of these super clusters. 00:22:31.500 |
Um, this is their example of intervention. If they clamp down the feature of Texas, 00:22:36.620 |
well now, you know, Texas capital, well, instead we're going to go through capital, say a capital, 00:22:41.420 |
then we observe that if we take out Texas, it instead decides that Sacramento is pretty important. 00:22:47.020 |
It's, it's the capital that it decides to predict. So, uh, we can clamp down on these. 00:22:51.820 |
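Operationally, "clamping" just means pinning a feature's activation during a forward pass and letting everything downstream react. Here is a generic PyTorch hook sketch of that idea; the module paths, feature indices, and magnitudes are hypothetical, and in the paper the headline interventions are ultimately verified by steering the original model with the feature's decoder direction, not only the replacement model.

```python
import torch

def clamp_feature(feature_module, feature_idx, value, token_pos=None):
    """Return a hook handle that pins one feature's activation to `value`."""
    def hook(module, inputs, output):
        output = output.clone()
        if token_pos is None:
            output[..., feature_idx] = value           # clamp at every position
        else:
            output[:, token_pos, feature_idx] = value  # clamp at one token position
        return output
    return feature_module.register_forward_hook(hook)

# Usage sketch (names hypothetical): suppress "Texas", inject "California", re-run the prompt.
# h1 = clamp_feature(model.clt_encoders[7], TEXAS_FEATURE, 0.0)
# h2 = clamp_feature(model.clt_encoders[7], CALIFORNIA_FEATURE, 8.0)
# print(model.generate("Fact: the capital of the state containing Dallas is"))
# h1.remove(); h2.remove()
```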
Not sure I understand why transformer attention KV matrices need to be frozen. They need to 00:22:56.780 |
be frozen because they don't want to train more than what they need in the circuit tracing, right? 00:23:01.180 |
They're basically doing this sparsity loss. And once you start messing with attention and training 00:23:06.540 |
in this objective, you're kind of going to mess stuff up, right? So all they're really trying to 00:23:11.660 |
do in circuit tracing is just train this, um, this replacement layer. They're, they're just training 00:23:18.220 |
these sparse transcoders. They're, they're not trying to, they're not trying to mess with attention. 00:23:23.900 |
So attention is a lot of the model, but you know, perhaps they could unfreeze it and we'd start to 00:23:29.500 |
get a weird aspect where, you know, now you have randomly your zero initialized weights. Um, and it's 00:23:37.260 |
not what we're trying to look at, but you could also do this through, um, the, the attention layers 00:23:41.980 |
are still kind of mapped. Right. But, um, yeah, that's why we freeze 00:23:47.420 |
attention. Okay. Uh, continuing through this, this is their first example of let's see if we can see 00:23:54.780 |
multi-step reasoning in, um, Claude 3.5 Haiku. And this is not a thinking model. This is just a regular 00:24:02.060 |
next token prediction model. How does it come to the output? So let's consider the prompt, uh, fact, 00:24:08.060 |
the capital of the state containing Dallas is, and then of course, Haiku is pretty straightforward. 00:24:13.100 |
It answers, uh, Austin. So this step, this question, this prompt takes two steps, right? 00:24:18.140 |
First, you have to realize that it's asking about the state containing Dallas. So, um, it's asking 00:24:23.900 |
about the capital of the state containing Dallas. So first, what state is Dallas in? I have to think, 00:24:29.260 |
okay, it's in Texas. Second, I have to think, what is the capital of Texas? It's Austin. So kind of two 00:24:35.900 |
steps to this answer. Right now, the question is, does Claude actually do these two steps internally 00:24:41.980 |
or does it kind of just pattern match shortcut? Like it's been trained enough to just realize, 00:24:46.540 |
oh, this is obviously just Austin. So let's peel back what happens at different layers. 00:24:51.420 |
Let's see what features activate and see if we have any traces of these, this sort of thinking work, 00:24:56.700 |
right? Does it have these two steps? Um, previous work has shown that there is evidence of genuine, 00:25:03.020 |
of genuine multi-hop reasoning to various degrees, but let's do it with their attribution graph. 00:25:08.700 |
So here's kind of, um, what they visualize. So first we find several features for the word, 00:25:13.980 |
the exact word capital. So the word capital has different features, right? So there's a business 00:25:21.740 |
capital. There's all this, um, capital of different countries. There's these different 00:25:27.020 |
features that they group together. They actually have cities as well. So Berlin, Athens, Bangkok, 00:25:32.060 |
Tokyo, Dublin, um, top of buildings. One example, um, there's, there's several features. Okay. 00:25:39.260 |
Then there's output features. So landmarks in Texas, these show up for, um, 00:25:45.900 |
one feature activates on various landmarks. So there's a feature around suburban district, 00:25:52.220 |
Texas history museum, some seafood place. Uh, we also find features that promote saying a capital. Okay. Uh, 00:26:02.060 |
features that promote the output of a capital generally. So responding with a variety of US state 00:26:08.140 |
capitals, this feature talks about different capitals. So headquarters, state capital promote 00:26:16.620 |
various countries, Maryland, Massachusetts. But going through all that, here's kind of where we end up. 00:26:22.620 |
So fact, the capital of the state containing Dallas is when we look at capital, here's the different 00:26:28.140 |
meanings of it, you know, um, state Dallas. Then when we go one, one level deeper, it looks like, 00:26:35.420 |
oh, there's this super node of say a capital, say a capital has capitals, crazy concept. It maps to 00:26:44.060 |
capitals. Texas has, you know, examples of different things in Texas. So Houston, Austin, San Antonio, 00:26:51.980 |
uh, features for, you know, different things: croquet that happens here, this place, teacher stuff. Um, 00:27:01.980 |
the attribution graph contains multiple interesting paths. We summarize them below. So the Dallas feature 00:27:07.980 |
with some contribution from the state feature activates a group of features that represent concepts 00:27:13.180 |
uh, related to the state of Texas. So Dallas and state, Dallas and state have features of Texas. 00:27:22.620 |
Um, kind of interesting, right? Dallas and state have features of Texas. In parallel, features activated by the 00:27:28.860 |
word capital activate another cluster used to say the name of a capital. So features of capital 00:27:35.580 |
have features of say a capital. The Texas features and the say-a-capital feature eventually 00:27:42.860 |
lead to say Austin. So piecing these two together, we have the, you know, say a capital plus 00:27:49.180 |
Texas, uh, leading to say Austin. Um, then they start to do some of this clamping work. 00:27:54.860 |
Clamping is pretty interesting, right? So if we look at the most probable prediction, 00:27:59.420 |
um, you know, capital of the state, Dallas, say Austin, Austin is most likely. If we take out 00:28:05.980 |
this feature of say a capital capital of state, Texas. Well, uh, if we take out capital right now, 00:28:12.540 |
it's just going to say Texas. If we take out Texas, it's just going to say capital of state, 00:28:17.660 |
Dallas, say a capital, and then it's kind of confused, right? So, um, they have little different 00:28:24.460 |
things as you, as you take out stuff. So if we take out capital state of Dallas, still Texas, if you take 00:28:30.140 |
out, um, state, it's still going to say, it's going to say Austin now. So capital, Dallas, Texas still says 00:28:38.940 |
Austin. From here, they start swapping in features. So if we swap in California, the feature for California 00:28:44.700 |
is pretty interesting, right? We see ferry building marketplace, um, universal studios, 00:28:50.460 |
sea world. You have a bunch of features that activate for California. Uh, what else have we got here? 00:28:58.620 |
different features outdoor San Jose. These are cities. So these are cities in California. 00:29:03.660 |
Um, Stockton, these are more cities, Riverside, Oakland, this one, the governor Republican. So this 00:29:11.500 |
is kind of the political feature for California. Once they clamp this into the Dallas thing, if they 00:29:16.860 |
replace this, the capital of the state containing Oakland is, um, they can get Cal, they, they can, 00:29:23.020 |
oh, sorry. So they, they change the prompt, you know, the capital of the state containing Oakland, 00:29:27.340 |
they find a California feature, a super feature of California. Then they can clamp it back in. Uh, 00:29:32.940 |
when they clamp in the capital of Dallas, they replace it with our California feature. It says 00:29:37.420 |
Sacramento. They do it to Georgia. They say it says Atlanta, uh, British Columbia says Victoria. 00:29:43.660 |
They find like the, the British Columbia feature has stuff like, you know, Canada and whatnot. 00:29:48.060 |
If they heavily add in China, it says Beijing. So this is kind of their process of how do we find 00:29:56.380 |
these super features? Here's how we can find one. You know, we change the prompt to Oakland. We find 00:30:02.140 |
something that represents California, a group of features. We swap that back into our original prompt 00:30:07.340 |
of Dallas. And you know, now we get Sacramento. We can do the same thing for other things that we can 00:30:12.460 |
kind of start to interpret this stuff. So that's kind of their, their first multi-step reasoning. 00:30:17.500 |
So we can one, see that the model has this two level approach, right? So it first has to figure out, 00:30:23.340 |
um, what state, then the capital of that state. And it's starting to do that. We can see that through 00:30:28.380 |
the layers. The second one is we can start to clamp these features through, uh, Ted, do you want to 00:30:33.820 |
pop in? Yeah. Just a super quick thing. So they do all of this circuit analysis on the replacement model, 00:30:40.300 |
because it's way easier to analyze the replacement model. It's smaller, it's linear, it's all that 00:30:45.740 |
stuff. But these experiments you show where they replace whatever Texas with California, 00:30:51.260 |
those are done on the original LLM. That's, that's super important. So they're not trying to prove the 00:30:56.860 |
replacement works this way. They're trying to prove the original LLM works the same way as the replacement. 00:31:02.860 |
And so, um, to the point in the chat, like, this could all be a bunch of BS, if not for the fact that the intervention 00:31:09.500 |
works on the original model. So if you, if you said that, you know, the, the ligament in my leg 00:31:15.660 |
is connected to vision and you, you cut that and I can't walk, but I can still see perfectly fine. 00:31:21.420 |
Then your explanation is probably wrong. But if you say the optic nerve is, is really important 00:31:26.780 |
for vision and you cut that and suddenly I'm blind then, but I can do everything else. I can walk, 00:31:32.700 |
I can taste, I can do everything else just fine. That's pretty strong support that the, that the, 00:31:37.500 |
the one thing you cut is a critical component just for what you said it was. 00:31:42.060 |
Yeah. Um, yeah, all this is still done on the original model. Uh, someone's asking what layers 00:31:49.740 |
generate these super node features. So there's super nodes across different layers, right? So, 00:31:54.700 |
uh, this is throughout different layers. There's one for California here, there's Oakland at this level. 00:32:01.340 |
So it's kind of throughout, they have a lot of interactive charts that you can play through this 00:32:05.660 |
to go through different layers. These are just kind of the hand cherry picked examples and they 00:32:10.860 |
acknowledge this as well. They acknowledge that what they found is cherry picked and heavily 00:32:15.660 |
biased towards what they thought was, you know, here's what we see. Here's what we should dig into. 00:32:21.100 |
It's a bit of a limitation in the work, but nonetheless, it's still there. Um, another example that they 00:32:27.100 |
show is, you know, uh, planning in poems. So how does Claude 3.5 Haiku write a rhyming poem? 00:32:32.860 |
So writing a poem requires satisfying two constraints at one time, right? There's two things that we have to do. 00:32:38.460 |
The lines need to rhyme and they need to make sense. There's two ways that a model could do this. One is 00:32:44.300 |
pure improvisation, right? Um, the model could just begin each line without regard of needing to rhyme. Uh, 00:32:51.340 |
sorry, the model could write the beginning of each line without regard for needing to rhyme at the end. 00:32:55.900 |
And then the last word just kind of has to rhyme with the end of the first line, or there's this planning step, right? 00:33:01.740 |
So you can either just kind of start. And as you go, think of words that rhyme, or you can actually plan 00:33:07.820 |
ahead. So this example tries to see, is there planning when, when I tell you to write a poem and I give you 00:33:13.740 |
a word to start with, like, you know, write a poem and have something that rhymes with the word tape. 00:33:19.180 |
I forced you to have the first word, and then you can start generating words that rhyme with tape. 00:33:24.060 |
Or if I tell you to write a poem about something you can plan in advance before the first word is written. 00:33:29.340 |
So, um, even though the models are, you know, trained to think one token at a time and predict the next 00:33:36.460 |
token outside of, you know, thinking models, uh, we would assume that, you know, the model would rely 00:33:41.740 |
on pure improvisation, right? It will just kind of do it on the fly. But the interesting thing here is 00:33:47.100 |
they kind of find a planning mechanism per se in what happens. So specifically the model often activates 00:33:54.300 |
features corresponding to candidate end of next line words prior to writing the line. 00:33:59.020 |
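A hedged sketch of how you would go looking for those planned-word features yourself: run the prompt through the replacement model up to the newline token, read the sparse feature activations at that position, and list the strongest ones. The `tokenize` and `feature_activations` helpers are hypothetical stand-ins for whatever interface you have to the features.

```python
def top_planning_features(model, prompt, k=10):
    """List the strongest features at the final (newline) token of the prompt."""
    tokens = model.tokenize(prompt)                 # hypothetical helper
    feats = model.feature_activations(tokens)       # hypothetical: [layer][position, n_features]
    newline_pos = len(tokens) - 1
    hits = []
    for layer, acts in enumerate(feats):
        values, idxs = acts[newline_pos].topk(k)
        hits += [(float(v), layer, int(i)) for v, i in zip(values, idxs) if v > 0]
    return sorted(hits, reverse=True)[:k]           # strongest features at the planning site

# prompt = "A rhyming couplet:\nHe saw a carrot and had to grab it,\n"
# for strength, layer, feature in top_planning_features(replacement_model, prompt):
#     print(layer, feature, strength)   # expect candidate-word features like "rabbit" / "habit"
```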
So before like the net, before the rhyming word is predicted, even if it's at the end of the line, 00:34:06.700 |
we can see traces of it starting to come up pretty early on. Um, so for example, a rhyming, a rhyming 00:34:14.220 |
couplet: he saw a carrot and had to grab it; his hunger was a powerful habit, or starving like a rabbit. 00:34:21.100 |
Um, these words start to show up pretty early on. So first let's look at, you know, where do these features 00:34:27.660 |
come from? Um, what are the different features that form them? So for habit, um, you know, 50x very clear 00:34:35.340 |
reason, best answer. Um, for habit there's, there's just different features, mobile app that gamifies 00:34:43.020 |
habit tracking, habit tracker, habit formation, uh, budgeting, rapid habit formation, discussing 00:34:53.020 |
habits with doctors. So, you know, once again, they've got this concept of habit. Uh, let's see where it starts to come in. 00:34:59.340 |
So before they go into their thing, they talk about prior work. Um, their example adds to that body of work in several ways. 00:35:07.020 |
We provide a mechanistic account for how words are planned, forward planning, backward planning, the model... 00:35:16.700 |
Here we are. Uh, the whole, the model holds multiple possible planned words in mind. We're able to edit the model's planned words. 00:35:28.700 |
We discover, um, the mechanism with an unsupervised bottom-up approach. The model represents planned words with 00:35:36.380 |
ordinary features. Okay. Planned words and their mechanistic role. So, um, we study how Claude completes 00:35:43.580 |
the following prompt asking for a rhyming couplet. The model's output sampling the most likely token is shown 00:35:48.380 |
in bold. So, uh, this is kind of the input, a rhyming couplet. He saw a carrot and had to grab it. The output we get is, 00:35:55.420 |
his hunger was like a starving rabbit. So model, the output is coherent. It makes sense and it rhymes, right? 00:36:02.060 |
So, uh, starving rabbit, carrot, kind of all rhymes there. To start, we focus on the last word of the 00:36:09.420 |
second line and attempt to identify the circuit that, uh, contributed to choosing rabbit. So 00:36:15.500 |
this makes sense, right? Rabbits like carrots, um, grab it, rabbit, it rhymes. So there's kind of that 00:36:22.620 |
two-step thing. Was it just the last token predicted or did we have some thought to it? 00:36:26.700 |
Okay. So these are kind of the, the features. So it comma hunger was like starving. Okay. Let's, 00:36:34.460 |
let's start to dig through this. So, rhymes with: there's a feature here for rhymes with the 'it' sound, um, get, um, 00:36:45.340 |
that they have features that activate across different languages and stuff that activate, 00:36:50.300 |
uh, that, you know, have this sort of rhyming feature. Then they have rabbit and habit that came up, 00:36:55.420 |
um, say rabbit. And then this feature of the dash T and then, oh, cool. We got rabbit. What does this show? 00:37:03.900 |
The attribution graph above computed by, uh, attributing back from the rabbit output node shows 00:37:09.980 |
an important group of features activate on the new line token before the beginning of the second line 00:37:14.620 |
features activate over the it token. So, um, 00:37:20.540 |
basically the second last output token where, um, grab it had features that activated these different, 00:37:30.940 |
um, you know, sort of rhyming tokens. The candidates have, uh, the candidate completions in turn have 00:37:38.300 |
positive edges to say rabbit features over the last token. So that's this hypothesis. We perform a 00:37:43.660 |
variety of interventions on the new line planning site to see how it affects the probability 00:37:48.540 |
of the last token. Okay. So let's, uh, 10 X down the word habit and we've got different changes, 10 X up 00:37:57.980 |
and down new line, um, different things affect different things. The results confirm our hypothesis 00:38:04.460 |
that planning features strongly influence the final token. So if we kind of take out that new line 00:38:10.780 |
token, we can see, oh, it's a, it's not doing this anymore. Okay. Planning features only matter at 00:38:16.860 |
planning location, planning words influence intermediate words, nothing too interesting here. Okay. Clamping, 00:38:23.740 |
was that applied to the original transformer? How do they map the transcoder back to the transformer? Say we clamp Texas. 00:38:28.540 |
So in, there's a question around the clamping stuff and how this is working. The previous SAE thing that 00:38:34.220 |
they put out in May, it explains how they do all these clamping features. Uh, basically same thing. 00:38:39.740 |
There's more in here as well. In both of these papers, they kind of go into the math about it as well, but 00:38:45.660 |
keeping it high level. Let's just kind of try to see, um, some more of these planned words. So, um, 00:38:52.860 |
yep, we can, we can sort of see as we take out different things, uh, we no longer have this planning 00:38:59.900 |
step. Okay. I'm going to go quickly through the next few ones, ideally in the next like seven minutes, 00:39:05.980 |
and then we'll leave the last 10 minutes for just questions and discussions on this. So we first, 00:39:11.180 |
you know, we just saw how there's pre-planning in poems for rhyming. There's this multi-step sort of 00:39:17.660 |
thinking that happens throughout layers. Uh, now we've got multilingual circuits. So models, uh, modern 00:39:24.460 |
networks have highly abstract representations and unified concepts across multiple languages. So we have 00:39:30.220 |
little understanding of how these features fit in larger, larger circuits. Let's see how it, um, you know, 00:39:36.060 |
how does it go through the exact same prompt in different languages? Are there features that fire 00:39:43.500 |
that are consistent through different languages? Um, also fun fact, I guess rabbits don't eat carrots. 00:39:49.740 |
Carrots are like treats. Crazy, crazy. Someone knows about, um, rabbits. Okay. So, um, the opposite of small is, 00:39:58.220 |
and then we would expect big. In French, it's grand; in Chinese, it's this character. Um, let's see if there's 00:40:05.260 |
consistency across these features. So the high level story is the same: the model recognizes, using a language 00:40:11.340 |
independent, um, representation. So very interesting. There's language independent, uh, representation. So this term of 00:40:20.700 |
say large, uh, is something that, uh, that activates across all three languages. Let's see some of the features. So, um, 00:40:29.020 |
large has stuff like, you know, 42nd order. Uh, there's a Spanish version in here, uh, short arm and long arm. 00:40:36.940 |
It activates in this language. It activates in a numerical sense. It activates small things. Great. Um, 00:40:47.340 |
this feature is kind of multilingually representing the word large. Um, 00:40:57.580 |
yeah, there's, there's kind of just these high level features that activate. So 00:41:03.420 |
the opposite of small is little, uh, there's a synonym feature, antonym, antonym kind of synonym, multilingual, 00:41:13.660 |
very interesting. Editing the operation antonyms, the synonyms is kind of another one. 00:41:17.660 |
They can kind of clamp this in. So, um, they show how that works. Editing small to hot. 00:41:24.860 |
Okay. Editing the output language. There's another thing that we can start to do. So if we start to 00:41:30.300 |
swap in the features for different languages, you know, we can get output in different language. Um, 00:41:35.900 |
more circuits for French, I think it's okay. You can go through this on your own time. Do models 00:41:41.820 |
think in English? This is an interesting one. As researchers have begun to mechanistically 00:41:46.860 |
investigate multilingual properties of models, there's been tension in the literature. 00:41:51.420 |
Researchers have found multilingual neurons and features and evidence of multilingual representations. 00:41:58.060 |
On the other hand, others have presented evidence that models, um, you know, they, they use English 00:42:04.220 |
representations internally. Uh, so what should we make of this conflicting evidence? It seems to us that 00:42:09.820 |
Claude 3.5 Haiku is generally using genuinely multilingual features, especially in the middle 00:42:15.820 |
layer. So in middle layers, we see multilingual features. Um, there in, there are important mechanistic 00:42:22.780 |
ways in which English is privileged. For example, multilingual features have more significant direct 00:42:28.300 |
weights to corresponding English output nodes, while non-English outputs are more strongly mediated by 00:42:34.140 |
say-X-in-language-Y features. So kind of interesting. There's still a bit of an English bias, but you know, 00:42:39.100 |
there are definitely some inherent, um, multilingual features there. Okay. Next example is addition. 00:42:47.500 |
Uh, we want to see how does Claude add two numbers like 36 plus 59. Uh, we found, uh, 00:42:54.060 |
we found that it splits the problem into multiple pathways, computing the result at a rough 00:43:01.180 |
precision in parallel with computing the ones digit of the answer, before reconstructing these to get the, uh, 00:43:06.940 |
the correct answer. We find a key step performed by a lookup table feature. Ooh, very interesting. 00:43:12.060 |
Lookup table features that translate properties of the inputs. Okay. Let's kind of see what's 00:43:17.340 |
going on first. We visualize the role features play in addition problems, um, showing the activity 00:43:24.300 |
of features on the equals token for prompts like calc: a plus b equals. So addition features, calculate 00:43:31.420 |
a plus b equals, they kind of have this lookup table of features. It's very interesting how it's 00:43:37.340 |
doing attention. Uh, this one gets a little bit complex in how we go through what's happening here, so in the 00:43:45.180 |
interest of time, I think that's enough of a little overview. They can, of course, mess with its math. 00:43:51.260 |
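As a loose analogy for that decomposition (plain Python of my own, not the model's actual circuit): one pathway only tracks rough magnitude, another is effectively a lookup on the two ones digits, and the answer comes from reconciling them.

```python
def toy_add(a, b):
    """Loose analogy: two parallel 'pathways' whose results get reconciled."""
    # Pathway 1: rough magnitude, adding the tens and ignoring the ones digits.
    rough = (a // 10 + b // 10) * 10           # 36 + 59 -> 30 + 50 = 80
    # Pathway 2: a small lookup on just the ones digits gives the ones digit and carry.
    ones_lookup = a % 10 + b % 10              # 6 + 9 -> 15
    ones_digit, carry = ones_lookup % 10, (ones_lookup // 10) * 10
    # Reconcile the pathways into the final answer.
    return rough + carry + ones_digit          # 80 + 10 + 5 = 95

print(toy_add(36, 59))   # 95, the 36 + 59 example from the paper
```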
Let's go on to the next one. Medical diagnosis. This is a fun one. In recent years, 00:43:58.220 |
researchers have explored medical applications for LLMs, for example, aiding clinicians in accurate 00:44:03.580 |
diagnosis. So what happens? Thus, we are interested in whether our methods can shed light on the reasoning 00:44:10.300 |
the model performs internally in medical contexts. We study an example scenario 00:44:16.780 |
in which a model is presented with information about a patient and asked to suggest a follow-up 00:44:22.140 |
question to inform diagnosis and treatment. This mirrors common medical practice. Um, 00:44:28.540 |
okay. So let's see what happens. Um, human: a 32 year old female, 30 weeks gestation, with, 00:44:35.580 |
you know, a mild headache and nausea. Um, if we could only ask about one symptom, what would, 00:44:42.540 |
what would we ask? Assistant: visual disturbances. So the model's most likely 00:44:48.620 |
completion here is visual disturbances, and this is one of the two key indicators of this issue. Okay. We noticed that the 00:44:56.220 |
model activated a number of features that activate in context of this, um, you know, this issue in people. 00:45:03.900 |
So what are these features, in coming to this? Okay. Their, their UI is struggling slightly. Uh, there's a more 00:45:10.700 |
serious condition here, uh, gestational blood pressure issues. So there's a bunch of features that come up for this: 00:45:18.220 |
blood pressure, protein, stroke. Um, some of the other features were on synonyms of this, other 00:45:27.020 |
activations in broad context, kind of interesting, right? So they do see this kind of internal 00:45:32.940 |
understanding. They have more examples of this for different stuff. So, you know, if we could only ask 00:45:37.820 |
one thing, it's whether he's experiencing chest pain in this one, whether there's a rash, and they kind of go 00:45:43.180 |
through what are some of the features that make this stuff up here. Uh, pretty interesting. I think 00:45:47.500 |
you should check it out if interested. Okay. Uh, 10 minutes left. I think there's a lot 00:45:53.180 |
more of these: clamping, there's entity recognition, there's refusals. But okay. I want to 00:45:59.740 |
pause here. See if we have any other comments, questions, thoughts, things that we want to dig more into. 00:46:09.900 |
I'm going to check chat, but see if there's, um, yeah, if anyone has any stuff they want to dig into, 00:46:16.380 |
let's feel free, you know, pop in. Could you do a quick overview of the hallucination section? 00:46:25.260 |
Yeah. Let's just keep going. Um, so entity recognition and hallucination. So basically, 00:46:34.380 |
hallucination is where you make up false information, right? Hallucination is common when models are asked 00:46:39.420 |
about obscure facts because they like to be confident. An example, consider this hallucination by, uh, 00:46:45.260 |
given by Haiku 3.5. So the prompt is, uh, this guy plays the sport of, and the completion is pickleball, which is a paddle 00:46:53.100 |
ball sport, uh, that combines elements of this and that. The behavior is reasonable given the model's, uh, training 00:46:59.020 |
data. A sentence like this seems likely to be followed by the name of a sport, and without any information on who this 00:47:04.700 |
guy is, the model says a plausible sport, uh, basically at random. 00:47:12.620 |
However, models are trained during fine tuning to avoid such, uh, behavior when acting in the assistant character; 00:47:19.740 |
this leads to responses like the following. So base model Haiku, without its kind of, um, 00:47:25.260 |
um, you know, RL chat tuning. It just completes this and says, Oh, the sentence sounds like a sport. 00:47:32.460 |
I will give you a sport. Now, after their sort of training, what sport does this guy play, answer in 00:47:38.460 |
one word, the model's like, Oh shit, I can't do that. I don't know who this is. I need context. Given that 00:47:43.820 |
hallucination is some sense of natural behavior, which is mitigated by fine tuning. We take a look at the 00:47:49.420 |
the, uh, the circuits that prevent models from hallucinating. So they're not really, in this 00:47:54.780 |
sense, looking at hallucination and what caused it. They're looking at how they fixed it. So, 00:47:58.460 |
uh, quick high level TLDR. We have base models. We do this RL or SFT and we convert them into chat 00:48:05.900 |
models, right? In that we have this preference tuning. One of the things that they're trained to 00:48:10.060 |
do is be a helpful assistant. And that the objective is kind of, if you don't know what to say, 00:48:16.060 |
you know, you tell them, you don't know, and you ask for more context. So base model 00:48:20.220 |
would just complete tokens and be like, this guy plays pickleball. Cause it sounds like he plays a 00:48:24.140 |
sport. Um, there's probably a famous Michael or two or, or Batkin that play pickleball. Um, 00:48:31.020 |
assistant model is like, yo, I don't know who this guy is. So let me ask for more information, 00:48:35.820 |
but let's start to look at, um, what are these features that make that up? So hallucinations can be 00:48:41.980 |
attributed to a misfire in the circuit. For example, when asking the model for papers written by a 00:48:48.060 |
particular author, the model may activate some of these known answer features, 00:48:52.940 |
even if it lacks the specific knowledge of the author, uh, knowledge of the author's specific papers. 00:48:59.260 |
This is one kind of interesting, right? So our results are related to recent findings of Ferrando 00:49:06.780 |
et al., which use sparse autoencoders to find features that represent unknown 00:49:11.660 |
entities. So this. So, okay. Human: in which country is the Great Wall located? Uh, it says China. In which 00:49:18.140 |
country is this based? Okay. Um, known answer, unknown answer, different features. Then, uh, default, 00:49:25.180 |
uh, default refusal circuits. There's a can't answer, um, feature that's activated broadly, that will 00:49:34.140 |
fire for human assistant prompts. The picture suggests that the can't answer feature was activated by 00:49:40.380 |
default for Human/Assistant prompts. In other words, the model is skeptical of the user. So they kind of show 00:49:45.500 |
this can't answer. Can't-answer features are also, uh, promoted by a group of 00:49:53.820 |
unfamiliar names. So names that it doesn't recognize are, I guess, a feature; uh, these kind of just prompt 00:50:00.780 |
it to say, I can't answer. I don't know. Okay. Now what about the known answer circuit? So where does Mike, 00:50:07.420 |
what sport does Michael Jordan play? He plays basketball. So there's a group of known answer and known entity 00:50:13.420 |
features. These are what accidentally misfire when you get hallucination. That's a bit of a spoiler, 00:50:19.420 |
but you know, uh, known answer is like different features that kind of, you know, know what the answer is: 00:50:27.260 |
what country is this based in? It knows Japan. What team does Devin Booker play on? It knows the answer. 00:50:33.820 |
Where's the great wall located? These are kind of known internal facts. There's a feature for it. Once this fires, 00:50:39.740 |
you're cooked, it's going to answer and you know, it'll hallucinate. Once this has gone off, there's 00:50:43.740 |
strong evidence for that. Um, this graph, these graphs are kind of a little interesting. They kind 00:50:48.540 |
of show both sides. Um, so this is kind of the traversal throughout the layers in the RL, right? 00:50:55.180 |
So we had, um, this assistant feature because it's an assistant unknown name was a feature. And then, 00:51:01.740 |
you know, that leads to, I can't answer because this thing has been RL to not answer stuff. I apologize. 00:51:07.740 |
So can't answer. I apologize. I can't figure this out. That's where the next turns come up because that 00:51:12.140 |
shows up after that. What about something we do know? Michael Jordan. Oh, I know this answer. There's 00:51:17.100 |
a bunch of stuff that, sorry. So first we have assistant and Michael Jordan in layer one known 00:51:23.820 |
answer. Oh my God. Okay. Known answer. I know a bunch of these facts, say basketball, 00:51:29.180 |
Oh, basketball has said vertical vertical. Now let's once again, do a bunch of fun clamping stuff, 00:51:34.780 |
right? So, um, if we have Michael, Michael Jordan and we have known answer and we clamp it, 00:51:44.300 |
it says basketball. What if we clamp down known answers? If we take that feature, we turn it down, 00:51:49.980 |
even though the question is what sport does Michael Jordan play? We clamp down known answer. Well, 00:51:55.580 |
it can't answer because the other one that fires up is unknown answer. Um, what sport does it play if we, 00:52:02.540 |
if we still have strong known answer and we add in unknown name, it still says basketball, 00:52:07.980 |
um, little stuff here to kind of go through, but that's kind of a high level of what's going on. 00:52:13.820 |
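A toy sketch of the gating story just described, with made-up weights and thresholds purely for illustration: the can't-answer feature is on by default for Human/Assistant prompts, known-entity and known-answer features inhibit it, and a misfire of those "known" features is what lets a confident hallucination through.

```python
def toy_refusal_circuit(is_assistant_prompt, known_entity, known_answer):
    """Toy default-refusal circuit: refuse unless 'known' features inhibit the refusal."""
    cant_answer = 1.0 if is_assistant_prompt else 0.0       # on by default for assistant prompts
    cant_answer -= 0.6 * known_entity + 0.6 * known_answer  # inhibition from 'known' features
    if cant_answer > 0:
        return "I'm not sure I know who that is; could you give me more context?"
    return "answer confidently (a hallucination, if the 'known' features misfired)"

print(toy_refusal_circuit(True, known_entity=1.0, known_answer=0.9))  # answers
print(toy_refusal_circuit(True, known_entity=0.1, known_answer=0.0))  # refuses
```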
I thought the academic papers one is another interesting. So, um, same concept, but you know, 00:52:19.260 |
this is this unknown name feature of stuff. So: name a paper written by, uh, Andrej Karpathy. One notable 00:52:28.060 |
paper is ImageNet. Um, there's kind of the same thing, known answer, unknown name. If we change them 00:52:34.940 |
up, what happens? Um, pretty, pretty fun stuff, you know? Okay. That's kind of high level of what's 00:52:43.180 |
happening in the hallucination. Refusals was another interesting one. It's kind of interesting to see 00:52:48.140 |
some of the output to the base model and how their RL is like, you know, showing how this stuff works. 00:52:53.340 |
They have known entity and unknown entity features and say-I-don't-know features. The I don't know 00:52:58.940 |
feature is much less interesting. The unknown feature relates to self knowledge. Okay. Yeah. Just 00:53:05.260 |
interesting thoughts on this. Cool. Three more minutes. Any other fun thoughts, questions, comments? 00:53:11.980 |
I would recommend reading through just examples of this. Um, and if you haven't, the SAE one is pretty 00:53:17.820 |
fun too. Here's kind of the limitations, what issues show up, um, discussion. What have we, what have we 00:53:29.100 |
learned? Um, yeah, kind of high level. Interesting. The background of this is kind of this circuit tracing 00:53:36.460 |
transcoder work. It's very interesting how they can just train a model with a reconstruction loss and 00:53:42.140 |
just have it match the output because you know, these models are still only 30 million features, 00:53:47.500 |
even though they have the same layers, it's still outputting the exact same outputs. Kind of interesting. 00:53:52.540 |
Uh, do folks think the taxonomy of circuits, how circuits are divided, will likely converge to the 00:54:00.460 |
same breakdown for every model, or will different models do different things differently? I think that 00:54:05.020 |
different models might do different things differently, right? Cause this is layer wise 00:54:08.860 |
understanding different models have different architectures, different layers, whether they're 00:54:13.100 |
MOEs, they're also trained in different ways, right? So the pre-training data set mixture kind of affects 00:54:19.420 |
some of this. So what if you're trained on high value, you know, training data first, and then garbage 00:54:26.380 |
at the end, you've probably got slop in your model, but you know, you might have different circuits that go 00:54:31.580 |
throughout. Um, and then there's obviously some general variety. Um, they did though, they did 00:54:37.900 |
actually in this one train it on a 18 layer language model, just a general model on a couple billion tokens. 00:54:43.900 |
And it still has coherency in what you expect. This still goes back to like early transformer stuff. You know, 00:54:48.940 |
we have a basic understanding that early layers are more general and later layers are more niche and output specific, but. 00:54:55.660 |
Okay, that's kind of, um, kind of time on the hour. 00:55:01.580 |
I think next week and the week after we have a few volunteers. If Llama 4 drops a paper, we'll cover it, 00:55:10.700 |
of course. But I think we have a few volunteers. We'll, we'll share in Discord what's coming 00:55:16.060 |
soon. If anyone wants to volunteer a paper, if anyone wants to, you know, follow up, please share. 00:55:22.300 |
Thanks, Ted, for sharing insights as well, by the way. 00:55:24.780 |
Um, I think we have a, don't we have a potential speaker for next week? 00:55:32.300 |
Yeah, I thought you had one. I also have one. 00:55:34.700 |
Uh, yeah, but mine moved back after your guy came in. 00:55:39.020 |
Oh, okay. Okay. Well, I'll, I'll share details. Um, in discord. 00:55:42.780 |
Okay. Oh, there's questions in the chat. Yeah. 00:55:46.140 |
Uh, are there open weights sparse autoencoders? 00:55:51.020 |
Ooh, Gemma trained some. Um, there, there's some layer wise ones. 00:55:55.820 |
So like there's some that have been done on like Llama 3 8B for each layer, but not throughout the whole 00:56:01.180 |
model. Um, but yeah, transcoder is different. It's not model wise autoencoder. 00:56:07.260 |
Uh, they do give recipe and, you know, expected cost to do this yourself though. 00:56:14.620 |
Okay, let's continue discussion in discord then. Thanks for attending guys.