
Anthropic: Circuit Tracing + On the Biology of a Large Language Model



00:00:00.000 | So I presented the other three Anthropic SAE kind of mech interp papers. I think I'll just share
00:00:08.460 | slides after this. So if people want to go through them, go through them. But basically the very,
00:00:13.160 | very first one that they did was like, okay, forget LLMs. Let's just see if we can interpret
00:00:18.420 | basic stuff in a transformer. So let's just pull it up. Anthropic SAE. So the first one was basically
00:00:28.860 | let's take a toy model. I think it was just like, you know, three layers. Like they had a basic
00:00:33.800 | encoder and then, you know, the sparse auto encoder. And they just trained a toy model,
00:00:39.280 | like a couple hundred million parameters that input output, and they have this middle layer,
00:00:43.920 | which was just an encoder. Can we start to interpret what's going on in that encoder?
00:00:48.180 | Turns out, yeah, they can find some features. After that, they had like, or that was just
00:00:54.260 | encoders. Then they started to make it sparse. This was kind of the big one that became pretty
00:00:58.680 | popular. I think this came out in May. We covered it shortly after. They applied this SAE work to
00:01:05.600 | Claude Sonnet. They found out, oh shit, we can find features that happen in the model. So they
00:01:11.620 | basically train a sparse auto encoder to match these inputs and outputs. And then they start to
00:01:17.340 | interpret them and map out features. High level TLDR is we can now map features that activate in this
00:01:26.560 | sparse auto encoder. So the whole thing is you train an encoder to stay sparse. You only want
00:01:31.460 | a very few number of features to activate for outputs. In this case, they found out stuff like
00:01:36.780 | Golden Gate Claude. There's specific features that they trained in their little auto encoder
00:01:41.760 | that, you know, activate when specific topics come up. So they had a feature that would always fire up for
00:01:48.480 | words like Golden Gate. They had stuff for like tourism, for infrastructure. There were features
00:01:53.960 | that extended throughout like multiple concepts. So, you know, it's not just one feature to one thing.
00:01:59.920 | But yeah, they have a pretty good long blog post on this. They started grouping them. They had different
00:02:05.840 | sizes. So they had 1 million, 4 million, and 34 million feature autoencoders.
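For anyone who hasn't seen the SAE setup, here's roughly what that training objective looks like. This is an illustrative sketch with made-up dimensions and coefficients, not Anthropic's code: an overcomplete encoder with a ReLU, a decoder back to the model dimension, and a loss that combines reconstruction error with an L1 penalty so only a few features fire on any given input.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: model activation -> overcomplete sparse features -> reconstruction.
    Dimensions and coefficients are illustrative, not the paper's."""
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        feats = torch.relu(self.encoder(x))  # most entries should end up near zero
        recon = self.decoder(feats)
        return feats, recon

def sae_loss(x, feats, recon, l1_coef: float = 1e-3):
    # faithfulness to the model's activations, plus an L1 penalty that enforces sparsity
    return ((recon - x) ** 2).mean() + l1_coef * feats.abs().mean()
```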
00:02:15.140 | From there, you know, it's been a few months and now they're like, okay, no more SAEs. Let's do circuit tracing. So basically SAEs were
00:02:21.120 | good, but we only kind of had a partial understanding, right? You can apply an SAE to every layer and
00:02:26.420 | try to understand what happens in each layer, or you can apply it to just specific parts of
00:02:32.280 | the model. You can do it on attention blocks and try to interpret what parts of them are firing up,
00:02:37.180 | but this is where they started to come in with circuit tracing. So circuit tracing is where you
00:02:42.580 | actually train a transcoder model to mimic the input model and you can do this across layers. So
00:02:47.900 | this model actually, it matches the input model layer by layer and it, you know, it maps out what's going
00:02:57.040 | on. I'll be right back in a second. I was stuck. Sorry. Doggo is crying. Okay. So circuit tracing,
00:03:07.840 | this came out a few weeks ago, and this is kind of the high level overview of what they're doing here.
00:03:11.620 | Basically, they train this cross-layer transcoder and they start poking around in Claude 3.5 Haiku,
00:03:18.940 | and they start to find features that are consistent throughout layers. Hey Siri, stop so many
00:03:25.580 | notifications. Um, so some of the interesting stuff here is they try to see like, can models internally
00:03:31.580 | think like when they answer questions, like things that take two different steps, does the model start
00:03:37.740 | to think through its response in advance, or is it just, you know, token predicting? And they, they find
00:03:42.300 | interesting little case studies where actually the model is doing some thinking. So the first main example
00:03:48.060 | that they show here is like, um, there's this prompt of what is the capital of the state that includes
00:03:58.460 | Dallas, and you're supposed to say Austin, right? So this is kind of a question that has two steps of thinking,
00:04:03.740 | right? There's two levels of reasoning. First step is you have to think, um, what is the state that the city is
00:04:10.300 | in and then what's the capital of that state? So they kind of go through how they do all this, but let's,
00:04:17.260 | let's, let's start off by talking about this previous, um, previous paper that came out like a week ago about circuit
00:04:23.980 | tracing. So circuit tracing is where they train this, um, transcoder model to replicate the exact input
00:04:31.900 | model. And then they start to do these attribution graphs to figure out what happened. So, um, high level,
00:04:39.020 | here's kind of an overview of what, oops, of what people have done in previous mech interp work. Um,
00:04:45.740 | we had transcoders. Transcoders were, you know, alternatives to SAEs that let us build replacement models.
00:04:53.740 | Then we have this cross layer transcoder, which is let's do transcoders that go throughout different
00:04:58.860 | model layers. Then we have attribution graphs and linear attribution between features. Um, they prune out
00:05:06.380 | the ones that are not that relevant. They have a little figure here; we'll go a little quick through
00:05:10.860 | this since it's kind of a second, uh, paper, but they did have a little overview that I thought was
00:05:16.380 | interesting here. Okay. Uh, building an interpretable replacement model. So this is kind
00:05:24.380 | of the architecture of what this model is. So once again, they're going to create an entire model,
00:05:30.860 | call it a local replacement model, which matches the number of layers of
00:05:37.180 | the original transformer. So they, they train two of these. And so they start to give some statistics
00:05:42.060 | of what it would be like to train another one. And I think they talk about how, how much compute
00:05:46.940 | this requires on like Gemma 2B and like a 9B model. But essentially what they're doing here is they take
00:05:52.540 | a model, they look at the architecture and they freeze the attention. And basically they replace
00:05:58.540 | this MLP. So the feed forward layers, they replace the MLP with this cross layer transcoder,
00:06:04.220 | and then they can start to make this sparse and have features that we can interpret from it.
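To make the architecture concrete, here's a minimal sketch of a cross-layer transcoder in the spirit of what's being described: per-layer encoders read the residual stream, and each layer's features get separate decoders that write into the reconstructed MLP outputs of that layer and every later layer. The dimensions, the ReLU nonlinearity, and the loss coefficients are my own assumptions, not the paper's actual setup.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch of a cross-layer transcoder (CLT): features read the residual
    stream at layer l and write, through separate decoders, into the MLP
    outputs of layers l..L-1. Wiring and sizes are illustrative."""
    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.n_layers = n_layers
        # one encoder per layer: residual stream -> sparse features
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, n_features) for _ in range(n_layers))
        # decoders[src][dst - src] maps layer-src features to the layer-dst MLP output
        self.decoders = nn.ModuleList(
            nn.ModuleList(nn.Linear(n_features, d_model, bias=False)
                          for _ in range(n_layers - src))
            for src in range(n_layers))

    def forward(self, resid_pre):
        # resid_pre: list of per-layer residual-stream tensors, each [batch, d_model]
        feats = [torch.relu(enc(x)) for enc, x in zip(self.encoders, resid_pre)]
        recon = []
        for dst in range(self.n_layers):
            # each layer's reconstructed MLP output sums contributions from
            # features at that layer and every earlier layer
            recon.append(sum(self.decoders[src][dst - src](feats[src])
                             for src in range(dst + 1)))
        return feats, recon

def clt_loss(feats, recon, mlp_out, sparsity_coef: float = 1e-3):
    """Reconstruction against the frozen model's MLP outputs plus a sparsity penalty."""
    recon_term = sum(((r - t) ** 2).mean() for r, t in zip(recon, mlp_out))
    sparsity_term = sum(f.abs().mean() for f in feats)
    return recon_term + sparsity_coef * sparsity_term
```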
00:06:08.940 | There's a bunch of math that, if you're interested, is pretty straightforward actually; it's just
00:06:13.900 | a straight replacement, trained to match the exact input-output behavior. Um, so here's a cool
00:06:19.580 | little diagram. Basically you have different layers in a transformer, right? This is an original transformer
00:06:24.620 | model. You have attention blocks and you have the MLP, right? So throughout different layers,
00:06:29.660 | there's attention, then there's feed forward, attention and feed forward. And then eventually
00:06:33.180 | you have output, you pick the most probable token and you know, that's your output. So in the
00:06:38.060 | replacement model, instead of these MLP feed forward networks, they're replacing them with these cross
00:06:44.540 | layer transcoders. These cross layer transcoders speak to each other and we start to interpret,
00:06:49.820 | you know, we want to keep them sparse. So there's a sparsity factor, so only a few features activate,
00:06:54.460 | and then we map those to something interpretable. Um, this blog post is actually very long, but
00:07:00.220 | that's how they make this local replacement model. Um, Ted, you have a question?
00:07:04.460 | Not a question, but can, is it okay if I add a little bit of color here? Yeah. Um,
00:07:11.100 | so, so one of the things is the early research along this very same direction on CNNs didn't require any of
00:07:19.180 | this stuff. And the reason is because, um, uh, the conventional wisdom now is that the number of
00:07:25.180 | things that, that people wanted to represent in a CNN was approximately equal to the number of filters,
00:07:32.300 | the number of neurons, uh, uh, layers and such that you have in a CNN. So CNN wants to find vertical
00:07:38.860 | lines, horizontal lines, diagonal lines, and then in the higher layers, triangles, circles, squares,
00:07:43.660 | and then eventually faces, arms, that kind of stuff. And you have approximately as many things in your network
00:07:49.900 | as you do concepts that you're trying to represent. So if all the data lives in, in essentially a vector
00:07:58.380 | space, if you guys remember your linear algebra, then everything can be represented as an orthogonal direction.
00:08:04.540 | And there's this linear representation hypothesis that says that information is encoded in a direction,
00:08:11.260 | not in a magnitude, just in a direction. And if you have a small number of concepts, they can all be completely
00:08:17.660 | orthogonal. And if you take the dot product of a vector with any of your concepts, there will be no interference
00:08:25.100 | between concepts because they're all orthogonal to each other. So if one is due east and one is due north,
00:08:31.180 | and you, you dot something with a canonical north vector to see how, whether or not north is present,
00:08:37.500 | whether you add more east-west or not changes nothing about the dot product in your north direction.
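A tiny numerical illustration of what Ted is describing here (my own toy example, not from the paper): with orthogonal concept directions, adding one concept doesn't disturb the readout of another, but a nearly-orthogonal "superposed" direction does create interference.

```python
import numpy as np

north = np.array([0.0, 1.0])   # concept A's direction
east = np.array([1.0, 0.0])    # concept B's direction, orthogonal to A

x = 3.0 * north                # "3 units of concept A"
x_plus_east = x + 5.0 * east   # add a lot of concept B

# Reading out concept A is a dot product with its direction; since the
# directions are orthogonal, adding B changes nothing about the A readout.
print(np.dot(x, north), np.dot(x_plus_east, north))   # 3.0 3.0

# With more concepts than dimensions, a third concept has to take a
# nearly-but-not-exactly orthogonal direction, and readouts interfere.
northeast = np.array([1.0, 1.0]) / np.sqrt(2.0)
y = 3.0 * northeast            # only the third concept is active
print(np.dot(y, north), np.dot(y, east))   # both ~2.12: spurious signal
```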
00:08:43.580 | The problem is, when we get to LLMs, all operations are additions. There are no rotations in the
00:08:51.260 | linear representation hypothesis. So what you have to do is you have to sort of, if you have something east and
00:08:56.540 | you want to add north, you have to sort of like add a lot of north to make sure that you get north-northeast
00:09:02.860 | enough that your dot product with north is, is not close to zero anymore. So the problem with LLMs is that we
00:09:10.780 | think that there are hundreds of millions, if not billions of concepts that an LLM needs to understand.
00:09:17.420 | And there are not enough neurons in the LLM to uniquely, or sorry, there's not enough space in the
00:09:25.020 | residual stream to uniquely represent all of these concepts. So you might have, um, a model dimension
00:09:30.700 | that's what, 16,000, 30,000, some, somewhere around there, right? In a big model. That's not nearly enough
00:09:37.660 | to represent hundreds of millions or billions of concepts, each with orthogonal directions. So then
00:09:43.900 | ultimately what ends up happening is the model takes advantage of sparsity and it says, well, if I
00:09:49.740 | represented basketball as north and the Eiffel Tower as east, and I represented ethylene glycol as
00:09:58.220 | northeast, the odds that we're going to have the Eiffel Tower and ethylene glycol in the same sentence
00:10:04.460 | are pretty small, same paragraph, same sentence, whatever. Uh, so that if I take the dot product
00:10:11.260 | against northeast, if either the Eiffel Tower or basketball shows up, I'm screwed, but the odds
00:10:16.860 | of them actually showing up at the same time are really small. Okay. So then the, so then that's the
00:10:23.180 | reason why you need an SAE or, um, in this case, a transcoder, uh, because you have more concepts than you
00:10:31.900 | have, uh, dimensions, uh, that you can just straight up analyze. And so the transcoder has
00:10:39.180 | a sparsity penalty, akin to an L1 loss if you're familiar with lasso
00:10:48.060 | regression. And that's what encourages it to represent each of these different concepts as a unique
00:10:58.380 | column, as a unique neuron in the matrix as it were, instead of the current representation, where
00:11:04.620 | they're all just sort of jammed in there. Yeah. Um, basically when they train this,
00:11:09.260 | there's, there's two things that they train on. They use a sparsity penalty, which is, you know,
00:11:13.260 | if you've seen the other SAE work, uh, that enforces it to stay sparse. So, you know, single activations
00:11:18.940 | for concepts, and then a reconstruction loss. The reconstruction loss is so that at inference
00:11:24.540 | time, instead of actually running inference through Haiku, we can run a prompt through
00:11:31.980 | our CLT model, our local replacement model, and it has essentially the same output as Haiku or whatever
00:11:38.780 | you're training it on. So this replacement model that we've trained kind of one-to-one matches.
00:11:44.300 | Of course there's some degradation, but you know, it's trained with reconstruction loss. So it's trained
00:11:48.620 | to match the exact output of the big model that you trained on. So technically, you know, you should be
00:11:53.980 | able to swap it in directly. And a lot of this works because, you know, you're freezing the attention
00:11:58.380 | layers and you're specifically training it on a loss to recreate the outputs. And from there,
00:12:04.140 | that's where we have this model that now has these sparse features. But, um, yeah, thanks for that
00:12:09.260 | overview, Ted. It's, it's a little bit better for the math explanation of what's going on here,
00:12:14.220 | but, um, continuing through this, here's kind of what happened. So they have this, uh, reconstruction
00:12:20.220 | error. These are error nodes that happened between the original output and the replacement model.
00:12:25.340 | Then they start to prune features that aren't relevant. So since the model is sparse, right, there's only a few
00:12:30.380 | features per token that actually activate at each layer. So this is layer-wise activation,
00:12:35.820 | right? This is our local replacement model. So for example, for the first layer here, uh, these three
00:12:42.140 | features activated and this one, these three, and this one, these two for these, they look through
00:12:47.500 | the traversal of what activated and what influenced the final output. And then they start to prune,
00:12:52.780 | I think 95% of the ones that didn't have an effect on the output. And now we can see, okay,
00:12:58.220 | what neurons, what kind of activation features impact the output. And from there, we can start to,
00:13:05.020 | you know, generate these attribution graphs. Attribution graphs kind of combine these concepts.
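Conceptually, that pruning step looks something like the following toy sketch. The feature names and weights are made up and the paper's actual procedure is more involved; the idea is just to sum the attribution flowing into the output node over all paths and drop nodes below a threshold.

```python
from collections import defaultdict

# Toy attribution graph: edges carry linear attribution weights between
# features (and the output node). Names and weights are invented.
edges = {
    ("dallas_feat", "texas_feat"): 0.8,
    ("state_feat", "texas_feat"): 0.3,
    ("capital_feat", "say_capital_feat"): 0.7,
    ("texas_feat", "say_austin"): 0.9,
    ("say_capital_feat", "say_austin"): 0.6,
    ("noise_feat", "say_austin"): 0.01,
}

def influence_on(target, edges):
    """Total attribution flowing from each node into `target`, summed over all
    paths (the graph is a DAG, so this simple recursion terminates)."""
    scores = defaultdict(float)
    def walk(node, weight):
        for (src, dst), w in edges.items():
            if dst == node:
                scores[src] += weight * w
                walk(src, weight * w)
    walk(target, 1.0)
    return scores

scores = influence_on("say_austin", edges)
kept = {node for node, s in scores.items() if s > 0.05}   # prune low-influence nodes
print(kept)   # noise_feat is gone; the Dallas -> Texas -> say-Austin path survives
```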
00:13:10.300 | So for these two, for these hierarchical, um, categories, once we cluster them and, you
00:13:15.740 | know, add them on top of each other, what do they represent? So we can see what different features make
00:13:21.740 | up, um, these different tokens. So I didn't find this one to be the most, um, you know, interpretable
00:13:27.980 | because it's on a token split, but they have a lot of these features for different, um, different
00:13:33.020 | concepts, right? So for example, for the word digital here, if we look at it, it's starting to activate
00:13:38.060 | once there's words like smartphones, television companies, there's another feature that takes
00:13:42.700 | it in a different representation, right? So, um, in this one, there's digital suicide,
00:13:48.300 | there's color image, you know, this is like a bit of a different understanding of the word digital.
00:13:52.460 | In this one, there's tech director, right? There's a DVD, which is digital.
00:13:56.620 | In this case, there's, um, mobile devices, same thing for analytics. So web analytics, commercial analytics,
00:14:05.580 | this feature talks about data, quantitative assessments, all, all different features that,
00:14:11.100 | you know, all different features that represent analytics in different, in different, um, domains.
00:14:17.660 | So in this case, there's, um, let's see which other ones make sense. So performance metrics are a way to
00:14:22.860 | analyze, to represent analytics, routines or analytics. Um, but yeah, they kind of start to group these
00:14:29.420 | features into these different things. Then it comes to, uh, how they construct it. Basically,
00:14:35.980 | they have output nodes that are output tokens, and then they prune the ones that don't, um,
00:14:41.420 | really have anything. There's input and output nodes as well. And then we kind of have this whole
00:14:46.700 | interactive chart where you can play around with it. Um, they make it very interactive. Um,
00:14:53.260 | um, they kind of explain what this chart is like. So, uh, for labeling features, you know,
00:14:59.980 | they, they say how there's different understandings for different, for the same concept. Um, I think
00:15:06.700 | that's enough on circuit tracing. If there's questions, we can dig a little deeper and we can
00:15:12.060 | always come back to it. But at a high level, what we've done so far is with a sparsity loss and a
00:15:18.780 | recreation loss, we've kind of created a new local model, which is not small, by the way, the model
00:15:24.060 | has to have the same layers as the original model, and you kind of have to retrain it to match output.
00:15:29.820 | So this is not like cheap per se. It's pretty computationally expensive, but now we've been
00:15:36.060 | able to kind of peel back through different layers, what features kind of activate upon, uh, output.
00:15:42.860 | There's an interesting little section here that talks about how expensive this really is. So estimated
00:15:49.180 | compute requirements for CLT training to give a rough sense of compute requirements to train one.
00:15:54.380 | We share estimated costs for CLTs based on the Gemma 2 series. So on a 2B model, to run 2 million
00:16:01.420 | features and train on a billion tokens, it takes about 210 H100-hours. On a 9B model, it takes almost
00:16:07.740 | 4,000 H100-hours, and that's for 5 million features on 3 billion tokens. Now that's not cheap, right?
00:16:15.100 | That's 4,000 H100-hours. Most people don't have access to that.
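Just to put those numbers in rough perspective, here is a quick back-of-the-envelope calculation. The GPU hourly price and node size below are my assumptions for illustration, not figures from the paper.

```python
# H100-hours quoted above; price and node size are assumed, not from the paper.
runs = {"Gemma 2 2B (2M features, 1B tokens)": 210,
        "Gemma 2 9B (5M features, 3B tokens)": 4000}
price_per_h100_hour = 2.50   # assumed $/hour
n_gpus = 8                   # assumed single node of 8 GPUs

for name, hours in runs.items():
    print(f"{name}: ~${hours * price_per_h100_hour:,.0f}, "
          f"~{hours / n_gpus:.0f} wall-clock hours on {n_gpus} H100s")
```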
00:16:21.420 | But they're able to do this on Haiku, and then we go back into our main blog post of what features they found and what different
00:16:26.700 | little, um, interesting niches. I'll take a little pause here and see if we have any questions on
00:16:32.780 | circuit tracing, what this CLT transcoder model is, um, any questions, any thoughts, any additions,
00:16:39.660 | any comments, just very high level. What we've done so far is we've retrained a model. It matches the layers.
00:16:46.780 | We call it the local replacement model. It matches the layers of the original transformer.
00:16:51.740 | It freezes attention. It replaces the MLP or the feed forward network with this transcoder.
00:16:58.300 | And basically this transcoder model is trained to reproduce the exact same outputs
00:17:04.060 | for given inputs. And then we start to dig deeper into these little, um, sparse features and start to map them.
00:17:09.900 | Uh, they do this, they show the cost of how much it would be then for the big one. So in this paper,
00:17:15.820 | they, they train it on two, two models, 18 layer language model, and then also on Claude Haiku.
00:17:22.540 | The Haiku one is a local model that has 30 million features and you know, you can kind of extrapolate
00:17:28.300 | how expensive that would be. But quick pause, any, any thoughts on circuit tracing, any questions,
00:17:34.140 | or otherwise we can start to continue. The next section is let's start to look at some of these
00:17:39.020 | features. Let's see what happened. Can we, uh, they, they have a few different examples here. So
00:17:43.420 | multi-step reasoning, planning in writing poems, multilingual features, features that kind of
00:17:50.620 | affect medical diagnosis, refusals. They start to do some stuff like different kinds of
00:17:56.620 | clamping. So they clamp in different features. So for example, in the capital-of-the-state-containing-Dallas prompt,
00:18:02.620 | if we take out the Texas feature, well, you know, let's say we throw in the feature for
00:18:08.380 | California. The model will now output, um, Sacramento. Okay. Questions. Why can't we just directly train circuits?
00:18:16.620 | So you kind of are training the circuit. So the circuit tracing is this transcoder. What you are
00:18:22.460 | training is this transcoder network, right? You keep attention flat, you replace it with the MLP,
00:18:27.500 | but you're training this circuit. Um, in terms of directly training on circuits, you're, you're kind
00:18:34.780 | of messing with that feed forward network, right? Like technically this is the exact same thing as our MLP
00:18:42.060 | layer. It's just now you're forcing it to be sparse. Like we've trained a model to do the same thing,
00:18:47.980 | but if you train a model with a sparsity penalty like that from scratch,
00:18:55.420 | you probably won't get very far, right? This is like, in my mind, it's similar to distillation where
00:18:59.980 | you can take a big model. You use a teacher forcing distillation loss to get a small model to mimic it.
00:19:05.420 | But that doesn't mean that you can just train a small model to be just as good.
00:19:08.780 | Um, okay. If we predict SMILES strings, I wonder what concepts we can see. So there's like a very,
00:19:15.420 | very deep interactive bunch of demos here of different, uh, input output prompts,
00:19:19.980 | and you can see what features activate. So I found, um, global weights.
00:19:29.020 | Okay. Well, we'll find it cause it shows up again in the other, in the other one, but okay. We'll,
00:19:35.260 | we'll start to go through the actual biology of an LLM. So going through this, um,
00:19:39.820 | okay. In this paper, we focus on applying attribution graphs to Claude 3.5 Haiku, which
00:19:47.900 | is Anthropic's lightweight model. So they have this introductory example of multi-step reasoning.
00:19:53.260 | Uh, introductory example of multi-step reasoning, planning in poems, multilingual circuits,
00:20:05.180 | addition, where it shows how it does math, medical diagnosis. Uh, we'll start to go through like the
00:20:11.340 | first three of these. And then I think we'll just open it up for people's thoughts and we can dig
00:20:15.100 | through the rest as needed. So brief overview is kind of that circuit tracing case study walking through
00:20:22.220 | this. Okay. Um, they do talk a lot about limitations. If anyone's interested in Mechinterp, uh, they have
00:20:29.260 | a whole like limitations section. They have a future works questions. They have open questions that they
00:20:35.420 | would expect people to work on. But remember, unlike SAEs, which you can do on one layer, this stuff is
00:20:41.260 | pretty compute intensive. So pretty big models you're training, but, um, you know, it's always interesting
00:20:47.820 | stuff for people to work on. Okay. Method overview. This is just high level again of what we just
00:20:52.780 | talked about. You freeze MLP, uh, sorry, you freeze attention. You change MLP to the CLT model. Then we
00:21:00.860 | have feature visualization. They have this error nodes that they have to add in. This is the local replacement
00:21:05.820 | model. So Texas capital is Austin. It goes through these different features. Okay. Um, they group these
00:21:13.500 | related nodes on a single layer into super nodes. So we have one, we have, um, graphs, right? So
00:21:20.460 | basically graph networks are kind of useful in this sense because each node is kind of a concept,
00:21:25.580 | but then the edges between them can go throughout layers, right? So on a layer wise, they call these
00:21:31.340 | super nodes and they kind of stack them together. So in this case, let's look at the features that activate
00:21:36.700 | for the word capital. So, um, obviously terms like city, uh, buildings, uh, there's another feature
00:21:45.100 | for, I guess this is a multilingual one. There's one for businesses, you know, capital, uh, cyber attacks
00:21:52.620 | that happen, venture capital. What else have we got? We've got states, we've got the concept of the United
00:21:58.860 | States, France. So countries, um, now we've got another feature that, you know, it actually fires
00:22:05.260 | up when we talk about specifics. So Connecticut, um, I think there's one here for languages as well,
00:22:10.700 | which was pretty interesting. So like capital letters, you know, um, of course a bunch more cities.
00:22:16.140 | Um, that's kind of the basic graph, right? So for Texas, we've got stuff like income tax, big,
00:22:23.180 | far, um, Austin, different things that Texas is like. So these are kind of these super clusters.
00:22:31.500 | Um, this is their example of intervention. If they clamp down the feature of Texas,
00:22:36.620 | well now, you know, Texas capital, well, instead we're going to go through capital, say a capital,
00:22:41.420 | then we observe that if we take out Texas, it instead decides that Sacramento is pretty important.
00:22:47.020 | It's, it's the capital that it decides to predict. So, uh, we can clamp down on these.
00:22:51.820 | Not sure. I understand why transformer attention KV matrices are needed to be frozen. It's needed to
00:22:56.780 | be frozen because they don't want to train more than what they need in the circuit tracing, right?
00:23:01.180 | They're basically doing this sparsity loss. And once you start messing with attention and training
00:23:06.540 | in this objective, you're kind of going to mess stuff up, right? So all they're really trying to
00:23:11.660 | do in circuit tracing is just train this, um, this replacement layer. They're, they're just training
00:23:18.220 | these sparse transcoders. They're, they're not trying to, they're not trying to mess with attention.
00:23:23.900 | So attention is a lot of the training, but you know, perhaps they could unfreeze it and we'd start to
00:23:29.500 | get a weird aspect where, you know, now you have randomly your zero initialized weights. Um, and it's
00:23:37.260 | not what we're trying to look at, but you could also do this through, um, the, the attention layers
00:23:41.980 | are still kind of mapped. Right. But, um, yeah, that's why they freeze
00:23:47.420 | attention. Okay. Uh, continuing through this, this is their first example of let's see if we can see
00:23:54.780 | multi-step reasoning in, um, Claude 3.5 Haiku. And this is not a thinking model. This is just a regular
00:24:02.060 | next token prediction model. How does it come to the output? So let's consider the prompt, uh, fact,
00:24:08.060 | the capital of the state containing Dallas is, and then of course, Haiku is pretty straightforward.
00:24:13.100 | It answers, uh, Austin. So this step, this question, this prompt takes two steps, right?
00:24:18.140 | First, you have to realize that it's asking about the state containing Dallas. So, um, it's asking
00:24:23.900 | about the capital of the state containing Dallas. So first, what state is Dallas in? I have to think,
00:24:29.260 | okay, it's in Texas. Second, I have to think, what is the capital of Texas? It's Austin. So kind of two
00:24:35.900 | steps to this answer. Right now, the question is, does Claude actually do these two steps internally
00:24:41.980 | or does it kind of just pattern match shortcut? Like it's been trained enough to just realize,
00:24:46.540 | oh, this is obviously just Austin. So let's peel back what happens at different layers.
00:24:51.420 | Let's see what features activate and see if we have any traces of these, this sort of thinking work,
00:24:56.700 | right? Does it have these two steps? Um, previous work has shown that there is evidence of genuine,
00:25:03.020 | of genuine multi-hop reasoning to various degrees, but let's do it with their attribution graph.
00:25:08.700 | So here's kind of, um, what they visualize. So first we find several features for the word,
00:25:13.980 | the exact word capital. So the word capital has different features, right? So there's a business
00:25:21.740 | capital. There's all this, um, capital of different countries. There's these different
00:25:27.020 | features that they group together. They actually have cities as well. So Berlin, Athens, Bangkok,
00:25:32.060 | Tokyo, Dublin, um, top of buildings. One example, um, there's, there's several features. Okay.
00:25:39.260 | Then there's output features. So landmarks in Texas, these show up for, um,
00:25:45.900 | one feature activates on various landmarks. So there's a feature around suburban district,
00:25:52.220 | Texas history museum, some seafood place. Uh, we also find features that promote saying a capital. Okay,
00:26:02.060 | features that promote outputting a capital generally, so responding with a variety of US state
00:26:08.140 | capitals. This feature talks about different capitals: headquarters, state capitals of
00:26:16.620 | various countries, Maryland, Massachusetts. But going through all that, here's kind of where we end up.
00:26:22.620 | So fact, the capital of the state containing Dallas is when we look at capital, here's the different
00:26:28.140 | meanings of it, you know, um, state Dallas. Then when we go one, one level deeper, it looks like,
00:26:35.420 | oh, there's this super node of say a capital, say a capital has capitals, crazy concept. It maps to
00:26:44.060 | capitals. Texas has, you know, examples of different things in Texas. So Houston, Austin, San Antonio,
00:26:51.980 | uh, features for, you know, different things croquet that happens here, this place teacher stuff. Um,
00:27:01.980 | the attribution graph contains multiple interesting paths. We summarize them below. So the Dallas feature
00:27:07.980 | with some contribution from the state feature activates a group of features that represent concepts
00:27:13.180 | related to the state of Texas. So Dallas and state together have features of Texas.
00:27:22.620 | Um, kind of interesting, right? In parallel, features activated by the
00:27:28.860 | word capital activate another cluster used to say the name of a capital. So the capital features
00:27:35.580 | feed say-a-capital features. The Texas features and the say-a-capital features eventually
00:27:42.860 | lead to say-Austin. So putting these two together, we have, you know, say a capital plus
00:27:49.180 | Texas, leading to say Austin. Um, then they start to do some of this clamping work.
00:27:54.860 | Clamping is pretty interesting, right? So if we look at the most probable prediction,
00:27:59.420 | um, you know, capital of the state, Dallas, say Austin, Austin is most likely. If we take out
00:28:05.980 | this feature of say a capital capital of state, Texas. Well, uh, if we take out capital right now,
00:28:12.540 | it's just going to say Texas. If we take out Texas, it's just going to say capital of state,
00:28:17.660 | Dallas, say a capital, and then it's kind of confused, right? So, um, they have little different
00:28:24.460 | things as you, as you take out stuff. So if we take out capital state of Dallas, still Texas, if you take
00:28:30.140 | out, um, state, it's still going to say, it's going to say Austin now. So capital, Dallas, Texas still says
00:28:38.940 | Austin. From here, they start swapping in features. So if we swap in California, the feature for California
00:28:44.700 | is pretty interesting, right? We see ferry building marketplace, um, universal studios,
00:28:50.460 | sea world. You have a bunch of features that activate for California. Uh, what else have we got here?
00:28:58.620 | different features outdoor San Jose. These are cities. So these are cities in California.
00:29:03.660 | Um, Stockton, these are more cities, Riverside, Oakland, this one, the governor Republican. So this
00:29:11.500 | is kind of the political feature for California. Once they clamp this into the Dallas thing, if they
00:29:16.860 | replace this, the capital of the state containing Oakland is, um, they can get Cal, they, they can,
00:29:23.020 | oh, sorry. So they, they change the prompt, you know, the capital of the state containing Oakland,
00:29:27.340 | they find a California feature, a super feature of California. Then they can clamp it back in. Uh,
00:29:32.940 | when they clamp in the capital of Dallas, they replace it with our California feature. It says
00:29:37.420 | Sacramento. They do it to Georgia. They say it says Atlanta, uh, British Columbia says Victoria.
00:29:43.660 | They find like the, the British Columbia feature has stuff like, you know, Canada and whatnot.
00:29:48.060 | If they heavily add in China, it says Beijing. So this is kind of their process of how do we find
00:29:56.380 | these super features? Here's how we can find one. You know, we change the prompt to Oakland. We find
00:30:02.140 | something that represents California, a group of features. We swap that back into our original prompt
00:30:07.340 | of Dallas. And you know, now we get Sacramento. We can do the same thing for other things that we can
00:30:12.460 | kind of start to interpret this stuff. So that's kind of their, their first multi-step reasoning.
00:30:17.500 | So we can one, see that the model has this two level approach, right? So it first has to figure out,
00:30:23.340 | um, what state, then the capital of that state. And it's starting to do that. We can see that through
00:30:28.380 | the layers. The second one is we can start to clamp these features through, uh, Ted, do you want to
00:30:33.820 | pop in? Yeah. Just a super quick thing. So they do all of this circuit analysis on the replacement model,
00:30:40.300 | because it's way easier to analyze the replacement model. It's smaller, it's linear, it's all that
00:30:45.740 | stuff. But these experiments you show where they replace whatever Texas with California,
00:30:51.260 | those are done on the original LLM. That's, that's super important. So they're not trying to prove the
00:30:56.860 | replacement works this way. They're trying to prove the original LLM works the same way as the replacement.
00:31:02.860 | And so, um, in the chat, like, like this could all be a bunch of BS, but because the intervention
00:31:09.500 | works on the original model. So if you, if you said that, you know, the, the ligament in my leg
00:31:15.660 | is connected to vision and you, you cut that and I can't walk, but I can still see perfectly fine.
00:31:21.420 | Then your explanation is probably wrong. But if you say the optic nerve is, is really important
00:31:26.780 | for vision and you cut that and suddenly I'm blind then, but I can do everything else. I can walk,
00:31:32.700 | I can taste, I can do everything else just fine. That's pretty strong support that the, that the,
00:31:37.500 | the one thing you cut is critical component just for what you said it was.
00:31:42.060 | Yeah. Um, yeah, all this is still done on the original model. Uh, someone's asking what layers
00:31:49.740 | generate these super node features. So there's super nodes across different layers, right? So,
00:31:54.700 | uh, this is throughout different layers. There's one for California here, there's Oakland at this level.
00:32:01.340 | So it's kind of throughout, they have a lot of interactive charts that you can play through this
00:32:05.660 | to go through different layers. These are just kind of the hand cherry picked examples and they
00:32:10.860 | acknowledge this as well. They acknowledge that what they found is cherry picked and heavily
00:32:15.660 | biased towards what they thought was, you know, here's what we see. Here's what we should dig into.
00:32:21.100 | It's a bit of a limitation in the work, but nonetheless, it's still there. Um, another example that they
00:32:27.100 | show is, you know, uh, planning in poems. So how does Claude 3.5 haiku write a rhyming poem?
00:32:32.860 | So writing a poem requires satisfying two constraints at one time, right? There's two things that we have to do.
00:32:38.460 | The lines need to rhyme and they need to make sense. There's two ways that a model could do this. One is
00:32:44.300 | pure improvisation, right? Um, the model could just begin each line without regard for needing to rhyme. Uh,
00:32:51.340 | sorry, the model could write the beginning of each line without regard for needing to rhyme at the end.
00:32:55.900 | And then the last word just kind of has to rhyme with the first, or there's this planning step, right?
00:33:01.740 | So you can either just kind of start. And as you go, think of words that rhyme, or you can actually plan
00:33:07.820 | ahead. So this example tries to see, is there planning when, when I tell you to write a poem and I give you
00:33:13.740 | a word to start with, like, you know, write a poem and have something that rhymes with the word tape.
00:33:19.180 | I forced you to have the first word, and then you can start generating words that rhyme with tape.
00:33:24.060 | Or if I tell you to write a poem about something you can plan in advance before the first word is written.
00:33:29.340 | So, um, even though the models are, you know, trained to think one token at a time and predict the next
00:33:36.460 | token outside of, you know, thinking models, uh, we would assume that, you know, the model would rely
00:33:41.740 | on pure improvisation, right? It will just kind of do it on the fly. But the interesting thing here is
00:33:47.100 | they kind of find a planning mechanism per se in what happens. So specifically the model often activates
00:33:54.300 | features corresponding to candidate end of next line words prior to writing the line.
00:33:59.020 | So before like the net, before the rhyming word is predicted, even if it's at the end of the line,
00:34:06.700 | we can see traces of it starting to come up pretty early on. Um, so for example, a rhyming, a rhyming
00:34:14.220 | couplet: he saw a carrot and had to grab it; his hunger was a powerful habit, or starving like a rabbit.
00:34:21.100 | Um, these words start to show up pretty early on. So first let's look at, you know, where do these features
00:34:27.660 | come from? Um, what are the different features that form them? So for habit, um, there's
00:34:35.340 | just different features: a mobile app that gamifies
00:34:43.020 | habit tracking, habit tracker, habit formation, uh, budgeting, rapid habit formation, discussing
00:34:53.020 | habits with doctors. So, you know, once again, they've got this concept of habit. Uh, let's see where it starts to come in.
00:34:59.340 | So before they go into their thing, they talk about prior work: this adds to a body of evidence that sequence models plan ahead, in several ways.
00:35:07.020 | They provide a mechanistic account for how words are planned, forward planning, backward planning, the model...
00:35:14.700 | Oh, shit. Um...
00:35:16.700 | Here we are. Uh, the model holds multiple possible planned words in mind. We're able to edit the model's planned words.
00:35:28.700 | They discover the mechanism with an unsupervised bottom-up approach, and the model represents planned words with
00:35:36.380 | ordinary features. Okay. Planned words and their mechanistic role. So, um, we study how Claude completes
00:35:43.580 | the following prompt asking for a rhyming couplet. The model's output sampling the most likely token is shown
00:35:48.380 | in bold. So, uh, this is kind of the input, a rhyming couplet. He saw a carrot and had to grab it. The output we get is,
00:35:55.420 | his hunger was like a starving rabbit. So model, the output is coherent. It makes sense and it rhymes, right?
00:36:02.060 | So, uh, starving rabbit, carrot, kind of all rhymes there. To start, we focus on the last word of the
00:36:09.420 | second line and attempt to identify the circuit that contributed to choosing rabbit. So
00:36:15.500 | this makes sense, right? Rabbits like carrots, um, grab it, rabbit, it rhymes. So there's kind of that
00:36:22.620 | two-step thing. Was it just the last token predicted or did we have some thought to it?
00:36:26.700 | Okay. So these are kind of the features over the tokens: it, hunger was like, starving. Okay. Let's
00:36:34.460 | start to dig through this. So there's a feature here for words that rhyme with the "it" sound,
00:36:45.340 | and they have features that activate across different languages,
00:36:50.300 | uh, that, you know, have this sort of rhyming property. Then they have rabbit and habit candidates that came up,
00:36:55.420 | um, a say-rabbit feature, and then this feature for the "-t" ending, and then, oh, cool, we got rabbit. What does this show?
00:37:03.900 | The attribution graph above, computed by attributing back from the rabbit output node, shows
00:37:09.980 | an important group of features that activate on the newline token before the beginning of the second line;
00:37:14.620 | features also activate over the "it" token. So, um,
00:37:20.540 | basically the second last output token where, um, grab it had features that activated these different,
00:37:30.940 | um, you know, sort of rhyming tokens. The candidates have, uh, the candidate completions in turn have
00:37:38.300 | positive edges to say rabbit features over the last token. So that's this hypothesis. We perform a
00:37:43.660 | variety of interventions on new line planning sites to see how probability, how it affects the probability
00:37:48.540 | of the last token. Okay. So let's, uh, 10 X down the word habit and we've got different changes, 10 X up
00:37:57.980 | and down new line, um, different things affect different things. The results confirm our hypothesis that features
00:38:04.460 | that planning features strongly influence the final token. So if we kind of take out that new line
00:38:10.780 | token, we can see, oh, it's a, it's not doing this anymore. Okay. Planning features only matter at
00:38:16.860 | planning location, planning words, influence immediate words, nothing too interesting here. Okay. Clamping
00:38:23.740 | was a line to lead to transformer. How do they map trans corridor back to transformer? Say we clamp Texas.
00:38:28.540 | So in, there's a question around the clamping stuff and how this is working. The previous SAE thing that
00:38:34.220 | they put out in May, it explains how they do all these clamping features. Uh, basically same thing.
00:38:39.740 | There's more in here as well. In both of these papers, they kind of go into the math about it as well, but
00:38:45.660 | keeping it high level. Let's just kind of try to see, um, some more of these planned words. So, um,
00:38:52.860 | yep, we can, we can sort of see as we take out different things, uh, we no longer have this planning
00:38:59.900 | step. Okay. I'm going to go quickly through the next few ones, ideally in the next like seven minutes,
00:39:05.980 | and then we'll leave the last 10 minutes for just questions and discussions on this. So we first,
00:39:11.180 | you know, we just saw how there's pre-planning in poems for rhyming. There's this multi-step sort of
00:39:17.660 | thinking that happens throughout layers. Uh, now we've got multilingual circuits. So models, uh, modern
00:39:24.460 | thinking that happens throughout layers. Uh, now we've got multilingual circuits. So modern
00:39:30.220 | networks have highly abstract representations that unify concepts across multiple languages, but we have
00:39:36.060 | little understanding of how these features fit into larger circuits. Let's see how it, um, you know,
00:39:43.500 | that are consistent through different languages? Um, also fun fact, I guess rabbits don't eat carrots.
00:39:49.740 | Carrots are like treats. Crazy, crazy. Someone knows about, um, rabbits. Okay. So, um, the opposite of small is,
00:39:58.220 | and then we would expect big; in French it's grand, in Chinese it's the corresponding character. Um, let's see if there's
00:40:05.260 | consistency across these features. So, high level, the story is the same: the model recognizes the task using a language-
00:40:11.340 | independent representation. So very interesting. There's a language-independent representation. So this feature of
00:40:20.700 | say-large is something that activates across all three languages. Let's see some of the features. So, um,
00:40:29.020 | large has stuff like, you know, 42nd order. Uh, there's a Spanish version in here, uh, short arm and long arm.
00:40:36.940 | It activates in this language. It activates in a numerical sense. It activates small things. Great. Um,
00:40:47.340 | this feature is kind of multilingually representing the word large. Um,
00:40:52.220 | same thing with antonyms. So, um,
00:40:57.580 | yeah, there's, there's kind of just these high level features that activate. So
00:41:03.420 | the opposite of small is little, uh, there's a synonym feature, antonym, antonym kind of synonym, multilingual,
00:41:10.300 | say small, say cold, say large. Um,
00:41:13.660 | very interesting. Editing the operation antonyms, the synonyms is kind of another one.
00:41:17.660 | They can kind of clamp this in. So, um, they show how that works: editing small to hot.
00:41:24.860 | Okay. Editing the output language. There's another thing that we can start to do. So if we start to
00:41:30.300 | swap in the features for different languages, you know, we can get output in different language. Um,
00:41:35.900 | more circuits for French, I think it's okay. You can go through this on your own time. Do models
00:41:41.820 | think in English? This is an interesting one. As researchers have begun to mechanistically
00:41:46.860 | investigate multilingual properties of models, there's been tension in our link in our literature.
00:41:51.420 | investigate multilingual properties of models, there's been tension in the literature.
00:41:58.060 | On the other hand, there's present evidence that models, um, you know, they, they use English
00:42:04.220 | representation. It's, uh, so what should we make of this conflicting evidence? It seems to us that
00:42:09.820 | Claude 3.5 Haiku is genuinely using multilingual features, especially in the middle
00:42:15.820 | layers. So in middle layers, we see multilingual features. Um, but there are important mechanistic
00:42:22.780 | ways in which English is privileged. For example, multilingual features have more significant direct
00:42:28.300 | weights to corresponding English output nodes, while non-English outputs are more strongly mediated by
00:42:34.140 | language-specific features. So kind of interesting. There's still a bit of an English bias, but you know,
00:42:39.100 | there are definitely some inherent, um, multilingual features there. Okay. Next example is addition.
00:42:47.500 | Uh, we want to see how does Claude add two numbers like 36 plus 59. Uh,
00:42:54.060 | they found that it splits the problem into multiple pathways, computing the result at a rough
00:43:01.180 | precision in parallel with computing the ones digit of the answer, before combining these to get
00:43:06.940 | the correct answer. They find a key step performed by a lookup table feature. Ooh, very interesting.
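As a loose analogy for the decomposition they describe, here is a hand-written illustration of the idea. It is not the model's actual circuit, and the combination rule is only tuned to this pair of operands.

```python
a, b = 36, 59

# Ones-digit pathway: only the last digits matter.
ones = (a % 10 + b % 10) % 10               # -> 5, "the answer ends in 5"

# Rough-magnitude pathway: a low-precision estimate of the answer's size.
rough = round(a, -1) + round(b, -1)         # 40 + 60 -> "roughly 100"

# Lookup-table-style combination: the value near the rough estimate with the
# right ones digit. (Only works for this operand pair; the real circuit is a
# learned mechanism, not this formula.)
answer = next(n for n in range(rough - 5, rough + 5) if n % 10 == ones)
print(answer)   # 95
```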
00:43:12.060 | The lookup table features translate properties of the inputs into properties of the output. Okay. Let's kind of see what's
00:43:17.340 | going on first. They visualize the roles features play in addition problems, showing the activity
00:43:24.300 | of features on the equals token for prompts like calc: a plus b equals. So there are addition features for
00:43:31.420 | calc: a plus b equals, and they kind of have this lookup table of features. It's very interesting how it's
00:43:37.340 | doing addition. Uh, this one gets a little bit complex in how we go through what's happening here in this
00:43:45.180 | case. And for the sake of time, I think that's enough of a little overview. They can, of course, mess with
00:43:51.260 | its math. Let's go on to the next one. Medical diagnosis. This is a fun one. In recent years,
00:43:58.220 | researchers have explored medical applications for LLMs, for example, aiding clinicians in accurate
00:44:03.580 | diagnosis. So what happens? Thus, they are interested in whether their methods can shed light on the reasoning
00:44:10.300 | models perform internally in medical contexts. They study an example scenario
00:44:16.780 | in which a model is presented with information about a patient and asked to suggest a follow-up
00:44:22.140 | question to inform diagnosis and treatment. This mirrors common medical practice. Um,
00:44:28.540 | okay. So let's see what happens. Um, human: a 32-year-old female at 30 weeks gestation with
00:44:35.580 | a mild headache, nausea, and so on. Um, if we can only ask about one symptom, what
00:44:42.540 | would we ask? Assistant: visual disturbances. So the model's most likely
00:44:48.620 | completion here is visual disturbances, which is a key indicator for the condition here, preeclampsia. Okay. We notice that the
00:44:56.220 | model activated a number of features that activate in the context of this condition in people.
00:45:03.900 | So what are the features that come into this? Okay, their UI is struggling a bit. There are features around
00:45:10.700 | preeclampsia, gestational hypertension, and blood
00:45:18.220 | pressure, proteinuria, stroke. Um, some of the other features were on synonyms of this and other
00:45:27.020 | activations in broad context, kind of interesting, right? So they do see this kind of internal
00:45:32.940 | understanding. They have more examples of this for different stuff. So, you know, if we could only ask
00:45:37.820 | stuff, it's whether he's experiencing chest pain in this one, whether there's a rash and they kind of go
00:45:43.180 | through what are some of the features that make this stuff up here. Uh, pretty interesting. I think
00:45:47.500 | you should check it out if interested. Okay. Uh, 10 minutes left. I think there's a lot
00:45:53.180 | more of these: clamping, entity recognition, refusals. But okay, I want to
00:45:59.740 | pause here. See if we have any other comments, questions, thoughts, things that we want to dig more into.
00:46:09.900 | I'm going to check chat, but see if there's, um, yeah, if anyone has any stuff they want to dig into,
00:46:16.380 | let's feel free, you know, pop in. Could you do a quick overview of the hallucination section?
00:46:25.260 | Yeah. Let's just keep going. Um, so entity recognition and hallucination. So basically,
00:46:34.380 | hallucination is where you make up false information, right? Hallucination is common when models are asked
00:46:39.420 | about obscure facts because they like to be confident. An example, consider this hallucination by, uh,
00:46:45.260 | given by Haiku 3.5. So the prompt is "Michael Batkin plays the sport of", and the completion is "pickleball, which is a paddle
00:46:53.100 | ball sport" that combines elements of other sports. The behavior is reasonable given the model's training
00:46:59.020 | data: a sentence like this seems likely to be followed by the name of a sport, and without any information about who this
00:47:04.700 | guy is, the model picks a plausible sport at random.
00:47:12.620 | However, during fine-tuning, models are trained to avoid such behavior when acting in the assistant character;
00:47:19.740 | this leads to responses like the following. So the base model Haiku, without its kind of,
00:47:25.260 | um, you know, RL chat tuning. It just completes this and says, Oh, the sentence sounds like a sport.
00:47:32.460 | I will give you a sport. Now, after their sort of training, what sport does this guy play answer in
00:47:38.460 | one word, the model's like, oh shit, I can't do that, I don't know who this is, I need context. Given that
00:47:43.820 | hallucination is in some sense the natural behavior, which is mitigated by fine-tuning, we take a look at
00:47:49.420 | the circuits that prevent models from hallucinating. So they're not really, in this
00:47:54.780 | sense, looking at hallucination and what caused it. They're looking at how they fixed it. So,
00:47:58.460 | uh, quick high level TLDR. We have base models. We do this RL or SFT and we convert them into chat
00:48:05.900 | models, right? In that we have this preference tuning. One of the things that they're trained to
00:48:10.060 | do is be a helpful assistant. And that the objective is kind of, if you don't know what to say,
00:48:16.060 | you know, you tell them, you don't know, and you ask for more context. So base model
00:48:20.220 | would just complete tokens and be like, this guy plays pickleball. Cause it sounds like he plays a
00:48:24.140 | sport. Um, there's probably a famous Michael or two or, or Batkin that play pickleball. Um,
00:48:31.020 | assistant model is like, yo, I don't know who this guy is. So let me ask for more information,
00:48:35.820 | but let's start to look at, um, what are these features that make that up? So hallucinations can be
00:48:41.980 | attributed to a misfire in the circuit. For example, when asking the model for papers written by a
00:48:48.060 | particular author, the model may activate some of these known-answer features,
00:48:52.940 | even if it lacks specific knowledge of the author's papers.
00:48:59.260 | This one is kind of interesting, right? Their results relate to recent findings of Ferrando et al.,
00:49:06.780 | which use sparse autoencoders to find features that represent unknown
00:49:11.660 | entities. So, okay. Human: in which country is the Great Wall located? It says China. In which
00:49:18.140 | country is this based? Okay. Um, known answer, unknown answer, different features, and
00:49:25.180 | default refusal circuits. There's a can't-answer feature that broadly
00:49:34.140 | fires for Human/Assistant prompts. The picture suggests that the can't-answer feature is activated by
00:49:40.380 | default for Human/Assistant prompts. In other words, the model is skeptical of the user by default. So they kind of show
00:49:45.500 | this can't-answer circuit. Can't-answer features are also promoted by a group of
00:49:53.820 | unfamiliar names. So names that it doesn't recognize are, I guess, a feature, and these kind of just prompt
00:50:00.780 | it to say, I can't answer, I don't know. Okay. Now what about the known-answer circuit? So
00:50:07.420 | what sport does Michael Jordan play? He plays basketball. So there's a group of known answer and known entity
00:50:13.420 | features. These are what accidentally misfire when you get hallucination. That's a bit of a spoiler,
00:50:19.420 | but you know, uh, known answer is like different features that kind of, you know, what answer is,
00:50:27.260 | what country is this based in? It knows Japan. What team does Devin Booker play on? It knows the answer.
00:50:33.820 | Where's the great wall located? These are kind of known internal facts. There's a feature for it. Once this fires,
00:50:39.740 | you're cooked, it's going to answer and you know, it'll hallucinate. Once this has gone off, there's
00:50:43.740 | strong evidence for that. Um, this graph, these graphs are kind of a little interesting. They kind
00:50:48.540 | of show both sides. Um, so this is kind of the traversal throughout the layers in the RL, right?
00:50:55.180 | So we had, um, this assistant feature because it's an assistant unknown name was a feature. And then,
00:51:01.740 | you know, that leads to, I can't answer because this thing has been RL to not answer stuff. I apologize.
00:51:07.740 | So can't answer. I apologize. I can't figure this out. That's where the next turns come up because that
00:51:12.140 | shows up after that. Now what about something we do know? Michael Jordan. Oh, I know this answer. There's
00:51:17.100 | a bunch of stuff that, sorry. So first we have assistant and Michael Jordan, and in layer one, known
00:51:23.820 | answer. Oh my God. Okay. Known answer: I know a bunch of these facts, so say basketball,
00:51:29.180 | and then basketball. Now let's once again do a bunch of fun clamping stuff,
00:51:34.780 | right? So, um, if we have Michael, Michael Jordan and we have known answer and we clamp it,
00:51:44.300 | it says basketball. What if we clamp down known answers? If we take that feature, we turn it down,
00:51:49.980 | even though the question is what sport does Michael Jordan play? We clamp down known answer. Well,
00:51:55.580 | it can't answer because the other one that fires up is unknown answer. Um, what sport does it play if we,
00:52:02.540 | if we still have strong known answer and we add in unknown name, it still says basketball,
00:52:07.980 | um, little stuff here to kind of go through, but that's kind of a high level of what's going on.
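A cartoon of the mechanism described above, as I understand it (my own toy, not the actual circuit): "can't answer" is on by default for Human/Assistant prompts, "known answer / known entity" inhibits it, and a misfire of "known answer" on something the model doesn't really know is what lets a hallucination through.

```python
def assistant_response(known_answer_strength: float, threshold: float = 0.5) -> str:
    """'Can't answer' is on by default and gets inhibited by 'known answer'."""
    cant_answer = 1.0 - known_answer_strength
    if cant_answer > threshold:
        return "I apologize, but I can't find a record of this person."
    return "<confidently states an answer, correct or hallucinated>"

print(assistant_response(0.9))   # Michael Jordan: known-answer fires, so it answers
print(assistant_response(0.1))   # Michael Batkin: default refusal wins
print(assistant_response(0.7))   # misfire on an obscure author: hallucination slips through
```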
00:52:13.820 | I thought the academic papers one is another interesting example. So, um, same concept, but you know,
00:52:19.260 | this is this unknown-name feature stuff. So: name a paper written by Andrej Karpathy. One notable
00:52:28.060 | paper is ImageNet. Um, there's kind of the same thing, known answer versus unknown name. If we swap them
00:52:34.940 | around, what happens? Um, pretty, pretty fun stuff, you know? Okay. That's kind of a high level of what's
00:52:43.180 | happening in the hallucination. Refusals was another interesting one. It's kind of interesting to see
00:52:48.140 | some of the output to the base model and how their RL is like, you know, showing how this stuff works.
00:52:53.340 | They have known entity and unknown entity features and say, I don't know features. The I don't know
00:52:58.940 | feature is much less interesting. The unknown feature relates to self knowledge. Okay. Yeah. Just
00:53:05.260 | interesting thoughts on this. Cool. Three more minutes. Any other fun thoughts, questions, comments?
00:53:11.980 | I would recommend reading through just examples of this. Um, and if you haven't, the SAE one is pretty
00:53:17.820 | fun too. Here's kind of the limitations, what issues show up, um, discussion. What have we, what have we
00:53:29.100 | learned? Um, yeah, kind of high level. Interesting. The background of this is kind of this circuit tracing
00:53:36.460 | transcoder work. It's very interesting how they can just train a model with a reconstruction loss and
00:53:42.140 | just have it match the output because you know, these models are still only 30 million features,
00:53:47.500 | even though they have the same layers, it's still outputting the exact same outputs. Kind of interesting.
00:53:52.540 | Uh, do folks think the taxonomy of circuits, how circuits are divided, will likely converge to the
00:54:00.460 | same breakdown for every model, or will different models do things differently? I think that
00:54:05.020 | different models might do different things differently, right? Cause this is layer wise
00:54:08.860 | understanding different models have different architectures, different layers, whether they're
00:54:13.100 | MOEs, they're also trained in different ways, right? So the pre-training data set mixture kind of affects
00:54:19.420 | some of this. So what if you're trained on high value, you know, training data first, and then garbage
00:54:26.380 | at the end, you've probably got slop in your model, but you know, you might have different circuits that go
00:54:31.580 | throughout. Um, and then there's obviously some general variety. Um, they did though, they did
00:54:37.900 | actually in this one train it on a 18 layer language model, just a general model on a couple billion tokens.
00:54:43.900 | And it still has coherency in what you expect. This still goes back to like early transformer stuff. You know,
00:54:48.940 | we have a basic understanding that early layers are more general and later layers are more niche and output-specific, but.
00:54:55.660 | Okay, that's kind of, um, kind of time on the hour.
00:55:01.580 | I think next week and the week after we have a few volunteers, if Llama 4 drops a paper, we'll cover it,
00:55:10.700 | of course. But I think we have a few volunteers. We'll, we'll share in discord. What's what's coming
00:55:16.060 | soon? If anyone wants to volunteer a paper, if anyone wants to, you know, follow up, please share.
00:55:22.300 | Thanks, Ted, for sharing insights as well, by the way.
00:55:24.780 | Um, I think we have a, don't we have a potential speaker for next week?
00:55:32.300 | Yeah, I thought you had one. I also have one.
00:55:34.700 | Uh, yeah, but mine moved back after your guy came in.
00:55:39.020 | Oh, okay. Okay. Well, I'll, I'll share details. Um, in discord.
00:55:42.780 | Okay. Oh, there's questions in the Discord. Yeah.
00:55:46.140 | Uh, are there open-weight sparse autoencoders?
00:55:49.100 | Yes. I think Gemma trained some.
00:55:51.020 | Ooh, Gemma trained some. Um, there's some layer-wise ones.
00:55:55.820 | So there's some that have been done on, like, Llama 3 8B for each layer, but not throughout the whole
00:56:01.180 | model. Um, but yeah, the cross-layer transcoder is different; it's not a per-layer autoencoder.
00:56:07.260 | Uh, they do give recipe and, you know, expected cost to do this yourself though.
00:56:14.620 | Okay, let's continue discussion in discord then. Thanks for attending guys.
00:56:24.140 | See you.