
Anthropic: Circuit Tracing + On the Biology of a Large Language Model


Transcript

So I presented the other three Anthropic SAE mech interp papers. I think I'll just share the slides after this, so if people want to go through them, go through them. But basically the very first one they did was like, okay, forget LLMs. Let's just see if we can interpret basic stuff in a transformer.

So let's just pull it up. Anthropic SAE. So the first one was basically let's take a toy model. I think it was just like, you know, three layers. Like they had a basic encoder and then, you know, the sparse auto encoder. And they just trained a toy model, like a couple hundred million parameters that input output, and they have this middle layer, which was just an encoder.

Can we start to interpret what's going on in that encoder? Turns out, yeah, they can find some features. After that, they had like, or that was just encoders. Then they started to make it sparse. This was kind of the big one that became pretty popular. I think this came out in May.

We covered it shortly after. They applied this SAE work to Claude Sonnet. They found out, oh shit, we can find features that occur in the model. So they basically train a sparse autoencoder to match these inputs and outputs, and then they start to interpret them and map out features.

The high-level TLDR is that we can now map features that activate in this sparse autoencoder. So the whole thing is you train an encoder to stay sparse: you only want a very small number of features to activate for any output. In this case, they found stuff like Golden Gate Claude.

There's specific features that they trained in their little auto encoder that, you know, activate when specific topics come up. So they had a feature that would always fire up for words like Golden Gate. They had stuff for like tourism, for infrastructure. There were features that extended throughout like multiple concepts.
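To make the SAE idea concrete, here is a minimal sketch in PyTorch. It is illustrative only; the layer sizes, penalty weight, and module names are made up, not Anthropic's actual setup, and the real dictionaries had millions of features:

```python
# Minimal sparse-autoencoder sketch (illustrative only; sizes and names are
# placeholders). Real runs used dictionaries with millions of features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, resid):
        # ReLU keeps feature activations non-negative; the L1 term below pushes
        # most of them to exactly zero, so each token lights up only a handful
        # of (hopefully interpretable) features.
        feats = torch.relu(self.encoder(resid))
        return feats, self.decoder(feats)

sae = SparseAutoencoder()
resid = torch.randn(8, 512)          # stand-in for residual-stream activations
feats, recon = sae(resid)
loss = ((recon - resid) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + sparsity
```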

So, you know, it's not just one feature to one thing. But yeah, they have a pretty good long blog post on this. They started grouping them, and they had different sizes: a 1-million, 4-million, and 34-million feature autoencoder. From there, it's been a few months, and now they're like, okay, no more SAEs.

Let's do circuit tracing. So basically SAEs were good, but we never really had a holistic understanding, right? You can apply an SAE to every layer and try to understand what happens in each layer, or you can apply it to just specific parts of the model. You can do it on attention blocks and try to interpret what parts of them are firing up, but this is where they came in with circuit tracing.

So circuit tracing is where you actually train a transcoder model to mimic the input model, and you can do this across layers. So this model actually matches the input model layer by layer, and it maps out what's going on.

Okay. So circuit tracing: this came out a few weeks ago, and this is kind of the high-level overview of what they're doing here. Basically, they train this cross-layer transcoder and they start poking around in Claude 3.5 Haiku, and they start to find features that are consistent throughout layers.

So some of the interesting stuff here is they try to see: can models internally think when they answer questions, things that take two different steps? Does the model start to think through its response in advance, or is it just token predicting?

And they find interesting little case studies where the model actually is doing some thinking. So the first main example they show here is this prompt of "what is the capital of the state that includes Dallas", and you're supposed to say Austin, right? So this is kind of a question that has two steps of thinking, right?

There are two levels of reasoning. The first step is you have to think, what state is the city in, and then, what's the capital of that state? So they kind of go through how they do all this, but let's start off by talking about this previous paper that came out like a week ago about circuit tracing.

So circuit tracing is where they train this transcoder model to replicate the exact input model, and then they do these attribution graphs to figure out what happened. So, high level, here's kind of an overview of what people have done in previous mech interp work.

Um, um, we had transcoders, transcoders were, you know, alternatives to SAE that let us do replacement models. Then we have this cross layer transcoder, which is let's do transcoders that go throughout different model layers. Then we have attribution graphs and linear attribution between features. Um, they prune out the the ones that are not that relevant.

They have a little fill in, we'll go a little quick through this since it's kind of a second, uh, paper, but they did have a little overview that I thought was interesting here. Okay. Uh, big in building an interpretable replacement model. So this is kind of the architecture of what this model is.

So once again, they're going to create an entire model, call it a local replacement model that matches the exact, um, that matches the number of layers for the original transformer. So they, they train two of these. And so they start to give some statistics of what it would be like to train another one.

And I think they talk about how, how much compute this requires on like Gemma 2B and like a 9B model. But essentially what they're doing here is they take a model, they look at the architecture and they freeze the attention. And basically they replace this MLP. So the feed forward layers, they replace the MLP with this cross layer transcoder, and then they can start to make this sparse and have features that we can interpret from it.

So a bunch of math that if you're interested is pretty straightforward, actually, it's just, it's just a straight replacement. It's trained to match the exact input output. Um, so here's a cool little diagram. Basically you have different layers in a transformer, right? This is an original transformer model. You have attention blocks and you have the MLP, right?

So throughout different layers, there's attention, then there's feed forward, attention and feed forward. And then eventually you have output, you pick the most probable token and you know, that's your output. So in the replacement model, instead of these MLP feed forward networks, they're replacing them with these cross layer transcoders.

These cross-layer transcoders speak to each other, and we want to keep them sparse, so there's a sparsity factor: only a few features activate, and then we map each of those to something interpretable. This blog post is actually very long, but that's how they make this local replacement model.
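As a rough sketch of that architecture (simplified to a per-layer transcoder rather than a true cross-layer one, and with placeholder names and sizes, not the paper's code):

```python
# Conceptual sketch of one layer of a "local replacement model": attention is
# kept frozen from the original transformer, and the MLP is swapped for a
# sparse transcoder. The paper's cross-layer transcoder also writes into later
# layers; this per-layer version just shows the shape of the idea.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))      # sparse, interpretable features
        return self.dec(feats), feats

class ReplacementBlock(nn.Module):
    def __init__(self, frozen_attn: nn.Module, d_model: int, n_features: int):
        super().__init__()
        self.attn = frozen_attn
        for p in self.attn.parameters():     # attention stays exactly as trained
            p.requires_grad_(False)
        self.transcoder = Transcoder(d_model, n_features)

    def forward(self, x):
        x = x + self.attn(x)                 # frozen attention path
        mlp_out, feats = self.transcoder(x)
        return x + mlp_out, feats            # transcoder stands in for the MLP

# nn.Identity() is just a stand-in for a real (frozen) attention module.
block = ReplacementBlock(nn.Identity(), d_model=512, n_features=4096)
h, feats = block(torch.randn(2, 16, 512))
```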

Ted, you have a question? Not a question, but is it okay if I add a little bit of color here? Yeah. So one of the things is that the early research along this very same direction on CNNs didn't require any of this stuff. And the reason is that the conventional wisdom now is that the number of things people wanted to represent in a CNN was approximately equal to the number of filters, neurons, layers and such that you have in a CNN.

So CNN wants to find vertical lines, horizontal lines, diagonal lines, and then in the higher layers, triangles, circles, squares, and then eventually faces, arms, that kind of stuff. And you have approximately as many things in your network as you do concepts that you're trying to represent. So if all the data lives in, in essentially a vector space, if you guys remember your linear algebra, then everything can be represented as an orthogonal direction.

And there's this linear representation hypothesis that says that information is encoded in a direction, not in a magnitude, just in a direction. And if you have a small number of concepts, they can all be completely orthogonal. And if you take the dot product of a vector with any of your concepts, there will be no interference between concepts because they're all orthogonal to each other.

So if one is due east and one is due north, and you dot something with a canonical north vector to see whether or not north is present, adding more east-west changes nothing about the dot product in your north direction. Now, under the linear representation hypothesis, operations are additions.

There are no rotations in the linear representation hypothesis. So what you have to do is, if you have something east and you want to add north, you have to add a lot of north to make sure that you get north-northeast enough that your dot product with north is not close to zero anymore.

So the problem with LLMs is that we think there are hundreds of millions, if not billions, of concepts that an LLM needs to understand. And there are not enough neurons in the LLM, or rather, not enough space in the residual stream, to uniquely represent all of these concepts.

So you might have, um, a model dimension that's what, 16,000, 30,000, some, somewhere around there, right? In a big model. That's not nearly enough to represent hundreds of millions or billions of concepts, each with orthogonal directions. So then ultimately what ends up happening is the model takes advantage of sparsity and it says, well, if I represented basketball as north and the Eiffel Tower as east, and I represented ethylene glycol as northeast, the odds that we're going to have the Eiffel Tower and ethylene glycol in the same sentence are pretty small, same paragraph, same sentence, whatever.

So if I take the dot product against northeast, and either the Eiffel Tower or basketball shows up, I'm screwed; but the odds of them actually showing up at the same time are really small. So then that's the reason why you need an SAE, or in this case a transcoder: because you have more concepts than you have dimensions that you can just straight up analyze.
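A tiny numerical illustration of that point (the vectors and the concepts attached to them are of course made up):

```python
# With 2 dimensions you can give 2 concepts perfectly orthogonal directions;
# a 3rd concept has to reuse the space, so reading it off picks up interference.
import numpy as np

north = np.array([0.0, 1.0])                    # "basketball"
east = np.array([1.0, 0.0])                     # "Eiffel Tower"
northeast = np.array([1.0, 1.0]) / np.sqrt(2)   # "ethylene glycol", squeezed in

x = 2.0 * east                                  # activation containing only "Eiffel Tower"
print(x @ north)       # 0.0   -> orthogonal concepts don't interfere at all
print(x @ northeast)   # ~1.41 -> reading "ethylene glycol" off picks up noise
# Superposition gets away with this because "Eiffel Tower" and "ethylene
# glycol" rarely co-occur, so the interference rarely matters in practice.
```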

And so the cross-layer transcoder has a sparsity penalty, akin to an L1 loss if you're familiar with lasso regression. And that's what encourages it to represent each of these different concepts as a unique column, a unique neuron in the matrix, as it were, instead of the current representation where they're all just sort of jammed in there.

Yeah. Basically, when they train this, there are two things they train on. They use a sparsity penalty, which, if you've seen the other SAE work, enforces it to stay sparse, so you get single activations for concepts; and then a reconstruction loss. The reconstruction loss is so that at inference time, instead of actually running inference through Haiku, we can run a prompt through our CLT model.

So our local replacement model has essentially the same output as Haiku, or whatever you're training it on. So this model that we've trained matches it roughly one-to-one. Of course there's some degradation, but it's trained with a reconstruction loss, so it's trained to match the exact output of the big model you trained it on.

So technically, you should be able to swap it in directly. And a lot of this works because you're freezing the attention layers and you're specifically training it on a loss to recreate the outputs. And from there, that's where we have this model that now has these sparse features.
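A hedged sketch of those two training terms (not the paper's exact objective; the function name and penalty weight are placeholders):

```python
# Two terms: match the frozen model's MLP output (reconstruction), and keep
# the transcoder's feature activations sparse (an L1-style penalty).
import torch

def clt_loss(pred_mlp_out: torch.Tensor,
             true_mlp_out: torch.Tensor,
             feature_acts: torch.Tensor,
             sparsity_coeff: float = 1e-3) -> torch.Tensor:
    reconstruction = ((pred_mlp_out - true_mlp_out) ** 2).mean()
    sparsity = feature_acts.abs().sum(dim=-1).mean()
    return reconstruction + sparsity_coeff * sparsity
```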

But, um, yeah, thanks for that overview, Ted. It's, it's a little bit better for the math explanation of what's going on here, but, um, continuing through this, here's kind of what happened. So they have this, uh, reconstruction error. These are error nodes that happened between the original output and the replacement model.

Then they start to prune features that aren't relevant. Since the model is sparse, there are only a few features per token that actually activate at a given layer. So this is layer-wise activation, right? This is our local replacement model. So, for example, for the first layer here these three features activated, for this one these three, and for this one these two. They look through the traversal of what activated and what influenced the final output.

And then they prune, I think, 95% of the ones that didn't have an effect on the output. And now we can see, okay, what kind of activating features impact the output. From there, we can start to generate these attribution graphs. Attribution graphs kind of combine these concepts.
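A minimal sketch of that pruning step (the attribution scores here are random stand-ins; in the paper they come from linear attributions through the replacement model):

```python
# Rank feature nodes by how much they influence the output and keep only the
# top few percent ("prune ~95%") before building the attribution graph.
import numpy as np

rng = np.random.default_rng(0)
influence = np.abs(rng.normal(size=1000))        # stand-in attribution scores
keep_fraction = 0.05
threshold = np.quantile(influence, 1 - keep_fraction)
kept = np.where(influence >= threshold)[0]
print(f"kept {len(kept)} of {influence.size} nodes for the attribution graph")
```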

So for these two, for these hierarchical, um, categories, once we cluster them and, you know, add them on top of each other, what do they represent? So we can see what different features make up, um, these different tokens. So I didn't find this one to be the most, um, you know, interpretable because it's on a token split, but they have a lot of these features for different, um, different concepts, right?

So for example, for the word digital here, if we look at it, it's starting to activate once there's words like smartphones, television companies, there's another feature that takes it in a different representation, right? So, um, in this one, there's digital suicide, there's color image, you know, this is like a bit of a different understanding of the word digital.

In this one, there's tech director, right? There's a DVD, which is digital. In this case, there's, um, mobile devices, same thing for analytics. So web analytics, commercial analytics, this feature talks about data, quantitative assessments, all, all different features that, you know, all different features that represent analytics in different, in different, um, domains.

So in this case, there's, um, let's see which other ones make sense. So performance metrics are a way to analyze, to represent analytics, routines or analytics. Um, but yeah, they kind of start to group these features into these different things. Then it comes to, uh, how they construct it.

Basically, they have output nodes that are output tokens, and then they prune the ones that don't, um, really have anything. There's input and output nodes as well. And then we kind of have this whole interactive chart where you can play around with it. Um, they make it very interactive.

Um, um, they kind of explain what this chart is like. So, uh, for labeling features, you know, they, they say how there's different understandings for different, for the same concept. Um, I think that's enough on circuit tracing. If there's questions, we can dig a little deeper and we can always come back to it.

But at a high level, what we've done so far is, with a sparsity loss and a reconstruction loss, we've created a new local model, which is not small, by the way: the model has to have the same layers as the original model, and you kind of have to retrain it to match the output.

So this is not like cheap per se. It's pretty computationally expensive, but now we've been able to kind of peel back through different layers, what features kind of activate upon, uh, output. There's an interesting little section here that talks about how expensive this really is. So estimated compute requirements for CLT training to give a rough sense of compute requirements to train one.

We share estimated costs for CLTs based on the Gemma 2 series. So on a 2B model, to train 2 million features on a billion tokens, it takes about 210 H100-hours. On a 9B model, it takes almost 4,000 H100-hours, and that's for 5 million features on 3 billion tokens.

Now that's not cheap, right? Like this is 4,000 H100 hours. Most people don't have access to that. Um, but you know, they're able to do this on Haiku and then we go back into our main blog post of what features they found and what different little, um, interesting niches.

I'll take a little pause here and see if we have any questions on circuit tracing, what this CLT transcoder model is, um, any questions, any thoughts, any additions, any comments, just very high level. What we've done so far is we've retrained a model. It matches the layers. We call it the local replacement model.

It matches the layers of the original transformer. It freezes attention. It replaces the MLP, the feed-forward network, with this transcoder. And basically this transcoder is trained to reproduce the exact same outputs for the same inputs. And then we start to dig deeper into these little sparse features and start to map them.

They show how much the cost would then be for the big one. So in this paper, they train it on two models: an 18-layer language model and also Claude 3.5 Haiku. The Haiku one is a local replacement model that has 30 million features, and you can kind of extrapolate how expensive that would be.

But quick pause: any thoughts on circuit tracing, any questions? Otherwise we can continue. The next section is: let's start to look at some of these features and see what happened. They have a few different examples here: multi-step reasoning, planning in writing poems, multilingual features, medical diagnosis, refusals, and they start to do some stuff like clamping different features.

So they clamp in different features. So for example, in this "what's the capital of the state containing Dallas" prompt, if we take out Texas and throw in the feature for California, the model will now output Sacramento. Okay, questions. Why can't we just directly train circuits?

So you kind of are training the circuit. Circuit tracing is this transcoder; what you are training is this transcoder network, right? You keep attention frozen, you replace the MLP with it, but you're training this circuit. In terms of directly training on circuits, you're kind of messing with that feed-forward network, right?

Technically this is the exact same thing as our MLP layer, it's just that now you're forcing it to be sparse. We've trained a model to do the same thing, but if you train a model with a sparsity penalty from scratch, you probably won't get very far, right?

This is like, in my mind, it's similar to distillation where you can take a big model. You use a teacher forcing distillation loss to get a small model to mimic it. But that doesn't mean that you can just train a small model to be just as good. Um, okay.

If we predict a SMILES string, I wonder what concepts we can see. So there's a very deep interactive set of demos here of different input-output prompts, and you can see what features activate. So I found global weights... okay, well, we'll find it because it shows up again in the other one, but okay.

We'll start to go through the actual Biology of an LLM. So going through this: in this paper, we focus on applying attribution graphs to Claude 3.5 Haiku, which is Anthropic's lightweight model. So they have this introductory example of multi-step reasoning, then planning in poems, multilingual circuits, addition (where it shows how it does math), and medical diagnosis.

We'll start to go through the first three of these, and then I think we'll just open it up for people's thoughts, and we can dig through the rest as needed. So that brief overview is kind of the circuit tracing case study walking through this. Okay. They do talk a lot about limitations.

If anyone's interested in mech interp, they have a whole limitations section, a future work section, and open questions that they would expect people to work on. But remember, unlike SAEs, which you can do on one layer, this stuff is pretty compute intensive; these are pretty big models you're training. But it's always interesting stuff for people to work on.

Okay, method overview. This is just a high level of what we just talked about again. You freeze attention, and you change the MLP to the CLT model. Then we have feature visualization. They have these error nodes that they have to add in. This is the local replacement model.

So Texas capital is Austin. It goes through these different features. Okay. Um, they group these related nodes on a single layer into super nodes. So we have one, we have, um, graphs, right? So basically graph networks are kind of useful in this sense because each node is kind of a concept, but then the edges between them can go throughout layers, right?

So on a layer wise, they call these super nodes and they kind of stack them together. So in this case, let's look at the features that activate for the word capital. So, um, obviously terms like city, uh, buildings, uh, there's another feature for, I guess this is a multilingual one.

There's one for businesses, you know, capital, uh, cyber attacks that happen, venture capital. What else have we got? We've got states, we've got the concepts of the United state, France. So countries, um, now we've got another feature that, you know, it actually fires up when we talk about specifics.

So Connecticut, um, I think there's one here for languages as well, which was pretty interesting. So like capital letters, you know, um, of course a bunch more cities. Um, that's kind of the basic graph, right? So for Texas, we've got stuff like income tax, big, far, um, Austin, different things that Texas is like.

So these are kind of these super clusters. Um, this is their example of intervention. If they clamp down the feature of Texas, well now, you know, Texas capital, well, instead we're going to go through capital, say a capital, then we observe that if we take out Texas, it instead decides that Sacramento is pretty important.

It's, it's the capital that it decides to predict. So, uh, we can clamp down on these. Not sure. I understand why transformer attention KV matrices are needed to be frozen. It's needed to be frozen because they don't want to train more than what they need in the circuit tracing, right?

They're basically doing this sparsity loss. And once you start messing with attention and training in this objective, you're kind of going to mess stuff up, right? So all they're really trying to do in circuit tracing is just train this, um, this replacement layer. They're, they're just training these sparse transcoders.

They're not trying to mess with attention. Attention is a big part of the model, but perhaps if they unfroze it, we'd start to get a weird situation where now you have your randomly initialized weights in the mix, and that's not what we're trying to look at. The attention patterns are still kept and mapped, though.

Right. But yeah, that's why they freeze attention. Okay, continuing through this. This is their first example: let's see if we can see multi-step reasoning in Claude 3.5 Haiku. And this is not a thinking model; this is just a regular next-token prediction model.

How does it come to the output? So let's consider the prompt, uh, fact, the capital of the state containing Dallas is, and then of course, Haiku is pretty straightforward. It answers, uh, Austin. So this step, this question, this prompt takes two steps, right? First, you have to realize that it's asking about the state containing Dallas.

So, um, it's asking about the capital of the state containing Dallas. So first, what state is Dallas in? I have to think, okay, it's in Texas. Second, I have to think, what is the capital of Texas? It's Austin. So kind of two steps to this answer. Right now, the question is, does Claude actually do these two steps internally or does it kind of just pattern match shortcut?

Like it's been trained enough to just realize, oh, this is obviously just Austin. So let's peel back what happens at different layers. Let's see what features activate and see if we have any traces of these, this sort of thinking work, right? Does it have these two steps? Um, previous work has shown that there is evidence of genuine, of genuine multi-hop reasoning to various degrees, but let's do it with their attribution graph.

So here's kind of, um, what they visualize. So first we find several features for the word, the exact word capital. So the word capital has different features, right? So there's a business capital. There's all this, um, capital of different countries. There's these different features that they group together. They actually have cities as well.

So Berlin, Athens, Bangkok, Tokyo, Dublin, tops of buildings; there are several features. Okay, then there are output features. So landmarks in Texas: one feature activates on various landmarks, so there's a feature around a suburban district, the Texas history museum, some seafood place. We also find features that promote saying a capital.

Okay, features that promote the output of a capital generally: responding with a variety of US state capitals. This feature talks about different capitals, so headquarters, state capitals, various countries, Maryland, Massachusetts. But going through all that, here's kind of where we end up. So for "Fact: the capital of the state containing Dallas is", when we look at capital, here are its different meanings, plus state and Dallas.

Then when we go one level deeper, it looks like there's this supernode of "say a capital". "Say a capital" maps to capitals, crazy concept. Texas has examples of different things in Texas: Houston, Austin, San Antonio, plus features for various Texas-related snippets.

The attribution graph contains multiple interesting paths, which we summarize below. The Dallas feature, with some contribution from the state feature, activates a group of features that represent concepts related to the state of Texas. So Dallas and state together give features of Texas.

Kind of interesting, right? Dallas and state have features of Texas. In parallel, features activated by the word capital activate another cluster used to say the name of a capital. So the capital features feed into "say a capital" features, and the Texas features plus the "say a capital" features eventually lead to "say Austin".

So putting these two together, we have "say a capital" plus Texas leading to "say Austin". Then they start to do some of this clamping work. Clamping is pretty interesting, right? So if we look at the most probable prediction for "capital of the state containing Dallas", it's "say Austin": Austin is most likely.

If we take out this "say a capital" feature, we're left with capital, of, state, Texas. If we take out capital, it's just going to say Texas. If we take out Texas, it just has capital of state, Dallas, "say a capital", and then it's kind of confused, right?

So they show different outcomes as you take out stuff. If we take out capital, leaving state and Dallas, it still says Texas. If you take out state, it's still going to say Austin. So capital, Dallas, Texas still says Austin. From here, they start swapping in features.

So if we swap in California, the feature for California is pretty interesting, right? We see ferry building marketplace, um, universal studios, sea world. You have a bunch of features that activate for California. Uh, what else have we got here? different features outdoor San Jose. These are cities. So these are cities in California.

Stockton, Riverside, Oakland: these are more cities. And this one, the governor, Republican, so this is kind of the political feature for California.

So they change the prompt to "the capital of the state containing Oakland is", and they find a California feature, a super feature of California. Then they can clamp it back in: when they go back to the Dallas prompt and swap in our California feature, it says Sacramento.

They do it to Georgia. They say it says Atlanta, uh, British Columbia says Victoria. They find like the, the British Columbia feature has stuff like, you know, Canada and whatnot. If they heavily add in China, it says Beijing. So this is kind of their process of how do we find these super features?

Here's how we can find one: we change the prompt to Oakland, we find a group of features that represents California, we swap that back into our original Dallas prompt, and now we get Sacramento. We can do the same thing for other concepts and start to interpret this stuff.
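Mechanically, a clamping or feature-swap intervention looks roughly like the sketch below. This is a conceptual approximation, not Anthropic's tooling, and `model`, `layer_module`, and `california_direction` are all placeholders:

```python
# Conceptual sketch of a clamping intervention: add a scaled copy of a
# feature's decoder direction into the residual stream at one layer of the
# *original* model, then see how the completion changes.
import torch

def clamp_feature(layer_module, feature_direction: torch.Tensor, scale: float):
    """Register a forward hook that steers the residual stream along one feature."""
    def hook(module, inputs, output):
        # assumes `output` is the residual-stream tensor [batch, seq, d_model]
        return output + scale * feature_direction
    return layer_module.register_forward_hook(hook)

# Usage sketch (placeholder names):
#   handle = clamp_feature(model.layers[10], california_direction, scale=5.0)
#   print(model.generate("Fact: the capital of the state containing Dallas is"))
#   handle.remove()   # undo the intervention afterwards
```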

So that's kind of their, their first multi-step reasoning. So we can one, see that the model has this two level approach, right? So it first has to figure out, um, what state, then the capital of that state. And it's starting to do that. We can see that through the layers.

The second one is that we can start to clamp these features. Ted, do you want to pop in? Yeah, just a super quick thing. So they do all of this circuit analysis on the replacement model, because it's way easier to analyze the replacement model: it's smaller, it's linear, all that stuff.

But these experiments you show, where they replace Texas with California, those are done on the original LLM. That's super important. So they're not trying to prove the replacement works this way; they're trying to prove the original LLM works the same way as the replacement. And, like someone said in the chat, this could all be a bunch of BS, except that the intervention works on the original model.

So if you said that the ligament in my leg is connected to vision, and you cut that and I can't walk but I can still see perfectly fine, then your explanation is probably wrong. But if you say the optic nerve is really important for vision, and you cut that and suddenly I'm blind, but I can do everything else:

I can walk, I can taste, I can do everything else just fine. That's pretty strong support that the one thing you cut is a critical component for exactly what you said it was. Yeah, all this is still done on the original model. Someone's asking what layers generate these supernode features.

So there's super nodes across different layers, right? So, uh, this is throughout different layers. There's one for California here, there's Oakland at this level. So it's kind of throughout, they have a lot of interactive charts that you can play through this to go through different layers. These are just kind of the hand cherry picked examples and they acknowledge this as well.

They acknowledge that what they found is cherry picked and heavily biased towards what they thought was, you know, here's what we see. Here's what we should dig into. It's a bit of a limitation in the work, but nonetheless, it's still there. Um, another example that they show is, you know, uh, planning in poems.

So how does Claude 3.5 Haiku write a rhyming poem? Writing a poem requires satisfying two constraints at the same time: the lines need to rhyme, and they need to make sense. There are two ways a model could do this. One is pure improvisation, right?

The model could write the beginning of each line without regard for needing to rhyme at the end, and then the last word just kind of has to rhyme with the first line. Or there's this planning step, right?

So you can either just kind of start. And as you go, think of words that rhyme, or you can actually plan ahead. So this example tries to see, is there planning when, when I tell you to write a poem and I give you a word to start with, like, you know, write a poem and have something that rhymes with the word tape.

I forced you to have the first word, and then you can start generating words that rhyme with tape. Or, if I tell you to write a poem about something, you can plan in advance before the first word is written. So even though the models are trained to think one token at a time and predict the next token (outside of thinking models), we would assume the model would rely on pure improvisation, right?

It will just kind of do it on the fly. But the interesting thing here is they kind of find a planning mechanism per se in what happens. So specifically the model often activates features corresponding to candidate end of next line words prior to writing the line. So before like the net, before the rhyming word is predicted, even if it's at the end of the line, we can see traces of it starting to come up pretty early on.

So, for example, a rhyming couplet: "He saw a carrot and had to grab it." The second line comes out as "His hunger was a powerful habit" or "... like a starving rabbit." These words start to show up pretty early on. So first let's look at where these features come from.

What are the different features that form them? So for habit, there are different features: a mobile app that gamifies habit tracking, a habit tracker, habit formation, budgeting, rapid habit formation, discussing habits with doctors. So once again, they've got this concept of habit.

Let's see where it starts to come in. Before they get into it, they talk about prior work on planning in sequence models, and how their example adds to that body of evidence in several ways: they provide a mechanistic account for how words are planned, with both forward planning and backward planning.

The model holds multiple possible planned words in mind; we're able to edit the model's planned words; we discover the mechanism with an unsupervised, bottom-up approach; and the model represents planned words with ordinary features. Okay, planned words and their mechanistic role: we study how Claude completes the following prompt asking for a rhyming couplet.

The model's output sampling the most likely token is shown in bold. So, uh, this is kind of the input, a rhyming couplet. He saw a carrot and had to grab it. The output we get is, his hunger was like a starving rabbit. So model, the output is coherent. It makes sense and it rhymes, right?

So starving, rabbit, carrot: it all kind of rhymes there. To start, we focus on the last word of the second line and attempt to identify the circuit that contributed to choosing "rabbit". This makes sense, right? Rabbits like carrots, and "grab it" rhymes with "rabbit".

So there's kind of that two-step thing. Was it just the last token predicted or did we have some thought to it? Okay. So these are kind of the, the features. So it comma hunger was like starving. Okay. Let's, let's start to dig through this. So rhymes with, there's a feature here of rhymes with it, it sound, um, get, um, that they have features that activate across different languages and stuff that activate, uh, that, you know, have this sort of rhyming feature.

Then they have rabbit and habit coming up, a "say rabbit" feature, and then this feature for the "-t" sound, and then, cool, we got rabbit. What does this show? The attribution graph, computed by attributing back from the rabbit output node, shows an important group of features that activate on the newline token before the beginning of the second line, as well as features that activate over the "it" token.
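The kind of check being described can be sketched like this (placeholder tokens and random activations; in practice the activations would come from the replacement model):

```python
# List the features that fire most strongly at the newline position, i.e.
# before the second line of the couplet has been written at all.
import torch

tokens = ["grab", " it", ",", "\n"]              # end of the first line
feature_acts = torch.rand(len(tokens), 4096)     # [seq, n_features], stand-in values

newline_pos = tokens.index("\n")
top = torch.topk(feature_acts[newline_pos], k=5)
for feat_id, act in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {feat_id}: activation {act:.2f}")
# In the paper, the top features at the newline turn out to be "say rabbit" /
# "say habit" style candidates for the word that will end the *next* line.
```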

So basically, at the end of the first line, "grab it" had features that activated these different candidate rhyming words. The candidate completions in turn have positive edges to "say rabbit" features over the last token. So that's the hypothesis. We perform a variety of interventions on the newline planning site to see how it affects the probability of the last token.

Okay. So let's 10x down the word habit and we get different changes; 10x up and down on the newline, and different things affect different things. The results confirm the hypothesis that planning features strongly influence the final token. So if we suppress the planning features at the newline token, we can see that it's not doing this anymore.

Okay: planning features only matter at the planning location, planning words influence intermediate words, nothing too interesting here. Okay, a question: how does clamping on the transcoder map back to the transformer? Say we clamp Texas. So there's a question around the clamping stuff and how this is working.

The previous SAE thing that they put out in May, it explains how they do all these clamping features. Uh, basically same thing. There's more in here as well. In both of these papers, they kind of go into the math about it as well, but keeping it high level. Let's just kind of try to see, um, some more of these planned words.

So, um, yep, we can, we can sort of see as we take out different things, uh, we no longer have this planning step. Okay. I'm going to go quickly through the next few ones, ideally in the next like seven minutes, and then we'll leave the last 10 minutes for just questions and discussions on this.

So we just saw how there's pre-planning in poems for rhyming, this multi-step sort of thinking that happens throughout layers. Now we've got multilingual circuits. Modern networks have highly abstract representations and unified concepts across multiple languages, but we have little understanding of how these features fit into larger circuits.

Let's see how it, um, you know, how does it go through the exact same prompt in different languages? Are there features that fire that are consistent through different languages? Um, also fun fact, I guess rabbits don't eat carrots. Carrots are like treats. Crazy, crazy. Someone knows about, um, rabbits.

Okay. So the prompt is "the opposite of small is", and we would expect "big"; in French it's "grand", and in Chinese it's the corresponding character. Let's see if there's consistency across these features. The high-level story is that the features are the same: the model recognizes the prompt using a language-independent representation. Very interesting.

There's a language-independent representation. So this "say large" concept is something that activates across all three languages. Let's see some of the features. So large has a bunch of examples; there's a Spanish one in here, short arm and long arm.

It activates in this language, it activates in a numerical sense, it activates on small things. Great: this feature is kind of multilingually representing the word large. Same thing with antonyms. So there are just these high-level features that activate: for "the opposite of small" there's an antonym feature versus a synonym feature, a multilingual "small", and "say small", "say cold", "say large" features.

Very interesting. Editing the operation from antonym to synonym is another one; they can clamp this in, and they show how that works. Editing "small" to "hot" is another. Okay, editing the output language: that's another thing we can do. If we swap in the features for a different language, we get output in that language.

More circuits for French; I think you can go through this on your own time. "Do models think in English?" This is an interesting one. As researchers have begun to mechanistically investigate the multilingual properties of models, there's been tension in the literature. Researchers have found multilingual neurons and features and evidence of multilingual representations.

On the other hand, others have presented evidence that models use English representations. So what should we make of this conflicting evidence? It seems that Claude 3.5 Haiku is using genuinely multilingual features, especially in the middle layers. So in the middle layers, we see multilingual features.

There are important mechanistic ways in which English is privileged, though. For example, multilingual features have more significant direct weights to the corresponding English output nodes, while non-English outputs are more strongly mediated by language-specific features. So, kind of interesting: there's still a bit of an English bias, but there are definitely some inherently multilingual features there.

Okay, the next example is addition. We want to see how Claude adds two numbers like 36 plus 59. We found that it splits the problem into multiple pathways, computing the result at a rough precision in parallel with computing the ones digit of the answer, before reconciling these to get the correct answer.

We find a key step is performed by a lookup-table feature, very interesting: a lookup-table feature that translates between properties of the inputs. Okay, let's see what's going on. First they visualize the roles features play in addition problems, showing the activity of features on the equals token for prompts of the form "calc: a + b =".
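As a toy caricature of that decomposition (this is not a reimplementation of the circuit, just arithmetic that mirrors the description):

```python
# One path only has a fuzzy estimate of the sum's magnitude; another path acts
# like a lookup table on the ones digits ("6 + 9 ends in 5"); combining the
# two pins down the exact answer, 95.
import random

random.seed(0)
a, b = 36, 59

rough = a + b + random.randint(-4, 4)    # low-precision "the sum is about this big"
last_digit = (a % 10 + b % 10) % 10      # exact ones digit from a small lookup table

# the only number near the rough estimate that ends in the right digit
answer = next(n for n in range(rough - 4, rough + 6) if n % 10 == last_digit)
print(answer, a + b)                      # 95 95
```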

So there are addition features for "calculate a plus b equals", and they kind of have this lookup table of features. It's very interesting how it's doing attention. This one gets a little complex to walk through, and for the sake of time, I think that's enough of an overview.

They can, of course, mess with its math. Let's go on to the next one: medical diagnosis. This is a fun one. In recent years, researchers have explored medical applications for LLMs, for example, aiding clinicians in accurate diagnosis. So what happens? Thus, we are interested in whether our methods can shed light on the reasoning models perform internally in medical contexts.

We study an example scenario in which a model is presented with information about a patient and asked to suggest a follow-up question to inform diagnosis and treatment. This mirrors common medical practice. Okay, so let's see what happens. Human: a 32-year-old female, 30 weeks gestation, with mild headache, nausea.

If we can only ask about one symptom, what would we ask? Assistant: visual disturbances. So the model's most likely completion here is visual disturbances, which is a key indicator for this condition. We notice that the model activated a number of features that fire in the context of this condition in people.

So what are the features leading to this? Okay, their UI is struggling slightly. There's a bunch of features that come up here: gestational conditions, blood pressure, protein, stroke. Some of the other features were on synonyms of this condition, or activations in broader contexts. Kind of interesting, right?

So they do see this kind of internal understanding. They have more examples of this for different stuff. So, you know, if we could only ask stuff, it's whether he's experiencing chest pain in this one, whether there's a rash and they kind of go through what are some of the features that make this stuff up here.

Pretty interesting; I think you should check it out if interested. Okay, 10 minutes left. There are a lot more of these: clamping, entity recognition, refusals. But okay, I want to pause here and see if we have any other comments, questions, thoughts, or things we want to dig more into.

I'm going to check chat, but see if there's, um, yeah, if anyone has any stuff they want to dig into, let's feel free, you know, pop in. Could you do a quick overview of the hallucination section? Yeah. Let's just keep going. Um, so entity recognition and hallucination. So basically, hallucination is where you make up false information, right?

Hallucination is common when models are asked about obscure facts, because they like to be confident. As an example, consider this hallucination given by Haiku 3.5. The prompt is that this guy plays the sport of, and the completion is pickleball, which is a paddle ball sport that combines elements of other sports.

The behavior is reasonable given the model's training data: a sentence like this seems likely to be completed with the name of a sport, and without any information about who this guy is, the model says a plausible sport more or less at random. However, during fine-tuning, models are trained to avoid such behavior when acting in the assistant character, and this leads to responses like the following.

So base model Haiku without it's kind of, um, um, you know, RL chat tuning. It just completes this and says, Oh, the sentence sounds like a sport. I will give you a sport. Now, after their sort of training, what sport does this guy play answer in one word models like, Oh shit, I can't do that.

I don't know who this is. I need context. Given that hallucination is some sense of natural behavior, which is mitigated by fine tuning. We take a look at the the service, uh, the circuits that prevent models from hallucinating. So they're not really in this sense, looking at hallucination and what caused it.

They're looking at how they fixed it. So, uh, quick high level TLDR. We have base models. We do this RL or SFT and we convert them into chat models, right? In that we have this preference tuning. One of the things that they're trained to do is be a helpful assistant.

And that the objective is kind of, if you don't know what to say, you know, you tell them, you don't know, and you ask for more context. So base model would just complete tokens and be like, this guy plays pickleball. Cause it sounds like he plays a sport. Um, there's probably a famous Michael or two or, or Batkin that play pickleball.

The assistant model is like, yo, I don't know who this guy is, so let me ask for more information. But let's start to look at what features make that up. So hallucinations can be attributed to a misfire in this circuit. For example, when asking the model for papers written by a particular author, the model may activate some of these known-answer features even if it lacks knowledge of that author's specific papers.

This one is kind of interesting, right? Our results relate to recent findings from Ferrando et al., which used sparse autoencoders to find features that represent unknown entities. So, okay. Human: in which country is the Great Wall located? It says China. In which country is this based?

It's okay. So: known answer and unknown answer are different features, plus a default refusal circuit. There's a "can't answer" feature that broadly fires for Human/Assistant prompts. The picture suggests that the "can't answer" feature is activated by default for Human/Assistant prompts; in other words, the model is skeptical of the user by default.

So they show this "can't answer" feature. "Can't answer" features are also promoted by a group of unfamiliar names. So names that it doesn't recognize are, I guess, a feature, and these kind of just prompt it to say: I can't answer, I don't know.

Okay. Now what about the known answer circuit? So where does Mike, what sport does Michael Jordan play? He plays basketball. So there's a group of known answer and known entity features. These are what accidentally misfire when you get hallucination. That's a bit of a spoiler, but you know, uh, known answer is like different features that kind of, you know, what answer is, what country is this based in?

It knows Japan. What team does Devin Booker play on? It knows the answer. Where's the great wall located? These are kind of known internal facts. There's a feature for it. Once this fires, you're cooked, it's going to answer and you know, it'll hallucinate. Once this has gone off, there's strong evidence for that.

These graphs are kind of interesting; they show both sides. So this is the traversal throughout the layers in the RL'd model, right? We had this Assistant feature, because it's an Assistant prompt, and unknown name was a feature, and that leads to "can't answer", because this thing has been RL'd to not answer stuff it doesn't know.

So: can't answer, then "I apologize, I can't figure this out"; that's where the next tokens come up, because that shows up after it. What about something we do know, Michael Jordan? Oh, I know this answer. So first we have Assistant and Michael Jordan, and then in layer one, known answer.

Okay: known answer, "I know a bunch of these facts", say basketball, basketball-related features. Now let's once again do a bunch of fun clamping stuff, right? So if we have Michael Jordan and we have known answer as is, it says basketball.

What if we clamp down known answers? If we take that feature, we turn it down, even though the question is what sport does Michael Jordan play? We clamp down known answer. Well, it can't answer because the other one that fires up is unknown answer. Um, what sport does it play if we, if we still have strong known answer and we add in unknown name, it still says basketball, um, little stuff here to kind of go through, but that's kind of a high level of what's going on.

I thought the academic papers one was another interesting example. Same concept, but with this unknown-name feature. So: name a paper written by Andrej Karpathy. "One notable paper is ImageNet..." There's kind of the same thing: known answer, unknown name. If we change them up, what happens?

Um, pretty, pretty fun stuff, you know? Okay. That's kind of high level of what's happening in the hallucination. Refusals was another interesting one. It's kind of interesting to see some of the output to the base model and how their RL is like, you know, showing how this stuff works.

They have known entity and unknown entity features and say, I don't know features. The I don't know feature is much less interesting. The unknown feature relates to self knowledge. Okay. Yeah. Just interesting thoughts on this. Cool. Three more minutes. Any other fun thoughts, questions, comments? I would recommend reading through just examples of this.

Um, and if you haven't, the SAE one is pretty fun too. Here's kind of the limitations, what issues show up, um, discussion. What have we, what have we learned? Um, yeah, kind of high level. Interesting. The background of this is kind of this circuit tracing transcoder work. It's very interesting how they can just train a model with a reconstruction loss and just have it match the output because you know, these models are still only 30 million features, even though they have the same layers, it's still outputting the exact same outputs.

Kind of interesting. A question: do folks think the taxonomy of circuits, how circuits are divided, will likely converge to the same breakdown for every model, or will different models do things differently? I think different models might do things differently, right? Because this is a layer-wise understanding, and different models have different architectures, different layers, whether they're MoEs; they're also trained in different ways, right?

So the pre-training data set mixture kind of affects some of this. So what if you're trained on high value, you know, training data first, and then garbage at the end, you've probably got slop in your model, but you know, you might have different circuits that go throughout. Um, and then there's obviously some general variety.

They did, though, actually train it in this one on an 18-layer language model, just a general model trained on a couple billion tokens, and it still has the coherency you'd expect. This still goes back to early transformer findings: we have a basic understanding that early layers are more general and later layers are more niche and output-specific.

Okay, that's kind of time on the hour. I think next week and the week after we have a few volunteers. If Llama 4 drops a paper, we'll cover it, of course. But I think we have a few volunteers; we'll share in Discord what's coming soon.

If anyone wants to volunteer a paper, if anyone wants to, you know, follow up, please share. Thanks, Ted, for sharing insights as well, by the way. Um, I think we have a, don't we have a potential speaker for next week? Yeah, I thought you had one. I also have one.

Yeah, but mine moved back after your guy came in. Oh, okay. Okay, well, I'll share details in Discord. Oh, there are questions in the chat. Are there open-weight sparse autoencoders? Yes, I think Gemma trained some. Ooh, Gemma trained some. There are some layer-wise ones.

So there are some that have been done on, like, Llama 3 8B for each layer, but not throughout the whole model. But yeah, the transcoder is different; it's not a single-layer autoencoder. They do give a recipe and expected cost to do this yourself, though. Okay, let's continue the discussion in Discord then.

Thanks for attending guys. See you.