
Stanford CS25: V1 I Transformer Circuits, Induction Heads, In-Context Learning


Chapters

0:00
0:26 People mean lots of different things by "interpretability". Mechanistic interpretability aims to map neural network parameters to human understandable algorithms.
14:13 What is going on???
44:14 The Induction Pattern

Transcript

Thank you all for having me. It's exciting to be here. One of my favorite things is talking about what is going on inside neural networks, or at least what we're trying to figure out is going on inside neural networks. So it's always fun to chat about that. Oh, gosh, I have to figure out how to do things.

Okay. What? I won't. Okay, there we go. Now we are advancing slides, that seems promising. So I think interpretability means lots of different things to different people. It's a very broad term and people mean all sorts of different things by it. And so I wanted to talk just briefly about the kind of interpretability that I spend my time thinking about, which is what I'd call mechanistic interpretability.

So most of my work actually has not been on language models or on RNNs or transformers, but on understanding vision convnets and trying to understand how the parameters in those models actually map to algorithms. So you can think of the parameters of a neural network as being like a compiled computer program.

And the neurons are kind of like variables or registers. And somehow there are these complex computer programs that are embedded in those weights, and we'd like to turn them back into computer programs that humans can understand. It's a kind of reverse engineering problem. And so this is kind of a fun example that we found, where there was a car neuron, and you could actually see that the car neuron is constructed from, like, a wheel neuron.

And it looks for, in the case of the wheel neuron, it's looking for the wheels on the bottom. Those are positive weights and it doesn't want to see them on top. So it has negative weights there. And there's also a window neuron. It's looking for the windows on the top and not on the bottom.

And so what we're actually seeing there, right, is an algorithm. It's an algorithm that's saying, you know, well, a car has wheels on the bottom and windows on the top and chrome in the middle. And those are actually just the strongest neurons for that.

And so we're actually seeing a meaningful algorithm, and that's not an exception. That's sort of the general story: if you're willing to go and look at neural network weights and you're willing to invest a lot of energy in trying to reverse engineer them, there are meaningful algorithms written in the weights waiting for you to find them.

And there's a bunch of reasons I think that's an interesting thing to think about. One is, you know, just no one knows how to go and do the things that neural networks can do. Like no one knows how to write a computer program that can accurately classify ImageNet, let alone, you know, the language modeling tasks that we're doing.

No one knows how to directly write a computer program that can do the things that GPT-3 does. And yet somehow gradient descent is able to go and discover a way to do this. And I want to know what's going on. I want to know what it has discovered that it can do in these systems.

There's another reason why I think this is important, which is safety. So if we want to go and use these systems in places where they have big effects on the world, I think a question we need to ask ourselves is: what happens when these models have unanticipated failure modes, failure modes we didn't know to go and test for, to look for, to check for? How can we discover those things, especially if they're really pathological failure modes?

Where the model is, in some sense, deliberately doing something that we don't want. Well, the only way that I really see that we can do that is if we can get to a point where we really understand what's going on inside these systems. So that's another reason that I'm interested in this.

Now, actually doing interpretability on language models and transformers is new to me. Before this year, I spent like eight years working on trying to reverse engineer convnets and vision models. And so the ideas in this talk are new things that I've been thinking about with my collaborators.

And we're still probably a month or two out, maybe longer, from publishing them. And this was also the first public talk that I've given on it. So, you know, the things that I'm going to talk about are, I think, honestly, still a little bit confused for me, and definitely are going to be confused in my articulation of them.

So if I, if I say things that are confusing, um, you know, please feel free to ask me questions. There might be some points for me to go quickly because there's a lot of content. Um, but definitely at the end, I will be available for a while to chat about this stuff.

And, yeah, I also apologize if I'm unfamiliar with Zoom and make mistakes. But, yeah. So with that said, let's dive in. I wanted to start with a mystery before we go and try to actually dig into, you know, what's going on inside these models.

I wanted to motivate it with a really strange behavior that we discovered and wanted to understand. And by the way, I should say all this work is done with my colleagues at Anthropic, and especially my colleagues Catherine and Nelson.

Okay. So onto the mystery. I think probably the most interesting and most exciting thing about transformers is their ability to do in-context learning, or sometimes people will call it meta learning. You know, the GPT-3 paper goes and describes things as, you know, language models are few-shot learners.

Like there's lots of impressive things about GPT-3, but they choose to focus on that. And, you know, now everyone's talking about prompt engineering. And Andrej Karpathy was joking about how, you know, software 3.0 is designing the prompt. And so there's the ability of language models, of these large transformers, to respond to their context and learn from their context and change their behavior in response to their context.

You know, it really seems like probably the most surprising and striking and remarkable thing about them. And some of my colleagues previously published a paper that has a trick in it that I really love, which is, so we're all used to looking at learning curves.

You train your model and you, you know, as your model trains, the loss goes down. Sometimes it's a little bit discontinuous, but it goes down. Another thing that you can do is you can go and take a fully trained model and you can go and ask, you know, as we go through the context, you know, as we go and we predict the first token and then the second token and the third token, we get better at predicting each token because we have more information to go and predict it on.

So, you know, for the first token, the loss should be the entropy of the unigrams, and then the next token should be the entropy of the bigrams, and it keeps falling and it keeps getting better. And in some sense, that's the model's ability to do in-context learning: the ability to be better at predicting later tokens than you are at predicting early tokens.

That is, that is in some sense, a mathematical definition of what it means to be good at this magical in context, learning or meta learning that, that these models can do. And so that's kind of cool because that gives us a way to go and look at whether models are good at, at in context learning.
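
As a rough sketch of what this "in-context learning curve" measurement looks like in code: average the next-token loss at each position across a batch of sequences. Everything here is a toy stand-in (random logits and tokens), just to show the shapes involved; nothing about it is the actual model from the talk.

```python
# Per-position loss ("in-context learning curve"): a toy numpy sketch, not the
# speaker's code. With a real model, later positions would have lower loss.
import numpy as np

def per_position_loss(logits, tokens):
    """logits: [batch, seq, vocab]; tokens: [batch, seq]. Returns mean loss per position."""
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # log-softmax
    # Position t predicts token t+1.
    nll = -np.take_along_axis(log_probs[:, :-1], tokens[:, 1:, None], axis=-1)[..., 0]
    return nll.mean(axis=0)

rng = np.random.default_rng(0)
batch, seq, vocab = 8, 512, 100
logits = rng.normal(size=(batch, seq, vocab))       # stand-in for model outputs
tokens = rng.integers(0, vocab, size=(batch, seq))  # stand-in for token ids
curve = per_position_loss(logits, tokens)
print(curve[:3], curve[-3:])
```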

Chris, uh, if I could just ask a question, like a clarification question, when you say learning, there are no actual parameter updates that is the remarkable thing about in context learning, right? So yeah, indeed, we traditionally think about neural networks as learning over the course of training by going and modifying their parameters, but somehow models appear to also be able to learn in some sense.

If you give them a couple of examples in their context, they can then go and do that later in their context, even though no parameters changed. And so it's some kind of quite different notion of learning, as you're gesturing at. Uh, okay.

I think that's making more sense. So, I mean, could you also just describe in context learning in this case as conditioning, as in like conditioning on the first five tokens of a 10 token sentence or the next five tokens? Yeah. I think the reason that people sometimes think about this as in context learning or meta learning is that you can do things where you like actually take a training set and you embed the training set in your context.

Like if you give it just two or three examples, then suddenly your model can go and do this task. And so you can do few-shot learning by embedding things in the context. Yeah, the formal setup is that you're just conditioning on this context. And it's just that somehow this ability, I mean, I guess really the history of this is that we started to get good at neural networks learning, right?

And we could go and train vision models and language models that could do all these remarkable things. But then people started to be like, well, you know, these systems take so many more examples than humans do to go and learn. How can we go and fix this?

And we had all these ideas of meta learning develop, where we wanted to go and train models explicitly to be able to learn from a few examples, and people developed all these complicated schemes. And then the truly absurd thing about transformer language models is that without any effort at all, we get this for free: you can go and just give them a couple of examples in their context and they can learn in their context to go and do new things.

I think that was, in some sense, the most striking thing about the GPT-3 paper. And so this ability to have just conditioning on a context give you new abilities for free, and the ability to generalize to new things, is in some sense, to me, the most striking and shocking thing about transformer language models.

That makes sense. I mean, I guess from my perspective, I'm trying to square the notion of learning in this case with, you know, if you or I were given a prompt of like one plus one equals two, two plus three equals five as the sort of few-shot setup.

And then somebody else put, you know, like five plus three equals, and we had to fill it out. In that case, I wouldn't say that we've learned arithmetic because we already sort of knew it, but rather we're just sort of conditioning on the prompt to know what it is that we should then generate.

Right. But it seems to me like that's, yeah. I think that's on a spectrum, though, because you can also go and give completely nonsensical problems that the model would never have seen, like "mimic this function", and give a couple of examples of a function the model's never seen before.

And it can go and do that later in the context. And I think what you did learn in a lot of these cases, so you might not have learned arithmetic, like you might have had some innate faculty for arithmetic that you're using, but you might have learned, oh, okay,

right now we're doing arithmetic problems. In any case, I agree that there's an element of semantics here. Yeah, no, this is helpful though, just to clarify exactly what you mean by in-context learning. Thank you for walking through it. Of course.

So something that's, I think, really striking about all of this, um, is well, okay. So we, we've talked about how we can, we can sort of look at the learning curve and we can also look at this, this in-context learning curve, but really those are just two slices of a two-dimensional, uh, space.

So in some sense, the more fundamental thing is how good we are at predicting the nth token at a given point in training, and something that you'll notice if you look at this: when we talk about the loss curve, we're just talking about, if you average over this dimension, if you average like this and project onto the training step, that's your loss curve.

And the thing that we are calling the in-context learning curve is just this line, down at the end here. And something that's kind of striking is there's this discontinuity in it. Like there's this point where the model seems to get radically better, in a very short span of training, at predicting late tokens.

So it's not that different at early token positions, but at late token positions, suddenly you get better. And a way that you can make this more striking is you can take the difference between your ability to predict the 50th token and your ability to predict the 500th token.

You can subtract the 50th token loss from the 500th token loss. And what you see is that over the course of training, you know, you're not very good at this, and you get a little bit better, and then suddenly you have this cliff, and then you never get better.

The difference between these, at least, never gets better. So the model gets better at predicting things, but its ability to predict late tokens better than early tokens never improves after that. And so in the span of just a few hundred steps of training, the model has gotten radically better at its ability to do this kind of in-context learning.
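
To make the quantity concrete: the in-context learning score described here is just the loss at the 500th token minus the loss at the 50th token, and you can track it across training checkpoints to see the cliff. A small sketch, with a fake placeholder curve standing in for a real per-position loss curve:

```python
# Sketch only: in-context learning score = loss(500th token) - loss(50th token).
# More negative means the model is much better at late tokens than early ones.
import numpy as np

def in_context_learning_score(curve, early=50, late=500):
    return float(curve[late - 1] - curve[early - 1])

fake_curve = 5.0 - 0.4 * np.log1p(np.arange(512))  # placeholder per-position losses
print(in_context_learning_score(fake_curve))       # tracked over checkpoints -> the cliff
```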

And so you might ask, you know, what's going on at that point? And this is just one model, but, well, first of all, it's worth noting this isn't a small change. We don't think about this very often, but often we just look at loss curves and we're like, did the model do better than another model or worse than another model.

But you can think about this in terms of nats, which is just the information-theoretic quantity, and you can convert that into bits. And so one way you can interpret this is that 0.4 nats is about 0.5 bits, which is roughly like, every other token, the model gets to go and sample twice and pick the better one.

It's actually even stronger than that, but that's sort of an underestimate of how big a deal getting better by 0.4 nats is. So this is a really big difference in the model's ability to go and predict late tokens. And we can visualize this in different ways.
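
Spelling out the arithmetic behind that intuition (just the unit conversion, nothing model-specific):

```python
# 0.4 nats converted to bits, plus the "best of two samples" intuition: getting to pick
# the better of two samples is worth at most 1 bit on that token, so ~0.5-0.6 bits per
# token is roughly like getting that choice every other token.
import math
print(0.4 / math.log(2))  # ~0.577 bits per token
```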

We can also go and ask, you know, how much better are we getting at predicting later tokens, and look at the derivative. And then we can see very clearly that there's some kind of discontinuity in that derivative at this point. And we can take the second derivative then as well, the derivative with respect to training.

And now we see that there's like, there's very, very clearly this, this, this line here. So something in just the span of a few steps, a few hundred steps is, is causing some big change. Um, we have some kind of phase change going on. Um, and this is true across model sizes.

Um, uh, you can, you can actually see it a little bit in the loss curve and there's this little bump here, and that corresponds to the point where you have this, you have this change. We, we actually could have seen in the loss curve earlier too. It's, it's this bump here, excuse me.

So we have this phase change going on, and there's, I think, a really tempting theory to have, which is that this change in the model's output and its behavior, in its sort of outward-facing properties, presumably corresponds to some kind of change in the algorithms that are running inside the model.

So if we observe this big phase change in the model's behavior, especially in a very small window, presumably there's some change in the circuits inside the model that is driving it. At least that's a natural hypothesis. So if we want to ask that, though, we need to be able to understand, you know, what are the algorithms running inside the model?

How can we turn the parameters in the model back into this algorithm? So that's going to be our goal. Now, it's going to require us to cover a lot of ground in a relatively short amount of time. So I'm going to go a little bit quickly through the next section, and I will highlight the key takeaways, and then I will be very happy to explore any of this in as much depth as you want.

I'm free for another hour after this call, and just happy to talk in as much depth as people want about the details of this. So, it turns out the phase change doesn't happen in a one-layer attention-only transformer, and it does happen in a two-layer attention-only transformer. So if we could understand a one-layer attention-only transformer and a two-layer attention-only transformer, that might give us a pretty big clue as to what's going on.

So we're attention-only. We're also going to leave out layer norm and biases to simplify things. So, one way you could describe an attention-only transformer is: we're going to embed our tokens, and then we're going to apply a bunch of attention heads and add them into the residual stream, and then apply our unembedding, and that'll give us our logits.

And we could go and write that out as equations if we want, multiply it by an embedding matrix, apply attention heads, and then compute the logits for the unembedding. Um, and the part here that's a little tricky is, is understanding the attention heads. And this might be a somewhat conventional way of describing attention heads.

And it actually kind of obscures a lot of the structure of attention heads. I think that oftentimes we make attention heads more complex than they are. We sort of hide the interesting structure. So what is this saying? Well, it's saying, you know, for every token, compute a value vector, and then go and mix the value vectors according to the attention matrix, and then project them with the output matrix back into the residual stream.

So there's another notation, which you could think of as using tensor products, or as left and right multiplying; there are a few ways you can interpret this, but I'll just try to explain what this notation means.

So X, our residual stream, has a vector for every single token, and this means go and independently multiply the vector for each token by W_V. So compute the value vector for every token. This one, on the other hand, notice that A is now on the left-hand side.

It means go and multiply by the attention matrix, or go and take linear combinations of value vectors. So don't change the value vectors pointwise, but mix them together according to the attention pattern, creating a weighted sum. And then again, independently for every position, go and apply the output matrix.

And you can apply the distributive property to this, and it reveals that it actually didn't matter that you did the attention in the middle. You could have done the attention at the beginning, you could have done it at the end. That's independent. And the thing that actually matters is that there's this W_V W_O matrix, which describes what information the attention head reads from each position and how it writes it to its destination.

Whereas A describes which positions we read from and write to. And that's getting at more of the fundamental structure of attention: an attention head moves information from one position to another, and the choice of which position gets moved from and to is independent of what information gets moved.
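
Here's a tiny numpy check of that factorization claim (a toy of my own, not the speaker's code): the attention pattern A acts across positions, W_V W_O acts on each position's vector, and the two commute.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_model, d_head = 6, 16, 4
X = rng.normal(size=(n_pos, d_model))   # residual stream, one row per token
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
A = rng.random(size=(n_pos, n_pos)); A /= A.sum(-1, keepdims=True)  # stand-in attention pattern

out_standard = (A @ (X @ W_V)) @ W_O   # values, then mix across positions, then project
out_factored = A @ (X @ (W_V @ W_O))   # same thing: "where" (A) and "what" (W_V W_O) are independent
print(np.allclose(out_standard, out_factored))  # True
```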

And if you rewrite your transformer that way, well, first we can go and write the sum of attention heads just in this form, and then we can go and write the entire layer by adding in an identity.

And if we go and plug that all into our transformer and expand, we have to multiply everything through, and we get this interesting equation. So we get this one term that corresponds to just the path directly through the residual stream, and it's going to want to store bigram statistics.

It's just, you know, all it gets is the previous token, and it tries to predict the next token. And so it gets to try to store bigram statistics. And then for every attention head, we get this term: we have the attention pattern.

That describes which token looks at which token. And we have this matrix here, which describes, for every possible token you could attend to, how it affects the logits. And that's just a table that you can look at. It just says, you know, for this attention head, if it looks at this token, it's going to increase the probability of these tokens, in a one-layer attention-only transformer.

That's all there is. Yeah, so this is just the interpretation I was describing. And another thing that's worth noting is that, according to this, the attention-only transformer is linear if you fix the attention pattern. Now, of course, the attention pattern isn't fixed, but whenever you have the opportunity to make something linear, linear functions are really easy to understand.
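
As a toy illustration of that expanded form (my own sketch, with random stand-in weights and a random stand-in attention pattern rather than one computed from a QK circuit): the logits are a direct bigram-ish path plus, for each head, an attention pattern applied to a token-to-logit table. With the attention patterns held fixed, everything here is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_head, n_heads, n_pos = 50, 16, 4, 2, 8
W_E = rng.normal(size=(vocab, d_model))   # embedding
W_U = rng.normal(size=(d_model, vocab))   # unembedding
heads = [dict(W_V=rng.normal(size=(d_model, d_head)),
              W_O=rng.normal(size=(d_head, d_model))) for _ in range(n_heads)]

tokens = rng.integers(0, vocab, size=n_pos)
X = W_E[tokens]                # embed
logits = X @ W_U               # direct path through the residual stream: bigram statistics
for h in heads:
    A = rng.random(size=(n_pos, n_pos)); A /= A.sum(-1, keepdims=True)  # stand-in pattern
    token_to_logit = W_E @ h["W_V"] @ h["W_O"] @ W_U   # "if I attend to this token, which logits go up?"
    logits = logits + A @ token_to_logit[tokens]       # each head's contribution
print(logits.shape)  # (n_pos, vocab)
```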

And so if you can fix a small number of things and make something linear, that's actually a lot of leverage. Okay. And yeah, we can talk about how the attention pattern is computed as well. If you expand it out, you'll get an equation like this, and, well, I think it'll be easier to just describe it.

Okay. I think the core story to take away from all of this is that we have these two matrices that actually look kind of similar. So this one here tells you, if you attend to a token, how are the logits affected? You can just think of it as a giant matrix of, for every possible input token, how are the logits affected by that token?

Are they made more likely or less likely? And we have this one, which sort of says, how much does every token want to attend to every other token? One way that you can picture this is: there are really three tokens involved when we're thinking about an attention head. We have the token that we're going to move information to, and that's attending backwards.

We have the source token that's going to get attended to, and we have the output token whose logits are going to be affected. And you can just trace through this. So you can ask, how does attending to this token affect the output? Well, first we embed the token.

Then we multiply by W_V to get the value vector, the information gets moved by the attention pattern, we multiply by W_O to add it back into the residual stream, and then it gets hit by the unembedding and we affect the logits. And that's where that one matrix comes from. And we can also ask, you know, what decides whether a token gets a high score when we're computing the attention pattern.

And it just says: embed the token, turn it into a query; embed the other token, turn it into a key; and dot product them. And that's where those two matrices come from. So I know that I'm going quite quickly. Maybe I'll just briefly pause here.

And if anyone wants to ask for clarifications, this would be a good time. And then we'll actually go and reverse engineer things, because everything that's going on in a one-layer attention-only transformer is now in the palm of our hands. It's a very toy model. No one actually uses one-layer attention-only transformers, but we'll be able to understand the one-layer attention-only transformer.
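
Before the questions, here is roughly what those two objects look like as code (again a toy with random weights and a hypothetical token index, not the real model): the QK circuit is one vocab-by-vocab table of "who wants to attend to whom", and the OV circuit is one vocab-by-vocab table of "if I attend to this token, which logits go up".

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_model, d_head = 50, 16, 4
W_E = rng.normal(size=(vocab, d_model))
W_U = rng.normal(size=(d_model, vocab))
W_Q, W_K = rng.normal(size=(d_model, d_head)), rng.normal(size=(d_model, d_head))
W_V, W_O = rng.normal(size=(d_model, d_head)), rng.normal(size=(d_head, d_model))

QK = (W_E @ W_Q) @ (W_E @ W_K).T   # [destination token, source token] attention preference
OV = W_E @ W_V @ W_O @ W_U         # [source token, output token] effect on logits

src = 7  # pretend this index is the token " perfect"
print("tokens that most want to attend to it:", np.argsort(-QK[:, src])[:5])
print("logits most increased when it's attended to:", np.argsort(-OV[src])[:5])
```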

So just to be clear, you're saying that the query-key circuit is learning the attention weights, and is essentially responsible for computing the attention between different tokens? Yeah. Yeah. So this matrix, yeah. And, you know, all three of those parts are learned, but that's what expresses the attention pattern.

Yeah. That's what generates the attention patterns; it gets run for every pair of tokens. And you can think of the values in that matrix as just being how much every token wants to attend to every other token, if it were in the context. We're using positional embeddings here.

So there's a little bit that we're sort of glossing over there as well, but in a global sense, it's how much every token wants to attend to every other token. Right. Yeah. And the other circuit, the output-value circuit, is using the attention that's calculated to, yes,

affect the final outputs. It's sort of saying: assume the attention head attends to some token, and set aside the question of how that gets computed. Just assume that it attends to some token; how would it affect the outputs if it attended to that token?

And you can just, you can just calculate that. Um, it's just a, a big table of values that says, you know, for this token, it's going to make this token more likely, this token will make this token less likely. Okay. Okay. And it's completely independent. Like it's just two separate matrices.

They're not, you know, the formulas might make them seem entangled, but they're actually separate. Right. To me, it seems like the supervision is coming from the output-value circuit, and the query-key circuit seems to be more like an unsupervised kind of thing, because there's no...

Hmm. I mean, I think in the sense that, in a model, every neuron's signal is somehow downstream from the ultimate training signal. And so the output-value circuit is perhaps getting more direct signal.

Correct. But yeah, we will be able to dig into this in as much detail as you want in a little bit. So maybe I'll push forward. And I think an example of how to use this to reverse engineer a one-layer model will maybe make it a little bit more motivated.

Okay. So just to emphasize this, there are three different tokens that we can talk about. There's the token that gets attended to, there's the token that does the attending, call it the destination, and then there's the token that gets affected, the next token, whose probabilities are affected.

And so something we can do is notice that the only token that connects to both of these is the token that gets attended to. So these two are sort of bridged by their interaction with the source token. So something that's kind of natural is to ask, for a given source token, how does it interact with both of these?

So let's take, for instance, the token "perfect". One thing we can ask is which tokens want to attend to "perfect". Well, apparently the tokens that most want to attend to "perfect" are "are" and "looks" and "is" and "provides". So "are" is the most, "looks" is the next most, and so on.

And then when we attend to "perfect" (and this is with one single attention head, so it would be different for a different attention head), it wants to really increase the probability of "perfect", and then to a lesser extent "super" and "absolute" and "cure". And we can ask, you know, what sequences of tokens are made more likely by this particular set of things wanting to attend to each other and becoming more likely.

Well, things of the form: we have our token that we attend back to, then some skip of some number of tokens (they don't have to be adjacent), but then later on we see the token "are", and it attends back to "perfect" and increases the probability of "perfect".

So you can think of these as changing the probability of what we might call skip trigrams, where we skip over a bunch of tokens in the middle, but we're really affecting the probability of trigrams.

So "perfect … are perfect", "perfect … looks super". We can look at another one. So we have the token "large"; the tokens "contains", "using", "specify" want to go and look back to it and increase the probability of "large" and "small". And the skip trigrams that are affected are things like "large … using large", "large … contains small", and things like this.

If we see the number "2", we increase the probability of other numbers, and we affect skip trigrams like "2 … 1 2" and "2 … has 3". Now, you're all in a technical field, so you'll probably recognize this one. We have "lambda", and then we see a backslash, and then we want to increase the probability of "lambda" and "sorted" and "operator".

So it's all LaTeX. If it sees "lambda", it thinks, you know, maybe next time I see a backslash, I should go and put in some LaTeX math symbol. Also, same thing for HTML: we see "nbsp" for non-breaking space, and then we see an ampersand, and we want to go and make that more likely.

And then we see an ampersand. We want to go and make that more likely. The takeaway from all of this is that a one-layer attention only transformer is totally acting on these skip trigrams. Um, every, everything that it does, I mean, I guess it also has this pathway by which it affects by grams, but mostly it's just affecting these skip trigrams.

And there are lots of them. It's just these giant tables of skip trigrams that are made more or less likely. There are lots of other fun things it does. Sometimes the tokenization will split up a word in multiple ways. So, like, we have "indie", uh, well, that's not a good example.

We have like the word "Pike", and then we see the token "P" and we predict "ike", and we predict "ikes" and stuff like that. Or, these ones are kind of fun, maybe they're actually worth talking about for a second. So we see the token "Lloyd", and then we see an "L" and maybe we predict "loyd", or an "R" and we predict "alph" (Ralph), or a "C" and we predict "atherine" (Catherine).

But we'll see in a second, well, yeah, we'll come back to that in a sec. So we increase the probability of things like "Lloyd … L loyd" and "Lloyd … C atherine". Or "pixmap": if anyone's worked with Qt, we see "pixmap" and we increase the probability of "pixmap" again, but also "QCanvas".

Yeah, but of course there's a problem with this, which is that it doesn't get to pick which one of these goes with which one. So if you want to go and make "pixmap … pixmap" and "pixmap … QCanvas" more probable, you also have to go and make "pixmap … PCanvas" more probable.

And if you want to make "Lloyd … L loyd" and "Lloyd … C atherine" more probable, you also have to make "Lloyd … C loyd" and "Lloyd … L atherine" more probable. And so there are actually bugs that transformers have, at least these really tiny one-layer attention-only transformers; there are these bugs that seem weird until you realize that it's this giant table of skip trigrams that's operating.

And the nature of that is that it sort of forces you, if you want to go and do this, to also make some weird predictions. Is there a reason why the source tokens here have a space before the first character?

Yes. That's just, I was giving examples where the tokenization breaks in a particular way, and because spaces get included in the tokenization, when there's a space in front of something, and then there's an example where the space isn't in front of it, they can get tokenized in different ways.

Got it. Cool. Thanks. Yeah, great question. Okay. So just to abstract away some common patterns that we're seeing, I think one pretty common thing is what you might describe as like [B] … [A] [B]. So you see some token, and then later you see another token that might precede that token.

And then you're like, ah, probably the token that I saw earlier is going to occur again. Or sometimes you predict a slightly different token. So maybe an example of the first one is "2 … 1 2", but you could also do "2 … has 3". And so "3" isn't the same as "2", but it's kind of similar.

So that's one thing. Another one is this example where you have something that's tokenized together one time and split apart another time. So you see the token, and then you see something that might be the first part of the token, and then you predict the second part.

I think the thing that's really striking about this is that these are all, in some ways, really crude kinds of in-context learning. And in particular, these models get about 0.1 nats rather than about 0.4 nats of in-context learning, and they never go through the phase change.

So they're doing some kind of really crude in-context learning, and also they're dedicating almost all their attention heads to this kind of crude in-context learning. So they're not very good at it, but they're dedicating their capacity to it. I'm noticing that it's 10:37.

I want to just check how long I can go, because maybe I should super accelerate. Of course. I think it's fine, because students are also asking questions in between, so you should be good. Okay. So maybe my plan will be that I'll talk until like 10:55 or 11.

And then, if you want, I can go and answer questions for a while after that. Yeah, that works. Fantastic. So you can see this as a very crude kind of in-context learning. Basically what we're saying is it's sort of all of this labor of, okay, well, I saw this token, so probably the same token or similar tokens are more likely to go and occur later.

And look, this is an opportunity that sort of looks like I could inject the token that I saw earlier; I'm going to inject it here and say that it's more likely. That's basically what it's doing, and it's dedicating almost all of its capacity to that. So, you know, it's sort of the opposite of what we thought with RNNs in the past. It used to be that everyone was like, oh, you know, with RNNs it's so hard to care about long-distance contexts.

You know, maybe we need to go and use some special mechanism or something. No, if you train a transformer and you give it a long enough context, it dedicates almost all of its capacity to this type of stuff. Just kind of interesting. There are some attention heads which are more primarily positional.

Usually, you know, the model that I've been training, it's only a one-layer model, has 12 attention heads, and usually around two or three of those will become these more positional, sort of shorter-term things that do something more like local trigram statistics, and then everything else becomes these skip trigrams.

Yeah, so some takeaways from this. You can understand one-layer attention-only transformers in terms of these OV and QK circuits. Transformers desperately want to do in-context learning. They desperately, desperately want to go and look at these long-distance contexts and go and predict things.

There's just so much entropy that they can go and reduce out of that. The constraints of a one-layer attention-only transformer force it to have certain bugs if it wants to do the right thing. And if you freeze the attention patterns, these models are linear. Okay, a quick aside, because so far this type of work has required us to do a lot of very manual inspection.

Like, we're looking through these giant matrices. But there's a way that we can escape that: we don't have to look at these giant matrices if we don't want to. We can use eigenvalues and eigenvectors. So recall that an eigenvalue and eigenvector pair just means that if you multiply that vector by the matrix, it's equivalent to just scaling it.

And often, in my experience, this hasn't been very useful for interpretability, because we're usually mapping between different spaces. But if you're mapping onto the same space, eigenvalues and eigenvectors are a beautiful way to think about this. So we're going to draw them on a radial plot.

And we're going to have a log radial scale, because their magnitudes are going to vary by many orders of magnitude. Okay. So our OV circuit maps from tokens to tokens; that's the same vector space on the input and the output.

And we can ask, you know, what does it mean if we see eigenvalues of a particular kind? Well, positive eigenvalues, and this is really the most important part, mean copying. So if you have a positive eigenvalue, it means that there's some set of tokens where, if you see them, you increase their probability.

And if you have a lot of positive eigenvalues, you're doing a lot of copying. If you only have positive eigenvalues, everything you do is copying. Now, imaginary eigenvalues mean that you see a token and then you want to go and increase the probability of unrelated tokens.

And finally, negative eigenvalues are anti-copying. They're like, if you see this token, you make it less probable in the future. Well, that's really nice, because now we don't have to go and dig through these giant matrices that are vocab size by vocab size. We can just look at the eigenvalues.

And so these are the eigenvalues for our one-layer attention-only transformer. And we can see that, you know, for many of these, they're almost entirely positive. These ones are sort of entirely positive, these ones are almost entirely positive, and then really these ones are even almost entirely positive.

And there are only two that have a significant number of imaginary and negative eigenvalues. And so what this is telling us, just in one picture, is that, okay, really 10 out of 12 of these attention heads are just doing copying.

They're just doing this long-distance, you know, "I saw a token, probably it's going to occur again" type of stuff. That's kind of cool. We can summarize it really quickly. OK. Now, the other thing, yeah, we're going to look at a two-layer model in a second.

And we'll see that a lot of its heads are also doing this kind of copying-ish stuff; they have large positive eigenvalues. You can do a histogram. One thing that's cool is you can just add up the eigenvalues and divide by the sum of their absolute values.

And you've got a number between zero and one, or really between negative one and one, which is like how copying-ish the head is. And you can just do a histogram, and you can see, oh yeah, almost all of the heads are doing lots of copying.
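
That summary statistic is simple enough to sketch directly (toy matrices, my own names): the OV circuit maps token space to itself, so take its eigenvalues, sum them, and divide by the sum of their magnitudes. A head whose eigenvalues are all positive scores near +1, meaning it mostly copies.

```python
import numpy as np

def copying_score(OV):
    eigvals = np.linalg.eigvals(OV)   # complex in general; conjugate pairs sum to a real number
    return float(np.real(eigvals.sum()) / np.abs(eigvals).sum())

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
print(copying_score(M @ M.T))  # all positive eigenvalues -> 1.0 ("pure copying")
print(copying_score(M))        # random matrix -> near 0
```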

You know, it's nice to be able to go and summarize your model like this. And I think we've gone about this in a very bottom-up way: we didn't start with assumptions about what the model was doing, we tried to understand its structure, and then we were able to summarize it in useful ways.

And now we're able to go and say something about it. Now, another thing you might ask is what do the eigenvalues of a QK circuit mean? And in our example so far, they haven't been that, they wouldn't have been that interesting. But in a minute, they will be. And so I'll briefly describe what they mean.

A positive eigenvalue would mean you want to attend to the same tokens. An imaginary eigenvalue, and this is what you would mostly see in the models that we've seen so far, means you want to go and attend to an unrelated or different token. And a negative eigenvalue would mean you want to avoid attending to the same token.

So that will be relevant in a second. Yeah, so those are going to mostly be useful to think about in multi-layer attention-only transformers, where we can have chains of attention heads. And so we can ask, well, I'll get to that in a second. Yeah, so that's a table summarizing that.

Unfortunately, this approach completely breaks down once you have MLP layers. With MLP layers, you now have these nonlinearities, so you don't get this property where your model is mostly linear and you can just look at a matrix. But if you're working with attention-only transformers, this is a very nice way to think about things.

OK, so recall that one-layer attention-only transformers don't undergo this phase change that we talked about in the beginning. Like right now, we're on a hunt. We're trying to go and answer this mystery of what the hell is going on in that phase change where models suddenly get good at in-context learning.

We want to answer that. And one-layer attention-only transformers don't undergo that phase change, but two-layer attention-only transformers do. So we'd like to know what's different about two-layer attention-only transformers. OK, well, when we were dealing with one-layer attention-only transformers, we were able to go and rewrite them in this form.

And it gave us a lot of ability to go and understand the model because we could go and say, well, you know, this is bigrams. And then each one of these is looking somewhere. And we had this matrix that describes how it affects things. And yeah, so that gave us a lot of ability to think about these things.

And we can also just write it in this factored form, where we have the embedding, and then we have the attention heads, and then we have the unembedding. OK, well, oh, and for simplicity, we often go and write W_OV for W_O W_V, because they always come together. It's always the case.

It's, in some sense, an illusion that W_O and W_V are different matrices. They're just one low-rank matrix; they're always used together. And similarly for W_Q and W_K: it's sort of an illusion that they're different matrices. They're always just used together, and keys and queries are just sort of an artifact of these low-rank matrices.

So in any case, it's useful to go and write those together. OK, great. So for a two-layer attention-only transformer, what we do is we go through the embedding matrix, then we go through the layer 1 attention heads, then we go through the layer 2 attention heads, and then we go through the unembedding.

And for the attention heads, we always have this identity as well, which corresponds just going down the residual stream. So we can go down the residual stream, or we can go through an attention head. Next up, we can also go down the residual stream, or we can go through an attention head.

And there's this useful identity, the mixed product identity, which tensor products (or the other ways of interpreting this) obey: if you compose two attention head terms, each with an attention pattern and a W_OV matrix, the attention patterns multiply together and the OV circuits multiply together, and they behave nicely.
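
That identity is easy to check numerically (a toy of my own, not from the talk): composing two attention-head terms gives a single "virtual" term whose pattern is the product of the patterns and whose OV matrix is the product of the OV matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_model = 5, 8
X = rng.normal(size=(n_pos, d_model))
A1, A2 = rng.random(size=(n_pos, n_pos)), rng.random(size=(n_pos, n_pos))
M1, M2 = rng.normal(size=(d_model, d_model)), rng.normal(size=(d_model, d_model))  # two W_OV matrices

head1_then_head2 = A2 @ (A1 @ X @ M1) @ M2
virtual_head = (A2 @ A1) @ X @ (M1 @ M2)   # one "virtual head": pattern A2*A1, OV M1*M2
print(np.allclose(head1_then_head2, virtual_head))  # True
```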

OK, great. So we can just expand out that equation. We can just take that big product we had at the beginning, and we can just expand it out. And we get three different kinds of terms. So one thing we do is we get this path that just goes directly through the residual stream where we embed and un-embed, and that's going to want to represent some bigram statistics.

Then we get things that look like the attention head terms that we had previously. And finally, we get these terms that correspond to going through two attention heads. Now, it's worth noting that these terms are not quite that simple, because the attention patterns in the second layer can be computed from the outputs of the first layer, so those are also going to be more expressive.

But at a high level, you can think of there as being these three different kinds of terms. And we sometimes call these last terms virtual attention heads, because they don't exist in the sense that they aren't explicitly represented in the model, but they do, in fact, have an attention pattern.

They have an OV circuit. They're in almost all functional ways like a tiny little attention head, and there are exponentially many of them. It turns out they're not going to be that important in this model, but in other models, they can be important. Right, so one thing that I said is it allows us to think about attention heads in a really principled way.

We don't have to go and think about, well, people look at attention patterns all the time, and I think a concern you might have is that there are multiple attention heads: the information that's being moved by one attention head might have been moved there by another attention head, and not originated there.

It might still be moved somewhere else. But in fact, this gives us a way to avoid all those concerns and just think about things in a single principled way. OK, in any case, an important question to ask is, how important are these different terms? Like, we could study all of them.

How important are they? And it turns out you can just-- there's an algorithm you can use where you knock out attention-- knock out these terms, and you go and you ask, how important are they? And it turns out by far the most important thing is these individual attention head terms.

In this model, by far the most important thing is these individual attention head terms; the virtual attention heads basically don't matter that much. They only account for something like 0.3 nats compared to the terms above, and the bigrams are still pretty useful. So if we want to try to understand this model, the virtual attention heads are not going to be the best place to focus our attention, especially since there are a lot of them.

There are 124 of them for 0.3 nats; it's very little that you would understand from studying one of those terms. So, we know that the direct path is bigram statistics, so what we really want to do is understand the individual attention head terms.

This is the algorithm. I'm going to skip over it for time. We can ignore that term because it's small. And it turns out also that the layer 2 attention heads are doing way more than layer 1 attention heads. And that's not that surprising. Layer 2 attention heads are more expressive because they can use the layer 1 attention heads to construct their attention patterns.

So if we could just go and understand the layer 2 attention heads, we'd probably understand a lot of what's going on in this model. And the trick is that the attention heads are now constructed from the previous layer rather than just from the tokens. So this is still the same, but the attention pattern is more complex.

And if you write it out, you get this complex equation that says: you embed the tokens, and you're going to shuffle things around using the attention heads for the keys, then you multiply by W_QK, then you shuffle things around again for the queries, and then you go and multiply by the embedding again, because those were embedded tokens.

And then you get back to the tokens. But let's actually look at them. So one thing, remember that when we see positive eigenvalues in the OV circuit, we're doing copying. So one thing we can say is, well, 7 out of 12 of the heads, and in fact the ones with the largest eigenvalues, are doing copying.

So we still have a lot of attention heads that are doing copying. And yeah, the QK circuit-- so one thing you could do is you could try to understand things in terms of this more complex QK equation. You could also just try to understand what the attention patterns are doing empirically.

So let's look at one of these copying ones. I've given it the first paragraph of Harry Potter, and we can just look at where it attends. And something really interesting happens. So almost all the time, we just attend back to the first token. We have this special token at the beginning of the sequence.

And we usually think of that as just being a null attention operation; it's a way for it to not do anything. In fact, if you look, the value vector is basically zero; it's just not copying any information from that. But whenever we see repeated text, something interesting happens. So when we get to "Mr.", it tries to look at "and". It's a little bit weak.

Then we get to "D," and it attends to "ers." That's interesting. And then we get to "ers," and it attends to "ly." And so it's not attending to the same token. It's attending to the same token, shifted one forward. Well, that's really interesting. And there's actually a lot of attention heads that are doing this.

So here we have one where now we hit the Potters: we're on "Pot", and we attend to "ters". Maybe that's the same attention head, I don't remember, from when I was constructing this example. It turns out this is a super common thing. So you go and you look for the previous example, you shift one forward, and you're like, OK, well, last time I saw this, this is what happened.

Probably the same thing is going to happen. And we can go and look at the effect that the attention head has on the logits. Most of the time, it's not affecting things. But in these cases, it's able to go and predict when it's doing this thing of going and looking one forward.

It's able to go and predict the next token. So we call this an induction head. An induction head looks for the previous copy, looks forward, and says, ah, probably the same thing that happened last time is going to happen. You can think of this as being a nearest neighbors.
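
A tiny token-level caricature of that behavior (this is the in-context nearest-neighbors idea, not the actual circuit): look back for the last occurrence of the current token and predict whatever followed it.

```python
def induction_predict(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards through the context
        if tokens[i] == current:
            return tokens[i + 1]               # predict "same thing as last time"
    return None                                # no earlier match, no prediction

print(induction_predict(["Mr", "D", "urs", "ley", ".", "Mr", "D", "urs"]))  # -> "ley"
```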

It's like an in-context nearest neighbors algorithm. It's going and searching through your context, finding similar things, and then predicting that's what's going to happen next. The way that these actually work is-- I mean, there's actually two ways. But in a model that uses rotary attention or something like this, you only have one.

You shift your key. First, an earlier attention head shifts your key forward one. So you take the value of the previous token, and you embed it in your present token. And then you have your query and your key go and try to match. So you look for the same thing.

And then you go and you predict that whatever you saw is going to be the next token. So that's the high-level algorithm. Sometimes you can do clever things where actually it'll care about multiple earlier tokens, and it'll look for short phrases and so on. So induction heads can really vary in how much of the previous context they care about or what aspects of the previous context they care about.

But this general trick of looking for the same thing, shifting forward, and predicting that, is what induction heads will do. Lots of examples of this. And the cool thing is that now you can use the QK eigenvalues to characterize this. You can say, well, we're looking for the same thing, shifted by one, but looking for the same thing.

If you expand through the attention heads in the right way, that'll work out. And we're copying. And so an induction head is one which has both positive OV eigenvalues and also positive QK eigenvalues. And so you can just put that on a plot, and you have your induction heads in the corner.

So your OV eigenvalues, your QK eigenvalues, and I think actually OV is this axis and QK is this one, it doesn't matter. And in the corner, you have your induction heads. And so, OK, we now have an actual hypothesis. The hypothesis is that the phase change we were seeing is the discovery of these induction heads.

That would be the hypothesis. And these are way more effective than this first algorithm we had, which was just blindly copy things wherever it could be plausible. Now we can go and actually recognize patterns and look at what happened and predict that similar things are going to happen again.

That's a way better algorithm. Yeah, so there's other attention heads that are doing more local things. I'm going to go and skip over that and return to our mystery, because I am running out of time. I have five more minutes. OK, so what is going on with this in-context learning?

Well, now we have a hypothesis. Let's check it. So we think it might be induction heads. And there's a few reasons we believe this. So one thing is going to be that induction heads-- well, OK, I'll just go over to the end. So one thing you can do is you can just ablate the attention heads.

And it turns out you can color-- here we have attention heads colored by how much they are an induction head. And this is the start of the bump. This is the end of the bump here. And we can see that they-- first of all, induction heads are forming. Like previously, we didn't have induction heads here.

Now they're just starting to form here, and then we have really intense induction heads here and here. And these are the attention heads where, if you ablate them, you get a loss. And so we're looking not at the loss, but at this meta-learning score, or in-context learning score: the difference between the loss at the 500th token and the loss at the 50th token.

And that's all explained by induction heads. Now, we actually have one induction head that doesn't contribute to it; actually, it does the opposite. So that's kind of interesting. Maybe it's doing something shorter distance. And there's also this interesting thing where they all rush to be induction heads, and then only a few win out in the end.
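
The ablation logic itself is simple; here is a sketch on a toy stand-in "model" (nothing here is a real model API, and the per-head contributions are made up) where one head is responsible for most of the late-token improvement:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, seq = 12, 512
base_loss = 5.0 - 0.3 * np.log1p(np.arange(seq))       # toy per-position loss
head_contrib = rng.random(size=(n_heads, seq)) * 0.01  # toy per-head loss reductions
head_contrib[3] = 0.4 * (np.arange(seq) / seq)         # pretend head 3 is the induction head

def per_position_loss(ablate=()):
    keep = [h for h in range(n_heads) if h not in ablate]
    return base_loss - head_contrib[keep].sum(axis=0)

def icl_score(loss, early=50, late=500):
    return loss[late - 1] - loss[early - 1]

baseline = icl_score(per_position_loss())
for h in range(n_heads):
    print(h, round(icl_score(per_position_loss(ablate=(h,))) - baseline, 3))
# Head 3 stands out: ablating it wipes out most of the in-context learning score.
```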

So there's some interesting dynamics going on there. But it really seems like in these small models, all of in-context learning is explained by these induction heads. OK. What about large models? Well, in large models, it's going to be harder to go and ask this. But one thing you can do is you can ask, OK, we can look at our in-context learning score over time.

We get this sharp phase change. Oh, look. Induction heads form at exactly the same point in time. So that's only correlational evidence. But it's pretty suggestive correlational evidence, especially given that we have an obvious-- the obvious effect that induction heads should have is this. I guess it could be that there's other mechanisms being discovered at the same time in large models.

But it would have to be in a very small window. So it really suggests that the thing driving that change is these induction heads. OK. So obviously, induction heads can go and copy text. But a question you might ask is, can they do translation? There are all these amazing things that models can do that it's not obvious in-context learning or this sort of copying mechanism could do.

So I just want to very quickly look at a few fun examples. So here we have an attention pattern. Oh, yeah. I guess I need to open Lexiscope. Let me try doing that again. Sorry. I should have thought this through a bit more before this talk. Chris, could you zoom in a little, please?

Yeah, yeah. Thank you. OK. I'm not -- my French isn't that great. But the text is "My name is Christopher. I'm from Canada." What we can do here is look at where this attention head attends as we step through it, and it'll become especially clear on the second sentence.

So here, we're on the period, and we attend to "je" -- and "je" is "I" in French. OK. Now we're on the "I," and we attend to "suis." Now we're on the "am," and we attend to "du," which is "from." And then from "from," we attend to "Canada." So this is a cross-lingual induction head, which we can use for translation.

And indeed, if you look at examples, this seems to be a major driving force in the model's ability to correctly do translation. Another fun example -- I think maybe the most impressive thing about in-context learning, to me, has been the model's ability to learn arbitrary functions.

Like, you can just show the model a function, and it can start mimicking that function. Well, OK -- I have a question. Yes? So do these induction heads only do kind of a look-ahead copy, or can they also do some sort of more complex structure recognition? Yeah, yeah. So they can both use a larger previous context and copy more abstract things.

So the translation one is showing you that they can copy, rather than the literal token, a translated version -- it's what we might call a soft induction head. And yeah, you can have them copy similar words, you can have them look at longer contexts, and they can look for more structural things.

The way we usually characterize them in large models is just whether they empirically behave like an induction head. So the definition gets a little blurry when you try to encompass these more abstract versions -- there's sort of a blurry boundary. But yeah, there seem to be a lot of attention heads doing more and more abstract versions of this.

And yeah, my favorite version is the one I'm about to show you -- let's isolate a single one of these -- which can do pattern recognition. It can learn a function from the context and then apply it. So I've just made up a nonsense function here.

We're going to encode one binary variable with the choice of whether the first word is a color or a month -- so we have green or June here. Let's zoom in more. So we have a color or a month, and then an animal or a fruit, and we have to map that to either true or false.

So that's our goal, and it's going to be an XOR: we have the two binary variables represented in this way, and we XOR them. I'm pretty confident this was never in the training set, because I just made it up and it seems like a nonsense problem. OK, so then we can ask: can the model predict that?

Well, it can, and it uses induction heads to do it. What we can do is look at a colon where the model is about to predict the next word. For instance, here we have "April dog." It's a month and then an animal, so the answer should be true.

And what it does is it looks for previous cases where there was a month and then an animal -- especially ones where the month was the same -- and reads off that the answer is true. So the model can learn a completely arbitrary function by doing this kind of pattern recognition with induction heads.
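
To give a flavor of the setup, here's a small sketch that generates a prompt of roughly this shape; the word lists and formatting are placeholders, not the exact prompt from the demo:

```python
# Sketch of a made-up XOR prompt like the one described above: the first word
# encodes one bit (color = 0, month = 1), the second word another bit
# (animal = 0, fruit = 1), and the label is their XOR. The word lists and
# formatting here are my own guesses, not the exact prompt from the demo.
import random

COLORS  = ["green", "red", "blue"]
MONTHS  = ["June", "April", "October"]
ANIMALS = ["dog", "cat", "horse"]
FRUITS  = ["apple", "banana", "plum"]

def make_example(rng):
    bit_a, bit_b = rng.randint(0, 1), rng.randint(0, 1)
    first = rng.choice(MONTHS if bit_a else COLORS)
    second = rng.choice(FRUITS if bit_b else ANIMALS)
    label = "true" if bit_a ^ bit_b else "false"
    return f"{first} {second}: {label}"

def make_prompt(n_examples=20, seed=0):
    rng = random.Random(seed)
    lines = [make_example(rng) for _ in range(n_examples)]
    # Drop the final label and see whether the model fills it in correctly.
    query = lines[-1].rsplit(" ", 1)[0]
    return "\n".join(lines[:-1] + [query])

print(make_prompt())
```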

And so this example, to me, made it a lot more plausible that induction heads can really account for in-context learning -- that the generality of all these amazing things we see large language models do can be explained by induction heads. We don't know that; it could be that there are other things going on.

It's very possible that there are lots of other things going on, but it seems a lot more plausible to me than it did when we started. I'm conscious that I'm actually over time, so I'm going to quickly go through these last few slides. I think thinking of this as in-context nearest neighbors is a really useful way to look at it.
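
Here's a toy version of that in-context nearest neighbors picture, purely as an illustration of the behavior being described, not of the mechanism inside the model:

```python
# Toy illustration of "in-context nearest neighbors": to predict the next
# token, find the earlier position whose recent context best matches the
# current one, and predict whatever followed it. This mimics induction-head
# *behavior*; it is not how the network actually implements it.
def nearest_neighbor_predict(tokens, window=3):
    context = tokens[-window:]
    best_pos, best_score = None, -1
    # Note the quadratic-in-length work per prediction -- cheap for attention,
    # awkward for a constant-state recurrent model.
    for i in range(window, len(tokens)):
        prev = tokens[i - window:i]
        score = sum(a == b for a, b in zip(prev, context))
        if score > best_score:
            best_pos, best_score = i, score
    return tokens[best_pos] if best_pos is not None else None

seq = "the quick fox jumped over . the quick fox".split()
print(nearest_neighbor_predict(seq))   # -> "jumped"
```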

Other things could absolutely be contributing. But this might explain why transformers do in-context learning over long contexts better than LSTMs. An LSTM can't do this, because nearest neighbors isn't linear in the amount of compute it needs -- it's quadratic over the sequence, or maybe n log n if you were really clever -- so it's essentially impossible for an LSTM to implement this.

Transformers do do this. And actually, they diverge at the same point -- but I can go into that in more detail afterwards, if you want. There's a really nice paper by Marcus Hutter trying to predict and explain why we observe scaling laws in models, and it's worth noting that the arguments in that paper carry over exactly to this theory.

In fact, they work better for the case of thinking about this in-context learning as, essentially, a nearest neighbors algorithm than they do in the regular case. So yeah, I'm happy to answer questions. I can go into as much detail as people want about any of this. And if you send me an email, I can also send you more information about all of it.

And yeah, again, this work is not yet published. You don't have to keep it secret, but if you could be thoughtful about the fact that it's unpublished work that's probably a month or two away from coming out, I'd be really grateful. Thank you so much for your time.

Yeah, thanks a lot, Chris. This was a great talk. I'll just open it up to some general questions, and then we can do a round of questions from the students. So I was very curious to know: what is the line of work you're currently working on? Is it extending this?

What do you think are the next things you'll try in order to make these models more interpretable? What's next? Yeah. I mean, I want to just reverse engineer language models. I want to figure out the entirety of what's going on in these language models. And one thing we totally don't understand is MLP layers.

We understand some things about them, but we don't really understand MLP layers very well. There's a lot of stuff going on in large models that we don't understand. I want to know how models do arithmetic. Another thing that I'm very interested in is what's going on when you have multiple speakers.

The model can clearly represent multiple speakers in a dialogue -- it has a basic theory of mind -- and I want to understand what's going on with that. But honestly, there's just so much we don't understand that it's hard to answer the question; there's just so much to figure out.

We have a lot of different threads of research on this. The interpretability team at Anthropic has a bunch of threads trying to figure out what's going on inside these models, all with a similar flavor to this: how do the parameters actually encode algorithms?

And can we reverse engineer those into meaningful computer programs that we can understand? Got it. Another question I had is: you were talking about how transformers are inherently trying to do meta-learning, and you spent a lot of time talking about the induction heads, which was very interesting.

But can you formalize the sort of meta-learning algorithm they might be learning? Is it possible to say, oh, maybe this is the internal algorithm that's making them good meta-learners, or something like that? I don't know. Yeah, I mean, I think there are roughly two algorithms.

One is the algorithm we saw in the one-layer model -- and we see it in other models too, especially early on -- which is just: try to copy. You saw a word, so a similar word is probably going to happen later; look for places it might fit in and increase the probability.

So that's one thing we see. And the other thing we see is induction heads, which you can basically summarize as in-context nearest neighbors. Possibly there are other things, but those two algorithms, and the specific instantiations we're looking at, seem to be what's driving in-context learning.

That would be my present theory. Yeah, that sounds very interesting. OK, so let's do a round of questions. Feel free to go ahead with your questions. Thank you.