Back to Index

Stanford CS25: V5 I On the Biology of a Large Language Model, Josh Batson of Anthropic


Transcript

So today, it's my pleasure to welcome Joshua Batson from Anthropic. So he'll be talking about On the Biology of a Large Language Model, which should be a very interesting talk. So Josh leads the circuits effort of the Anthropic Mechanistic Interpretability Team. Before Anthropic, he worked on viral genomics and computational microscopy at the Chan Zuckerberg Biohub.

And his academic training is in pure mathematics, which is very cool and impressive. And just another thing, which is some more recordings for this quarter have been released, like Karina's and Div's talks. So feel free to check those out on our YouTube playlist. And for the folks on Zoom, feel free to ask questions either on the Zoom or the Slido with the code CS25.

And without further ado, I'll hand it off to Josh. Thank you. Clap. Come on. We're getting started. Okay. It's a pleasure to be here. It's crazy to me that there is a class on Transformers now, which I think, you know, were invented rather recently. So we've got, like, an hour for the class, I guess, and then, like, 15, 20 minutes for questions.

Feel free to just, like, interrupt me with questions. Like, as he said, like, my training's in pure mathematics and, like, people just interrupt each other all the time very rudely. And it's totally fine. And so if I want to just cut you off and move on to something, I will.

But, you know, this can kind of be as interactive as you guys would like it to be. And that's true for the people on Zoom, too. Okay. So this talk is titled On the Biology of a Large Language Model, which is also the title of a paper, which is what we call our, like, 100-page interactive blog post that went out a few weeks ago.

And if you're here, you probably know something about large language models. The biology word is, like, you know, was our sort of choice here. And maybe you should contrast it to, like, you know, a series of papers called On the Physics of Large Language Models, where you're sort of thinking of them as, like, dynamical systems over the course of training.

But we sort of think of, you know, interpretability as being to neural networks, which are trained by gradient descent, right, what biology is to living systems that are developed through evolution. So you have some process that gives rise to complexity, and you can just study the objects that are produced for, like, how they do the kind of miraculous things that they do.

So models do lots of cool things. I don't need to tell you guys this. This is an example. Oh, I don't know when it's from now. Maybe six months ago, which is, like, 10 years in AI time. But I quite enjoyed this. This is just somebody who was working on natural language processing for Circassian, which is, like, a very low-resource language.

Not many people speak it. There's not many documents. And he'd been sort of tracking the state of the art in NLP for many years, like, you know, using the best models at the time to try to, like, just help translate into this language, translate from this language, you know, help preserve it.

And tried it with a version of Claude, where I think this was probably Sonnet 3.5, where he just shoved this master list of Russian-Circassian translations into the context window. So just, like, painstakingly gathered over years. Rather than train a model, just put it in the context window and then just ask the model to do, like, other translations.

And it could not only translate it successfully, but also kind of, like, break down the grammar, right? And so just in-context learning with these models was sort of enough to beat the state of the art for the NLP-specific models that he'd been working on for a while. So, like, that's cool, would be my summary of that.

But models are also weird. So this was also Claude. And someone asked it what day is tomorrow on Leap Day. And it just, like, got in a big fight, you know. If today is February 29th, 2024, then tomorrow would be March 1st. However, 2024 is not a leap year, which is untrue.

So February 29th is not a valid date in the Gregorian calendar. Okay, now it starts looking at the rules. Following this rule, 2000 was a leap year. 2100 will not be, like, fine. And then it says the next leap year after 2024 will be 2028, which is true.

And then so if we assume you meant February 28th, 2024, the last valid date in February, then it gives a... It's just, like, what is going on, right? There's, like, some smorgasbord of, like, correct recollection of facts, correct reasoning from the facts, and then, like, disregarding them out of consistency with this, like, initial claim. It's just pretty weird for it to be Leap Day.

So, like, that's odd, right? I mean, if a person were doing this, you would wonder, like, what was in the brownies they had consumed. There's just this, like... Now, children are sort of like this, so maybe that's an interesting topic. And then I just love this. AI art will make designers obsolete.

AI accepting the job. And it's got so many fingers. This is, like, now out of date, right? Like, people have figured out how to keep the finger count down to at most five per hand. You know, the new ChatGPT image model, right, can do, like, extremely realistic people, and they all have five fingers now.

But, like, that wasn't exactly solved by figuring out, like, why are there so many fingers on these? You know, just sort of, like, other methods got through it. But, like, presumably, you kind of bat down some of this weirdness, and now it's just the weirdness is more sophisticated. So, as the models get better, you need to get better at understanding just where, like, the craziness has sort of gone.

So, as the frontier moves forward, like, maybe it's not five fingers, but there might be subtly other things that are wrong. And when I think about interpretability, which is sort of, like, what did the model learn exactly? How is it represented inside the model? How does it manifest in behaviors?

I am sort of thinking ahead to when, you know, most of the simple interactions seem to go well. And then it's, like, are these going well because fundamentally the model learned something, like, deep and true? Or because you've managed to kind of beat this down, like, the finger problem.

But, like, if you went out to the edge of the model's capabilities, it would all be seven fingers again. But you can't tell. And also, because it seems pretty reliable, you've delegated a lot of decision-making and trust to these models. And it's in the corner you can no longer verify where things get weird.

So, for that reason, we kind of want to understand, you know, what's going on with these capabilities. All right. So, this will be a bit of a review at the beginning of, like, why models are hard to understand. Some strategies for picking them apart. And then three main lessons I think we've learned about how models work inside that, like, aren't obvious from a black box way of engaging with them.

So, here are sort of three statements that are, like, somewhere between myths or things that are out of date or, like, true if you interpret them in one way philosophically but missing the point. And so, these statements are that models just pattern match to similar training data examples, that they use only, like, shallow and simple heuristics and reasoning, and that they just kind of work one word at a time.

You know, just kind of, like, gutting out the next thing and some, like, eruption of spontaneity. And I think, you know, we find that models learn and can compose pretty abstract representations inside. That they perform rather complex and often heavily parallel computations, not serial ones, so they're doing a bunch of things at once at the same time.

And also, that they plan many tokens into the future. So, even though they say one word at a time, they're often thinking ahead quite far to be able to make something coherent, which will work for that. Okay. This is probably a review for this class, but we've made these nice slides, and I think it's nice to go through anyway.

So, you have a chatbot, which says, hello, how can I assist you, as an answer. How does this actually happen, right? You know, it is saying one word at a time by just predicting the next word. So, how goes through Claude and predicts can, and how can goes to I, and how can I goes to assist, and how can I assist goes to you.

So, you can reduce the problem to the computation that gives you the next word, and that is a neural network here. I've drawn, like, a fully connected network. To turn language into language, passing through numbers, you have to turn things into vectors first. So, there's an embedding, so every word or token in the vocabulary has an embedding, which is a list of numbers or a vector.

Morally speaking, you basically just concatenate those together and run them through, you know, a massive, you know, neural network with a lot of weights. Out comes a score for every word in the vocabulary, and the model says the highest scoring word, modulo some temperature, to introduce randomness.
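A minimal sketch of that loop in PyTorch, assuming a hypothetical `model` that maps a sequence of token ids to per-position scores over the vocabulary (illustrative only):

```python
import torch

def generate(model, token_ids, n_new_tokens=5, temperature=0.7):
    # Repeatedly predict the next token: score every word in the vocabulary,
    # then sample the high-scoring ones, modulo some temperature.
    for _ in range(n_new_tokens):
        logits = model(token_ids)                 # [seq_len, vocab_size], hypothetical interface
        probs = torch.softmax(logits[-1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        token_ids = torch.cat([token_ids, next_id])
    return token_ids
```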

Transformer architectures are more complex. There's these residual connections, alternating attention, and MLP blocks, which you could think of as, like, baking in really strong priors into a massive MLP. But, in some sense, that's just about efficiency. So, a metaphor we've found useful, which is also biological, is that language models should be thought of as grown, not built.

Like, you kind of start with this randomly initialized thing, an architecture, which is like a scaffold. You, like, give it some data. Maybe that's, like, a nutrients, and then the loss is, like, the sun, and it grows towards that. And so, you get this kind of organic thing, which has been made by the end of it.

And, you know, but the way that it grew, you don't really have any access to. The scaffold you have access to. But that's like looking at a model at init, which tends not to be that interesting. Okay. So, of course, like, we have the models. And so, there is a tautological answer to what are they doing, which I already told you, which is they turn the words into numbers.

They do a bunch of matmuls. They apply, like, simple functions. It's just math all the way through. And then you get something out. And it's like, that's it. That's what the model does. And I think that's an unsatisfying answer because you can't reason about it, right? Like, that answer to how do models work, like, doesn't tell you about what behaviors they should or shouldn't be able to do or kind of any of those things.

So, the first thing you might hope is that the neurons inside the neural network might have interpretable roles. You know, people have been hoping for this going back to the first networks in the 80s. And there was a bit of a resurgence in deep learning, a kind of small one. Chris Olah, who leads the team at Anthropic, got really into this 10 years ago.

You just, like, look at a neuron and you ask, when does this fire? For what inputs is this neuron active? And then you just sort of see, like, do those form a coherent class, right? This is the car detector neuron in a vision model. This is the eye detector.

This is, like, the edge detector if it's early in the model, you know. They found, like, the Donald Trump neuron in a CLIP model, for example. But it turns out that in language models, when you do this, you just say, which sentences cause this neuron to activate? The answer doesn't make that much sense.
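The check itself is simple to state; a rough sketch, assuming a hypothetical `model.mlp_activations` hook that returns per-token neuron activations for a tokenized snippet:

```python
def top_activating_examples(model, neuron_index, dataset, k=20):
    """Collect the dataset snippets on which one neuron fires hardest --
    the basic 'when does this fire?' check described above."""
    scored = []
    for text, token_ids in dataset:                   # (string, tensor) pairs
        acts = model.mlp_activations(token_ids)       # [seq_len, n_neurons], hypothetical hook
        scored.append((acts[:, neuron_index].max().item(), text))
    return sorted(scored, reverse=True)[:k]
```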

So, here's a visualization of a bunch of example text for which a neuron in a model activates. And there's just a lot of stuff. There's, like, code and some Chinese and some math and hemlock for Socrates. It's not especially clear. And, like, of course, there's no reason it would need to be, right?

Like, it's just a learned function. And so asking for a neuron to be interpretable is, like, a bit of a Hail Mary. And it's pretty cool that that, like, works sometimes. But it's, like, not particularly systematic. So there's a prior from neuroscience, which is that, like, maybe while there's a whole bunch of neurons going on or something, at any given moment, maybe the model's, like, not thinking of that many things at once.

Maybe there's some sparsity here, where if there were a map of the concepts the model is using or the subroutines it's using or something, that on any given token, it's only using a few at a time. And that's just, you know, maybe a slightly better guess than maybe the neurons are interpretable.

It's, like, not necessarily a great prior, but it's something you can work with. And so you can fit linear combinations of neurons such that each activation vector is a sparse combination of these dictionary elements. This is called dictionary learning, classical ML. And we just did it. So you just, like, gather a bunch of activations from the model when you put through a trillion texts or something.

And then you take those vectors, you look for dictionaries, and then you look at when are those dictionary components active. And lo and behold, it's way better. So, you know, we had a paper last year, which was just, like, here's a bunch of them where we fit about 30 million features on Claude 3 Sonnet on the middle layer of the model.

So you just say, you know, what are the atoms of computation or representation inside that model? This was one of my favorites, where this linear combination is present, or it's, like, the dot product with that vector is large, when the input is about the Golden Gate Bridge. And that is true if it is, like, an explicit mention of the Golden Gate Bridge in English on the left.

Also true if it's translated into another language. Also true if it's an image of the Golden Gate Bridge. Also true, it turns out, if it's an indirect mention. So you're, like, I was driving from San Francisco to Marin, right, which, you know, you cross the bridge to do. And the same feature is active there, right?

So it's, like, some relatively general concept, also in, like, San Francisco landmarks, et cetera. So these combinations of neurons are interpretable. We were happy with this. There's things that are more abstract, you know, notions of inner conflict. There was a feature for, like, bugs in code that, like, fired on kind of small bugs, like division by zero or typos or something.

In many different programming languages, if you suppressed it, the model would act like there wasn't a bug. If you activated it, the model would give you a traceback as if there were a bug. And so, you know, it sort of had this general properties. But there was something very unsatisfying about this, which is, like, how does it know it's the Golden Gate Bridge?

And so what? You know, what does it do with that information, right? So even if you manage to, like, piece apart the representations, that's just a cross-section. It kind of doesn't give you the why or the how; it just gives you the what. And so we wanted to find ways of connecting these features together.

So you start from the input words, and you're going to process these to higher-order representations, and eventually it'll say something, and try to trace that through. How are the features chosen that you showed earlier? Through optimization. So very concretely, you have a matrix, which is, like, you take a billion examples of activation vectors, and then you've got d_model here.

That's a matrix. And you try to factorize that matrix into a product of a fixed dictionary of atoms times a sparse matrix of which atoms are present in which example. And you just have an objective function, which is this should reconstruct the data and some L1 penalty to encourage sparsity.
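Written out (my notation, not from the slides), that objective is roughly

$$\min_{D,\ \{s_i\}} \sum_i \left\lVert x_i - D\, s_i \right\rVert_2^2 \;+\; \lambda \sum_i \lVert s_i \rVert_1$$

where the $x_i$ are the activation vectors, $D$ is the fixed dictionary of atoms, and the $s_i$ are the sparse codes saying which atoms are present in which example.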

It's a joint optimization problem. Yeah, that's right. That's right. So we tried being clever about this. There's, like, a beautiful, rich literature on dictionary learning. But after a few months of that, the bitter lesson got us. And it turned out that you could just use, like, a one-layer sparse autoencoder, which you can, like, put in Torch and then just, like, train on your GPUs.
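A minimal sketch of that one-layer sparse autoencoder in PyTorch (illustrative only, not their training code; `d_model` and `n_features` are the activation and dictionary sizes):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-layer sparse autoencoder over residual-stream activations."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))    # sparse, non-negative feature activations
        return self.decoder(feats), feats      # reconstruction plus the codes

def sae_loss(x, recon, feats, l1_coeff=1e-3):
    # Reconstruct the activation vector, plus an L1 penalty to encourage sparsity.
    return ((recon - x) ** 2).sum(-1).mean() + l1_coeff * feats.abs().sum(-1).mean()
```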

And, like, that scaling was more important than being clever yet again. So that's what it is. It's a sort of sparse autoencoder. So here's a prompt. Okay, so the capital of the state containing Dallas is Austin. This is true. That's because the state containing Dallas is Texas. Texas? Good.

Did the model do that? Was it, like, Texas, so Austin? Or is it, like, I don't know. It's seen a lot of training data. Like, was this just in it? And it's just, like, reciting the answer. You know, people, there's a lot of eval contamination. Like, a lot of MMLU made it into people's training sets.

So you get high scores. Well, it just knew the answer to that, literally, because it had seen it. So did it literally see this before? Or is it kind of thinking Texas like you did? It's thinking about Texas. So I'll tell you, this is like a cartoon, and we can slowly, through this, like, peel back these abstractions to get to, like, literally what we did.

But I'll sort of start by saying that you should think of each of these as sort of bundles of features, which are, again, atoms we sort of learned through an optimization process. We label the role of each feature by looking at when it's active and trying to see if we can describe kind of in language, you know, what that feature is doing.

And then the connections between them are actually direct causal connections as the model, you know, processes this in a forward pass. And so what we do is we break apart the model into pieces. We ask, you know, which pieces are active, can we interpret them separately, and then how does it flow?

And so in this case, we found, you know, a bunch of features related to capitals, a bunch related to states in the abstract, a bunch related to Dallas. We found some features that were, like, you can think of as, like, motor neurons. They make the model do something. In this case, they make it say the name of a capital.

Like, they make it say a bunch of state capitals or country capitals or something. So that's a start. But also it has to get it right. And so, you know, there's some mapping from Dallas into Texas. And there's a pile of features there. Some are, like, you know, discussions of, like, Texas politics.

Some are, like, phrases like everything's bigger in Texas. And once you have Texas in a capital, you get saying Austin in particular. And, you know, if you're saying a state capital and also you're saying Austin, what you get is, like, Austin coming out. There's some interesting straight lines, though.

You know, so, like, Texas also feeds into Austin directly, right? If you're thinking about Texas, you might just be thinking about Austin. And so, you know, I'll get into more of how we build this. But, you know, this gives you a sort of a picture in terms of these atoms or dictionary elements we've learned.

And you might want to check that they make sense. So then you can do interventions on the model, deleting pieces of this. Like, in neuroscience, these would be ablations of neurons or something. And then you see if the output of the model changes as you would expect. Someone on Zoom asked, though:

Dallas and Austin frequently appear together on the internet, suggesting simple statistical associations. So is this a way of over-complicating things? I think that people have found that transformers have outperformed, like, n-gram models for most tasks. So, you know, Houston also occurs, you know, with Dallas in the training data. And the model doesn't say Houston.

So it must be using the capital and Dallas. In this case, I think you could say, you know, maybe if you just had capital and Dallas, then it would say Austin. And that's actually true, I think. Some of these edges were pretty weak. And it turns out you could just say the capital, you know, make something ungrammatical.

Just like, you know, the capital of Dallas is. And it will say Austin. And that's an interesting thing, where I think you look at the graph and this edge is weak. And it indicates that actually maybe it is just like if you have capital and Dallas in proximity, you might get Austin.

Which is then causally true. Yes. It's like a, I mean, is this layer by layer? The structure of that, I mean, is it vertical, do you have one layer and then the layer above it? Yeah. So this is the flow through layers of the model.

Okay. Upstairs. And we've done some pruning here. So I will show in more detail. Let me just see when we're going to get to that. Yeah. So I'm going to give you a lot more detail in just about a slide on the technical side. Yes. We did a scan where we trained dictionaries of different sizes and we, you know, you get some tradeoff of the compute you spend and the amount or the accuracy of the approximation.

Like, you know, these are supposed to reconstruct the activations. How well does that happen? The bigger the dictionary, the better it does. Also, the denser it is, the less sparse it is, the better that does. But you pay some price on interpretability at some point. So we did a bunch of sweeps and then we did something that seems good enough.

There's still a bunch of errors and, you know, I'll show you later how those turn out to, you know, mean that we can't explain, you know, a lot of things. So, yeah, literally what we do here is train a sparse replacement model. So the basic idea is you have a model with a residual stream and MLPs.

We're going to forget about attention right now. And we're going to try to approximate that with these cross-layer transcoders. And so what do I mean by that? So a transcoder is something that, like, you know, transforms information. So it emulates the MLPs, but the cross-layer part is that this ensemble of them emulates all the MLPs.

There's an architecture called DenseNet from a few years ago, right, where every layer writes to every subsequent layer. And this is like that. So the basic idea is that this ensemble of CLTs has to take in the inputs of all the MLPs stacked together and produce the vector of all of their outputs at once.
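Schematically, the cross-layer wiring looks something like this (a sketch under my own assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch of the cross-layer idea: features read the MLP input at
    layer l and write to the MLP outputs of layers l and above."""
    def __init__(self, n_layers, d_model, n_features):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, n_features) for _ in range(n_layers))
        # decoders[l][k] maps layer-l features to the layer-(l+k) MLP output
        self.decoders = nn.ModuleList(
            nn.ModuleList(nn.Linear(n_features, d_model, bias=False)
                          for _ in range(n_layers - l))
            for l in range(n_layers))

    def forward(self, mlp_inputs):            # one [batch, d_model] tensor per layer
        outputs = [torch.zeros_like(x) for x in mlp_inputs]
        features = []
        for l, x in enumerate(mlp_inputs):
            feats = torch.relu(self.encoders[l](x))     # sparse features at layer l
            features.append(feats)
            for k, dec in enumerate(self.decoders[l]):  # write to this and later layers
                outputs[l + k] = outputs[l + k] + dec(feats)
        return outputs, features
```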

And a reason to do this is, like, there's no particular reason to think that, like, the atomic units of computation have to live in one layer. If there's two consecutive layers in a deep model, those neurons are almost interchangeable. People have done experiments. You can actually swap the order of transformer layers without damaging performance that much, which means that, like, indexing that hard on the existing layer is maybe unwise.

So we just say, okay, you know, these can skip to the end. This ends up making the interpretability easier because sometimes it's just bigram statistics. I see this word. I say that word. That's evident at layer one, but you have to keep propagating it all the way through to get it out.

And so here we could make that be one feature instead of, you know, dozens of features and consecutive layers interacting. And then we just train that optimization. You've got a loss on accuracy and a loss on sparsity. Okay. And so here we replace the neurons with these features. We use just the attention from the base models.

We do not try to explain the attention. We just flow through it. But we try to explain what the MLPs are doing. And now, instead of the neurons, which are sort of uninterpretable, like, on the left here, we have stuff that makes more sense. On the right, this is a "say a capital" feature.

And I think it's probably more specific. So this is, like, in, you know, these, like, literal state-to-capital mappings. Yeah. So, I don't know about reasonable; I'd say it's expedient. It loses a lot. I think that from a practical perspective, though, if you want to model the action of the attention layer, or anything that moves information between tokens, then the input would have to be the activations at all of the token positions.

And here we only can do one token position at a time. So you can learn something much simpler. And I think the learning problem is just, like, a lot easier. It's also, like, slightly less clear what the thing to replace the attention layer is that would be interpretable, because it needs to be a system for both transforming information and moving it.

And sparsity isn't a good prior there. You've got, like, a four tensor instead of a two tensor, and we aren't sure what the right answer is. So we just did this for now. Is there a potential fear that once you are replacing it, you are implying your interpretation of what the relevant features are?

Right. So there's two questions there. One is, like, what do you lose by doing your replacement? And the other is, like, you know, how much are you leaning on your interpretations of the components you get? And what is an element, a feature? Yeah. Or something that you have to put into the sparse autoencoder, right?

Yeah. Well, so the sparse autoencoder just produces a bunch of these, right? And then, but we do have to interpret what comes out. And so for any particular graph, you know, we can go, we have now, the attention is frozen. We have a forward pass of the model, sort of through the replacement, where we've got a bunch of the features.

We've got these triangles, and these diamonds are the errors. So, like, these don't perfectly reconstruct the base model, so you have to include an error term. And then you can just, you know, track the influence directly. These are just linear maps until you get to the end. And, you know, there'll be a lot of these active, fewer than there are neurons, but, you know, order hundreds per token.

But they don't all matter for the model's output. So then you can sort of go from the output, like Austin here, backwards, and say which features were directly causally relevant for saying Austin, and then which features were causally relevant for those being active, and then which are causally relevant for those being active.

And in that way, you get a much smaller thing, something you could actually look at to try to understand why it said this literal thing here. And now you're in, but this is all still math, right? Now you have this graph. But if you want to interpret this, then you need to look at, you know, these individual components and see if you can make sense of them.
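A toy version of that backward walk, with `edges` mapping each node to its weighted inputs (my own simplification; the real pruning is more careful than this):

```python
from collections import defaultdict, deque

def prune_backwards(edges, output_node, threshold=0.01):
    """Walk backwards from the output, keeping only nodes whose direct or
    indirect influence on it is big enough."""
    influence = defaultdict(float)
    influence[output_node] = 1.0
    kept, queue = {output_node}, deque([output_node])
    while queue:
        node = queue.popleft()
        for source, weight in edges.get(node, {}).items():
            influence[source] += influence[node] * abs(weight)
            if influence[source] >= threshold and source not in kept:
                kept.add(source)
                queue.append(source)
    return kept
```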

And now you're back to, like, looking at when was this active and which examples, hoping that's kind of interpretable and, you know, that the connections between them make any sense. And, you know, often these will be in some rough categories. So here's, like, one Texas feature is, like, again, the, like, everything's bigger in Texas.

Texas is a big state known for its cowboys and cowgirls. And this other one is about, you know, politics and the judicial system. And, you know, I'm not saying that, like, these are, like, the right way to break apart the network. But they are a way to break apart the network.

And if you look at the flow in here, you know, these Texas features feed into the, say, Austin ones, right? There is a path here. And so we sort of will manually group some of these based on, at this point, human interpretation to get a map of what's going on.

Yeah. There's a Zoom question. In CLT architecture, we freeze attention blocks and replace MLPs with transcoders that speak to each other. This is complex, right? So won't this be adding unnecessary complexity when the underlying connections could be a lot simpler? Also, do you mind repeating the questions so the folks on Zoom can hear?

Yes. The question is, boy, this seems like a lot of extra work. You know, there's a lot of connections between these features, right? They're mediated by attention. And I would agree. But we couldn't find a way of doing less work and have the units be interpretable, right?

One way to think about this is the base components of the model aren't that interpretable. So if you want to interpret things as interactions of components, like, if you want to use, like, the great reductionist strategy that's been quite effective, and you can break organs into, you know, cells and understand how the cells interact, you, like, have to break it apart in some way.

You lose a lot when you do that. But you gain something, which is you can talk about how those parts work. And so this was our best guess today as, like, parts to study. Okay. So we're still in schematic land here. Once we have grouped things to make a graph like this, we can do interventions.

So, and these interventions are in the base model, not in our complicated replacement model. They amount to basically adding in vectors that are the feature outputs from one prompt to another one. And you can see if the perturbations make sense. So if we swap out the Texas by muting those and add in the California features from another prompt, then the model will say Sacramento.

If you put in Georgia, it will say Atlanta. If you put in the Byzantine Empire, it will say Constantinople. And so here it's a sign that, like, we did capture, like, an abstract thing in a manipulable way, right? You don't put it in and get gibberish, right? And, you know, if you're just doing bigram statistics or something, it's not clear how you would get this separability.
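As a sketch, an intervention like the Texas-to-California swap can be written as a forward hook on the base model (the `model.layers` attribute and the two feature decoder vectors here are hypothetical):

```python
import torch

def steer(model, token_ids, layer_index, remove_vec, add_vec, scale=2.0):
    """Subtract the decoder vector of the muted features at one layer and
    add the vector of the injected ones, then run the base model."""
    def hook(module, inputs, output):
        # output: [batch, seq, d_model]; returning a value replaces the layer's output
        return output - scale * remove_vec + scale * add_vec

    handle = model.layers[layer_index].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(token_ids)
    finally:
        handle.remove()
    return logits[:, -1].argmax(-1)   # top prediction shifts, e.g. from Austin to Sacramento
```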

Okay. So I'm going to get into these three kinds of motifs that we see a lot. The abstract representations, in a medical context and a multilingual context. Parallel processing motifs, which is about arithmetic, some jailbreaks, and hallucinations. And then also some elements of planning. Yeah. Do these whole graphs maybe only show up at a certain scale, or do they appear at, like, a much smaller model?

Yeah. So this approach of, like, training a replacement model of an attribution graph is just math. So you can do whatever you want. And then it's like, are they interpretable, you know, as a question? We do this on a small 18-layer model in one of the papers, which can't do very much, and find pretty interpretable things.

So I think this works at all scales. I know people who are doing these on, like, you know, sort of, now, how interesting is it? If your model can't do anything, maybe it's not that interesting. But I think if your model is narrow, purpose, and small, this would still be useful.

Since you're talking about biology here, this all seems like something that maybe in a different field has been done to, kind of, human brains. Do you know, is there any kind of overlap, like, literature that you are relying on from the real, like, medicine side of things?

For some inspiration, I think in particular this, like, you know, the idea of doing the perturbation, the causal perturbation, and seeing what happens. It's like a very neuroscience, optogenetics-y thing to do. Fortunately or unfortunately, like, our experimental setup is so much better than theirs. So I think we're now, like, well past what people can do in neuroscience.

Because we have, like, one brain, and we can study it a billion times, and we can intervene on everything and measure everything. And they're in there, like, trying to capture, like, 0.1% of neurons, like, at a time resolution a thousand times worse than the actual activity. So, um, yeah.

Maybe the other way around, like, we can do them. Yeah. Yeah, yeah. Um, okay, I want to give you, you know, just, I think it's good to, like, be able to, like, actually, um, see some of these. So this is, in the paper itself, this is, like, what we make these cartoons from.

Um, okay, so each of these nodes is, like, a feature, and an edge is, like, a causal influence. Um, here's, um, you know, uh, one whose direct effect on the output is saying Austin, mostly, and also some Texas things. Um, it's getting inputs from things related to Texas and from states.

And, um, you know, these, like, the, the art of this is, like, now you're doing the interpretation. You're, like, bouncing around, looking at these components, looking at when they're active, looking at what they connect to, um, and trying to figure out what's going on. Um, and so the cartoons I'll show you are given by grouping sets of these based on common properties.

So I said these are "say a capital" features, and you can see that these are different from each other, but they all do involve the model, you know, saying capitals in some context. And so we just sort of piled those on top of each other. And this is what, like, I don't, you know, how many features should you have?

I don't know. There's obviously no right answer here, I think. Um, it's not a perfect approximation, but you sort of break it into 30 million pieces, then you put the pieces back together in ways that make a bit of sense, and then you do interventions to check that, like, what you learned is real.

Um, okay. So now. Uh, and then maybe this is the biggest difficulty in studying this kind of system. And clearly, LLMs, they kind of start displaying these emergent properties, so if you start breaking it down, at the same time you end up losing those emergent properties. So how can you balance that, uh, you know, when you can end up losing all these properties that just emerge when everything is together?

That's a good question. Um, so I think one thing that makes LLMs kind of different is this flow of information from the input to the output, um, in which it seems like the, like, latent spaces work at a higher level of representation or complexity as you move through it.

Um, and so, you know, like, all cells are somehow, like, the same size, they communicate with each other, but it's all, like, lateral communication. Um, and, um, you know, I think you do see the same in the brain as you go from, like, you know, the first, um, light-sensitive cells through the visual cortex.

Um, where, um, you will get, you know, ultimately, like, a cell which is sensitive to a particular face, um, but that comes from things sensitive to less specific things, and ultimately things that, like, just detect, like, edges and shapes and those kinds of structures. And so, um, I think as you move through that, um, you, you do get these higher levels of abstraction.

And so, um, you know, when I say, you could, there's a feature that seems to correspond to errors in code. Um, this is one of our atomic units, you know, it's, like, one of the things we're getting when we shatter this is, like, sensitive to errors in code, but, like, in a very general way.

Um, and so, to some extent, I think that, like, the, these are built up hierarchically, but that it's just, maybe they're still units. There's another thing that I think we don't have any traction on, which is, like, if it's doing some, like, in context, like, dynamical system stuff, um, that, I think that will be much harder to, to understand.

Um, and you might need, like, much larger ensembles of these. Um, okay. Um, let's slideshow again. Okay. Um, so, here's, like, a medical exam-style differential diagnosis question. Uh, 32-year-old female at 30 weeks gestation presents with severe right upper quadrant pain, et cetera, given this context. Um, if we could only ask for one other symptom, what should it be?

Um, does anybody, is there a doctor in the house? Does anybody know what the, what the medical condition is here? Um, yeah. So, um, ask about visual disturbances, and the reason is because this is, like, the most likely diagnosis is preeclampsia here, um, which is a severe, severe complication of pregnancy.

Um, and, you know, we're sort of able to trace through by, like, looking at these features and how they're sort of grouping together and how it flows, you know, that is pulling things from pregnancy, from, you know, that, that region, the headache, the high blood pressure, the liver tests, um, to, to preeclampsia.

Um, and then from that to other symptoms of preeclampsia. Um, and then from those, you know, ultimately, um, there's actually two complete visual disturbances, um, and, you know, there's a tiny box here showing, like, you know, what do we mean by that? Those are the, this component is also active on training examples discussing, you know, causing vision loss, um, floaters, you know, in the eye, which is another visual disturbance, lead to loss of vision.

Um, the other answer would be proteinuria. And so, you know, it is, it is, in the same way we're thinking about Texas, it's thinking about preeclampsia in some way before, like, saying what these, what these are. And we can see that intermediate state. Um, it is thinking about other potential diagnoses, like, uh, biliary system disorders.

And if we suppress the preeclampsia, um, then, um, it will say, instead of visual disturbances with a rationale, because this presentation suggests preeclampsia, it will say decreased appetite, because the scenarios suggest, um, biliary disease. Right? And so, it is thinking of these options when you turn one off, then you get, like, a coherent answer consistent with the other one.

So is it one of those main effects, that you can intervene so cleanly because this is only represented in exactly one place in the network, and that by intervening there, you know, you don't need to intervene in multiple places? Well, so, no, um, because, two reasons, three reasons, okay, the three reasons are, one, you know, this node is a group of them, right, it's a few of these features, um, that are all related to preeclampsia.

The second is because the cross-layer transcoder, each of them writes to many layers, um, and then the third is we're creaming it, so here, we're turning it off double the amount that it was on, um, you know, we, we overcorrect, and you often need to do that to get the full effect, probably because there's redundant mechanisms and stuff, and so, I don't think this is the only place it is necessarily, but, but it's enough.

So, um, yeah. Um, earlier in your talk, you talked a bit about the polysemanticity of neurons. Yeah. So, for example, if you do one of these interventions, like, the preeclampsia one, right, um, and let's say I deploy the model for some other task, would you expect, like, um, like, what is the distribution of things that this will also hit, if, if you can, um, sort of give some sense of that? Yeah, um, that's a great question.

We didn't dig into that as much here. We did a little more with our last paper. Um, you know, I think that you push too hard and the model goes off the rails, like, in a big way. Um, I, you know, here, this is less for the purpose of shaping model behavior and more for the purpose of validating our hypothesis in this one example.

Um, so we, we were not like, and now let's take a model, we've deleted this everywhere. It's like, no, the model's going through answering this question. Right here, we're turning off this part of its brain, you know, and then we're seeing what happens. Couple, uh, questions on Zoom. One is, uh, in practice, when you trace backwards through these features, do you see kind of explosion of relevant features as you go earlier in the model?

Or is it some function of distance from the output? Um, yeah, the question is like, um, as you go backwards through the model, you know, is the number of relevant features go up? Uh, yes. Um, uh, sometimes, you know, sometimes you do get convergence. Um, you know, like the early features, right, might be about, you know, a few symptoms or something, and those are quite important.

Um, and, um, I don't think we've made a good plot to answer that question. I am going to push onwards a little bit, and I'll come back to questions soon. Gotcha. Uh, just one more. How much are these examples carefully selected or crafted? Does this work for most sentences?

Um, you learn something about most things you try. So I'd say, like, 40% of the prompts that we try, we can see some non-trivial part of what's going on. It's not the whole picture, and it, like, sometimes doesn't work. But, um, it also just takes, you know, 60 seconds of user time to kick one of these off.

Um, and then you wait, and then it's done, and it comes back, and then you get to learn something about the model. Rather than having to, like, construct a whole bunch of a priori hypotheses of what's going on. You just sort of get it. Once you've built the machine, you get it for free, but this doesn't tell you everything.

Okay, so, um, here's a question. Like, what language are models thinking in? Um, if there's a, you know, is it a universal language? Is it English? Is it, like, just, like, there's a French Claude and a Chinese Claude inside of Claude? Who gets gated to? This kind of depends on the question.

Um, and so to answer this, we looked at three sentences, which are the same, um, sentence in three languages. So, the opposite of small is big. Le contraire de petit est grand. And I can't speak Mandarin. But, maybe someone here can say, what, what, someone, surely is a Mandarin speaker in this room.

Like, there's no chance. Thank you. I came to-- what's the final character? Oh, no. There you go. I came to recognize it because we were doing these, like, this is part of the paper, like, so many times. But, like, it's the only character I can recognize right now.

Um, so, so this is, like, even more cartoony, right, than the last one. I'll get into the more detailed version later. But, basically, what we see is at the beginning, there are some of these, like, opposites in specific languages, right? Contraire in French, opposite in English. And the quotation, it's like, this is like an open quote in the language I'm speaking, right?

So, the quote that follows will probably be in the same language. But then there's this, like, complex of features about, you know, antonyms in many languages, saying large in many languages, smallness in many languages that go together. And then you spit it back out in the language of interest.

So, the claim is that this is a sort of multilingual core here, where smallness goes with oppositeness to give you largeness. And then largeness plus, you know, this quote is in English, gives you the word large. Largeness plus this quote is in French, gives you, um, gives you grand.

And, um, we did a bunch of these, like, kind of patching experiments. You could say instead of the opposite, you know, you could say a synonym of, um, and just, like, drop that in. So, the same thing in all of them. And now you'll get, um, little, um, or minuscule in French.

Um, and so it's sort of like patching in a different state. You know, you can, you can, you can, it's the same feature you're putting in in these three places, and you get the same change in behavior later. Um, and we sort of looked at, this does have a big effect with scale.

So, you can look, like, how many features overlap between a pair of sentences that are just translations of each other as you move through the model. And at the beginning, it's just the, the, the tokens in that language. There's no overlap. Um, the, you know, uh, Mandarin when it's tokenized has nothing in common with English when it's tokenized.

So, at the beginning and the end, there's nothing. Um, but, uh, as you move through the model, you see a pretty large portion of the components that are active are the same, regardless of the language that it was put in. Um, and so, this is English-Chinese, French-Chinese, and English-French.

Um, this is, like, a random baseline if the sentences are unrelated. So, it's not just, like, the middle's more common. But when you compare pairs that are translations, this is in an 18-layer model, and this is in our, like, small production model, Haiku. Um, so, this generalization is kind of increasing with scale.
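The quantity on that plot is roughly a per-layer overlap of active feature sets; a toy version (my own simplification, the paper's metric may differ):

```python
def layerwise_overlap(active_features_a, active_features_b):
    """Intersection-over-union of the active feature sets at each layer,
    for two prompts that are translations of each other."""
    overlaps = []
    for layer_a, layer_b in zip(active_features_a, active_features_b):
        union = layer_a | layer_b           # each is a set of active feature ids
        overlaps.append(len(layer_a & layer_b) / len(union) if union else 0.0)
    return overlaps
```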

Yes. So, it's almost like it can kind of interpret concepts in, like, a multilingual way. Yes. Have you ever tried, um, doing problems that are metaphors that are kind of, like, specific to that language that don't really have translation? Um, no. If you have some examples of that, I'd love to try it.

Um, I think that would be fun. Yeah. Yes. Yes. I would be happy for collaboration. Yes. So, would you say this kind of implies that in the center of the network is the most abstract representation of any given concept? Yeah. Because-- Yeah. I think I would basically agree with this plot, which is, like, a little to the right of the center.

Yeah. Yeah. At which point, you start to have to figure out what to do with it. Because in the end, the model has to say a thing. And then have you seen if you have a harder time finding monosemantic features in the middle of the network where it's very abstract rather than towards the end where it's very concrete?

It's, if anything, it's actually the opposite. Um, because the-- like, if you'll permit me to be philosophical, like, the point of a good abstraction is that it applies in many situations. Um, and so, um, in the middle there should actually be these, like, common abstractions. It's, like, dealing with the very particulars of this phrasing or this grammatical scenario that is quite bespoke.

And so, you would need way more features to unpack it. Okay. Yeah. Oh. One more question. Uh, yes. Um, is the same, like, operation stored redundantly across multiple layers? Like, do you find that, uh, like, for example, if you need five reasoning steps to-- to get something done for-- like, like, in one example, but then in another example, like, you need eight reasoning steps to get something done.

Um, do you redundantly have to store the, like, follow-up operation on both of those layers? Yeah. I think this is, like, a very deep question. Um, so it's repeated for the audience. It's, like, do we see redundancy of the same operation in many places? And, of course, this kind of-- it's, like, a thing people complain about with these models.

Um, right? Which is, like, well, if it knows A and it knows B, why can't it do them in a row in its head? And it, like, might just literally be that A happens after B in its head within a forward pass. And so unless it gets to, like, think out loud, it literally can't compose those operations.

Um, and you can see this quite easily, you know, if you just ask, you know, like, who is the-- what's the birthday of the father of the star of the movie from 1921? You know, like, it might be able to do each of those, but, like, it can't actually do all those lookups consecutively.

That's, like, the first thing. So there is-- there has to be redundancy. And then we do see this. It was one of the reasons for the cross-coder setup is to try to zip up some of that redundancy. Um, I think one of my-- one of my favorite plots is not from the so-called biology paper, but is from the, um, sister paper, uh, here.

Uh, let's find Copenhagen. Okay. So, like, on the left is if you do sort of-- you do this decomposition per layer. And on the right is you do the cross-layer thing. And, like, basically, you know, it's just, like, bouncing back around. You know, it's, like, the Copenhagen is, like, propagated.

It's talking to Denmark. And it just, like, happens in, like, many, many places in the model. They all perform, you know, like, small improvements to it. Um, you know, there's another perspective here, which is, like, the, like, neural ODE, like, gradient flow perspective, where it's, like, all tiny adjustments in the same direction all the time.

And I think the real models are somewhere in between. Okay. With the overlap of the features here, as the models get larger, you found, like, the change, like, data set varies significantly, or also, like, ordering in parts of the data set. Maybe you have a very different representation? Yeah.

We haven't done systematic studies of, like, you know, data set ordering effects. Yeah, I guess, you're talking about, like, when the machinery fails. You mentioned that sometimes, like, the plots are, like, not interpretable. I guess, I'm curious if you've run into the other case where, like, the attribution graphs look really compelling, but the interventions aren't compelling.

Like, we've maybe been lucky. You could have a plot where it's, like, how compelling was the hypothesis from the attribution graph? And then, like, how well did it work when you intervened? And the best ones did work. But there are some small failure cases we talk about. Okay. I want to dive into the parallel motif because I think it's actually super interesting.

And it's a unique feature of the transformer architecture, right? It's massively parallel. So to give a really simple example of this, imagine that you want to add 100 numbers. So the easiest way to do it is you start with one. You add the next. You add the next to the sum.

You add the next to the sum. And you do 100 serial steps. To do that with a transformer, it would need to be 100 layers deep. But there's another thing you can do, which is you could add pairs consecutively in one layer. And then add the pairs of pairs consecutively in the next.

And then the pairs of those pairs in the next. And so in log n depth, you could add up the 100 numbers. And, you know, given the depth constraints here and the sophistication of what we ask these models to do, you know, it makes sense that they would attempt to do many things at once, pre-compute things you might need, kind of slap them all together.
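To make the depth argument concrete, here's the pairwise trick in code:

```python
def tree_sum(numbers):
    """Add n numbers in about log2(n) parallel 'layers' by summing adjacent
    pairs at every step, instead of n serial additions."""
    layer = list(numbers)
    while len(layer) > 1:
        if len(layer) % 2:          # odd count: pad so the last element pairs with zero
            layer.append(0)
        layer = [layer[i] + layer[i + 1] for i in range(0, len(layer), 2)]
    return layer[0]

assert tree_sum(range(1, 101)) == 5050   # 100 numbers in 7 pairwise rounds
```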

And I'll give you a few examples of this. So if you ask the model to add 36 and 59, it will say 95. If you ask it how it did that, it will say it used the standard carrying algorithm, which is not true. What happens is somewhat more like this.

So first, it parses out each number into, like, there's some component where it's like literally 59, but there's also something for all numbers ending in 9. And something which is like that number's rough range, and the same up there. And then you kind of have two streams. One where it's getting the last digit, right?

And then another on top where it's getting the magnitude, right? And even inside the magnitude, there's sort of like a narrow band magnitude and a really wide band magnitude. And then those give you a sort of medium band. And then if the sum is in this range and it ends in a 5, then it's actually 95.

That narrows it down and then it gives you the answer, which is cool. It's not how I would do it. But then again, it wasn't trained by like a teacher being like, here's something to do. It just got like whacked every time it got it wrong. Or like rewarded every time it got it right, right?

In training. And, you know, I don't think we're going to have this next in here. So I want to show you one of my favorite things from the addition section. Which is like, OK, so, you know, there's a feature in here. OK, like if anybody here like, I feel like this is like a word cell shape rotator test.

So for the shape rotators who like kind of mathematical thinking, like I love this section. So this is like, these are the graphs we make to visualize the arithmetic prompts. This is like, is it active on the prompt A plus B for A and B in 1 to 100?

That's a grid. And so vertical lines mean like when the second operand is in a range, it's active. You know, these dots are like, is it a 6 and a 9? You know, so there's grids. There's bands. You know, a band here is a line x plus y equals a constant.

Right? And so those are where the sum is. So we were looking at this to sort of figure out what these did. But this feature here I really like. So everything in the graph you can hover over, which is kind of neat, and see the feature. And we looked at cases in the data set when this thing was active.

OK? So on this narrow domain, it's active when things ending in 6 get added to things ending in 9. But on the data set it's active in like all these other cases. So this is like, finest order in fragments, federal proceedings, volume 35. It's like, OK. So in some sense, that has to be a 9 plus a 6.

This is the volume 35, supposedly, if this interpretation is correct. Here's just like a list of numbers. There's more journals. There's like these coordinates. And so the claim here, if our method is working, is that there's one component that in this context means ends in 6 plus ends in 9.

But it's also active in these. So this is really working if secretly every one of these examples is the model adding a 6 to a 9. And it's going to reuse the module for doing that across those examples. And so we dug in. And I couldn't really understand these.

So here's one example where this is the token where that feature was active. And I just put it in Claude. And I was like, what is this? And it's like, ah, this is a table of astronomical measurements. And it's split it out in like a nicely formatted table. And the first two columns are the start and end time of an observation period.

And this is the minutes of an end time that it's predicting. And if you read down the table, the start to end interval is like 38 minutes, 37 minutes. But it creeps up over the course of the experiment to be like just under 39 minutes. And this measurement interval started at a minute with a 6.

And the 6 plus the 9 equals ending in a 5. And so the model was just like gutting out next token predictions for like arbitrary sequences of data it was trained on. And it learned in that, of course, to recognize the context of what it's supposed to do. But then it needs a bit where it has like the arithmetic table.

Right? Like 6 plus 9, you just got to look it up. So it has that somewhere. And it's using that same lookup in this like very different context where it needs to add those things. This was another one where it turned out this was a table. And it's predicting this amount.

And these are arithmetic sequences. This is, I guess, the total cost that's going up. And so the amount that it's going up by, you know, is what it's computing there. That's, like, 9,000 plus 26,000 is 35,000. This was maybe my favorite, where, like, why is it firing here to predict the year?

And the answer is because it's volume 36. This journal was founded in-- the first edition was in 1960. So the 0th would have been in 1959. 59 plus 36 is 95. And so it's using the same little bit to do the addition there. Right? And so I think when we talk about generalization, like, you know, and the abstraction, this was for me like a pretty moving example where I was like, okay, it did learn this little thing.

But then it's using it all over the place. Okay, that's maybe like not the most like mission critical thing in the world. So let's talk about hallucinations. So models, they're great. They'll always answer your question. But sometimes they're wrong. And that is just because of like pre-training. Like they're meant to just predict a plausible next thing.

Okay? Great. Like it should say something. If it knows nothing, it should just give a name. If it knows like anything, it should give like, you know, a name in the correct language. Right? If it knows more, maybe like a common name from the era or like just some basketball player or whatever.

Right? And so that's what it's trained to do. And then you're like, you go to fine tuning and you're like, no, like I want you to be an assistant character. Not just a generic simulator. And when you simulate the assistant, I want the assistant to say, I don't know.

Like when the base model certainty is somehow, you know, low. And that's like a big switch to try to make. And so we were curious, like how does that happen? Right? How does this like, you know, refusal to speculate get fine tuned in? And then when, why does that fail?

And so here's sort of two prompts that get at this. One is, what sport does Michael Jordan play? Answer in one word. It says basketball. The other is, what sport does Michael Batkin play? Which is just the person we made up. Answer in one word. And it says, I apologize.

I can't find a definitive record of a sports figure named Michael Batkin. Okay. So here, these graphs are a little bit different. They've got suppressive edges highlighted here. You know, this is common in neuroscience, like inhibition. And we've drawn some features that aren't active on this prompt, but are on this one and vice versa.

So if they're in gray, it means it's inactive here. But part of the reason it's inactive in particular is it's being suppressed by something that is active. And what we found was sort of this cluster of four features in the middle. You know, Michael Jordan. And then we have, like, a feature that's, like, active when the model recognizes, you know, that it knows the answer to a question.

There's another feature for, like, unknown names. And then there's a generic feature for, like, I can't answer that. And that generic can't-answer feature is just fueled by the assistant feature, which is always on when the model's answering. So there's just, like, a default I-don't-know for any question.

And then that gets downmodulated when it recalls something about a person. So when Michael Jordan is there, that suppresses the unknown-name feature and boosts the known-answer feature, and both of those suppress the can't-answer feature. And that leaves room for the path of actually recalling the answer to come through, and it can say basketball.
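Here's a cartoon of that gating structure in Python, with made-up feature names and weights chosen only to illustrate "refusal on by default, suppressed by recognition"; the real circuit is learned and distributed, not this clean.

```python
# Cartoon of the default-refusal gating, with invented weights.
# In the real model, the known-entity feature also suppresses "unknown name"
# and boosts a "known answer" feature; this collapses that into one toy score.

def cant_answer_drive(assistant_on: float, known_entity: float, unknown_name: float) -> float:
    # Fed by the always-on assistant feature, suppressed by recognizing the
    # entity, boosted by the unknown-name feature.
    return 1.0 * assistant_on - 1.5 * known_entity + 1.0 * unknown_name

# "What sport does Michael Jordan play?" -> entity recognized
print(cant_answer_drive(1.0, known_entity=1.0, unknown_name=0.0))  # -0.5: refusal loses, "basketball" gets through

# "What sport does Michael Batkin play?" -> made-up name
print(cant_answer_drive(1.0, known_entity=0.0, unknown_name=1.0))  # 2.0: refusal wins, "I can't find a record..."
```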

So that's a cool strategy. But getting back to model depth, there's an interesting problem, which is that, like, it might take a while for the model to come up with an answer. But it also would have to, like, at some point decide to refuse and, like, get that going.

And those are happening in parallel. So you could have a little bit of a mismatch where, like, it has to decide now whether or not it's going to answer, but it hasn't yet done all that it can do to get a good answer.

Right? And so you can get some divergence for very hard questions, for example, that it still might know the answer to. It has to be like, okay, do I think I'm going to get there? And that's, like, a little bit of a tricky knot: it can't fully self-reflect on the answer before saying it.

And so, okay, this is just the intervention: you juice the known-answer feature, and it will hallucinate that Michael Batkin plays chess. So this is a fun one. If you ask for a paper by Andrej Karpathy, formerly of Stanford, it gives a very famous paper that he didn't write.

Why is that? Well, part of it is like, trust me, I've heard of Andrej Karpathy, like, from the name. But then there's the part which is trying to, like, recall the paper. And then it gives that answer, right, and actually says it. And then, you know, I think it's covered up here.

But then, like, if you're like, are you sure? It's like, no, I don't really think he wrote that. Because at that point, then the model as an input gets both the person and the paper and can, like, do calculation earlier in the network again. Now we can juice this.

We can suppress the known answer bit. And eventually it will apologize and refuse. There's a fun one in the paper where the model hasn't heard of me. I didn't write that section. Jack wrote that section. And it refuses to speculate about papers I've written. And then if you turn off the, like, unknown entity and you, like, give the known answer, then it says I'm famous for inventing the Batson principle, which I hope to one day do.
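Mechanically, these interventions are all the same move: add or subtract a feature's direction in the residual stream during the forward pass. Here's a minimal sketch of that kind of steering, assuming you already have a hook into the residual stream and a direction for the feature; the function and variable names here are placeholders, not Anthropic's actual tooling.

```python
import torch

def steer(residual: torch.Tensor, feature_dir: torch.Tensor, scale: float) -> torch.Tensor:
    """Push the residual stream along a feature direction.

    residual:    [batch, seq, d_model] activations at some layer
    feature_dir: [d_model] direction associated with the feature
    scale:       positive to "juice" the feature, negative to suppress it
    """
    return residual + scale * feature_dir / feature_dir.norm()

# e.g. suppress a hypothetical "known answer" direction on the Karpathy prompt:
#   resid = steer(resid, known_answer_dir, scale=-8.0)
# and the model eventually apologizes and refuses instead of making up a paper.
```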

OK. So there's a lot more we could talk about here. I'm just going to, like, speed-run these for vibes, and you can read the paper where we do a lot more. There's jailbreaks, where we try to understand how they work. And, you know, some of it is, like, you get the model to say something without yet recognizing what it's saying.

And then once it's said it, it's kind of, like, on that track. And it has to balance being coherent verbally with, like, I shouldn't say that. And it takes a while for it to cut itself off. And we find that if we suppress punctuation, which would be an appropriate grammatical time to cut yourself off, you can get it to, like, keep doing more of the jailbreak.

And that's a competing-mechanism thing, right? There's a part which is recognizing, what am I talking about? What should I do? And then there's a part which is completing it. And they're fighting for who's going to win. OK. Do you know if, like, in the process, the model really, like, plans ahead before it gets to the end of the example?

Cool. Let's talk about planning. Yeah. OK. So I think people might be very disappointed if I didn't talk about this one. So this is a poem, a rhyming couplet written by Claude. He saw a carrot and had to grab it. Hunger was like a starving rabbit. It's, like, kind of good.

So how does it do this? Right? It's, like, kind of tricky, right? Because to write a rhyming thing, you better end with a word that rhymes, right? But you also need to, like, have it kind of make semantic sense. And if you wait to the very end, you can back yourself into a corner where there's no next word that would be, like, metrically correct and rhyme that would make sense.

You kind of, you know, logically should be thinking ahead a little bit of, like, where you're trying to go. And we do see this. So actually, he saw a carrot and had to grab it. New line. And so on that new line token, it's actually-- there's a feature for, like, you know, things rhyming with it.

So this is, like, after words ending in it or eat in poems. And those feed into rabbit and habit features. And then the rabbit feature is sort of being used to get starving and then ultimately rabbit. And we can suppress these. So if we suppress the rhyming with it thing, we get blabber, grabber, salad bar.

And that's because the "ab" sound is still there, so it'll just, like, rhyme with the "ab" part. If we inject green, you know, it will now write a line ending with, or rhyming with, green. Sometimes it literally ends with it. If we put in a rhyme with a different thing, it'll sort of go with that.

I think-- I don't know if we have it here. But if you just literally suppress the rabbit feature, it will make something ending in habit, which still rhymes. And this was pretty neat. This is like the smoking gun. This is like, oh, OK. Like, literally, here's a model component. When we look at the dataset examples where this is active, it's literal instances of the words rabbit and bunny.
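That "look at the dataset examples where the feature is active" step is the standard way to read off what a feature means: score a corpus with the feature and inspect the top-activating contexts. A rough sketch, where `feature_activations` is a hypothetical helper and not a real API:

```python
# Rough sketch of pulling top-activating dataset examples for one feature.
# `feature_activations(text)` is a hypothetical helper returning a list of
# (token, activation) pairs for that feature over the tokenized text.

def top_activating_examples(texts, feature_activations, k=20):
    scored = []
    for text in texts:
        token, act = max(feature_activations(text), key=lambda pair: pair[1])
        scored.append((act, token, text))
    scored.sort(key=lambda triple: triple[0], reverse=True)
    return scored[:k]

# For the planning feature here, the top contexts turn out to be literal
# mentions of rabbits and bunnies -- which is how you know what it represents.
```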

On a forward pass on this prompt, that feature is active on the newline at the end of the first line, and the model writes a line ending in rabbit. And if we turn this off, then it, like, doesn't do that anymore. So it's, like, very definitely thinking about, like, that is a place to take this.

And then that influences the line that's coming out. And so that's a place where, even though it's saying one token at a time, it has done some planning in some sense. Like, here's a target destination, and then it's writing something to get there. There are equivalent things elsewhere. There's an incredible thing with unfaithfulness.

I'll just say sometimes the model is lying to you. And if you look at how it got to its answer, you can tell. In this case, it's using a hint and working backwards from the hint so that its math answer will agree with you. And you can tell because you can literally see it taking your hint, which is the number four, and working backwards, dividing by five, to give you a 0.8.

So when you multiply by five, you will get four and it will agree with you, which is not what you want. What you would want is something like this, where it is only using information from the question to give you the answer to the question. But if you just look at the explanations, they look the same.
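Written out as an equation (the numbers here are from this example as described; the exact expression in the paper may differ slightly), the unfaithful trace runs the last step in reverse: it wants "multiply by 5" to land on your hinted 4, so it back-solves the intermediate value rather than computing it,

\[
5x = 4 \;\Rightarrow\; x = \tfrac{4}{5} = 0.8,
\]

whereas the faithful version would compute \(x = \cos(\text{the big number})\) first and only then multiply by 5, whatever that gives.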

They look like it's doing math, right? And so here there's a competing thing of, like, should I use the hint, which would have made sense in pre-training; it'd let you guess the answer better, and you're rewarded for predicting the next token. Well, so maybe the human's right and you should use the hint.

Or should I actually do the math? And these are competing strategies. They're happening kind of at the same time. On the right, you know, this one wins. One question. So what makes one of them win? How does that work? Yeah. Because they're both available. Yes. How does it come to, like, pick this one when, obviously, there's a time constraint?

Yes. Is there some incentive or motivation driving that? I think, like, that's the question. So in this paper, we were able to say which strategies were used when it got to this answer. But the why, I don't think we've really nailed down.

You know, I think that, to some extent, it could be we could look at some of this more carefully. I think in the hallucination case, we had a bit of a hint, right? It was, like, recognizing the entity. So I'm going to do the refusal thing or let it through.

Here, though, my strong suspicion is that, like, it's just doing both. But in a case where it's more confident in one answer, that shouts louder. So it doesn't know what the cosine of this is. So all that's left is following the hint. But I think a big caveat is that, like, we're not modeling attention at all.

And attention is very good at selecting, right? You've got this QK gating, right? That's, like, a bilinear thing. You could really pick with that. And we're not modeling how those choices are made. So I wouldn't be surprised if, for a lot of these, attention is crucially involved in choosing what strategy to use while the MLPs are heavily involved in executing on those strategies.
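For reference, "bilinear QK gating" refers to the form of the standard attention logit: each score is bilinear in the query-side and key-side residual vectors,

\[
\mathrm{score}_{ij} \;=\; \frac{(x_i W_Q)\,(x_j W_K)^{\top}}{\sqrt{d_k}}, \qquad A_{ij} \;=\; \operatorname{softmax}_j\!\big(\mathrm{score}_{ij}\big),
\]

which is what makes attention so good at sharply selecting one source position over another.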

And in that case, we would be totally blind to what's going on here. I mean, that's, like, a billion-dollar question. If I literally knew the answer to that, and it worked really well, I couldn't tell you. I would just, like, go make Claude the best model in the world because it's always accurate.

But I don't know the answer, so I can speculate. I mean, I think it is, in some sense, an impossible problem for exactly this reason. And so you could try to train better and have the models be better calibrated on this self-knowledge. You could, you know-- I think with the thinking tags, basically, with, like, the reasoning models, there's a straightforward way where you do let the model check things.

And I think models are much better on reflection than they are on the forward pass, because one forward pass is just limited for, like, these physical reasons. I think as an effective strategy, that might be more the way to go than, like, not allowing speculation at all, because, like, how do you keep the creativity?

And another possibility is you could make the model dumber somehow. So maybe you could make a model which doesn't hallucinate but is just dumber, because it uses a bunch more capacity just for, like, checking itself in the forward pass. And it could be that when models are smart enough, people might take that trade-off.

Yeah, I think that's one of the sins here. It's possible that, like, with a recurrent thing, you could just give it a few more loops to check stuff. I mean, if you had fully adaptive compute, you could just have it go until it reaches some level of confidence, right, and get variable compute per token, and then sort of bail if it doesn't.

I think there is a tricky thing about hallucination, which is that people think of it as being, like, a well-defined thing. But, you know, the model is producing, like, reams of text. Like, which word is the one that went wrong, you know? And there are some very factual questions where that's clear, but I think if you think more generally about what would make a given token a hallucination, it's, like, a little bit less clear.

I want to just see if there's anything-- yeah, OK. So there's nothing else here other than, like, if you want more, read the stuff. And I guess this probably ends formally in, like, two or three minutes. Is that right? Yeah. So I will just be done now. And then people can clap and people can leave if they want, but I will just stay for questions for a while.

And I'm happy to, like, continue doing questions as long as people want. So thank you. Thank you.