
Decoding the Decoder LLM Without the Code: Ishan Anand


Transcript

I hope you're all having a good conference. And I hope you're ready. Because if you came to this conference or the AI engineering field without a machine learning degree, then this is going to be your crash course in how machine learning models actually work under the hood. Let's bring up the slides.

There we go. Thank you. OK, so I'm Ishan, and I'm dressed in scrubs. Because today, we're all going to be AI brain surgeons. And our patient will be none other than GPT-2, an early precursor to ChatGPT. And our operating table will be a table, but it will be a table of numbers.

It will be an Excel spreadsheet. This Excel spreadsheet implements all of GPT-2 small entirely in pure Excel functions. No API calls. No Python. In theory, you can understand GPT-2 just by going tab by tab, function by function, through this spreadsheet. But you'll want to hold on to your VLOOKUPs, because there are over 150 tabs and over 124 million cells, one for every single parameter in GPT-2 small.

I will give you the abbreviated tour. So we'll do three things today in our little med school. First, we'll study the anatomy of our patient, how he's put together. Then we're going to put him through a virtual MRI to see how he thinks. And then finally, we're going to change his thinking with a little AI brain surgery.

OK, let's start with anatomy. You're probably familiar with the concept that large language models are trained to complete sentences, to fill in the blank of phrases like this one. Mike is quick. He moves. And as a human, you might reasonably guess quickly. But how do we get a computer to do that?

Well, here's a fill in the blank that computers are very good at. 2 plus 2 equals 4, right? They're really good at math. In fact, you can make it very complex, and they do it very well. So what we're going to do, in essence, is we're going to take a word problem and turn it into a math problem.

In order to do that, we take our whole sentence or phrase, and we break it into subword units called tokens. And then we map each of those tokens onto numbers called embeddings. And I've shown it for simplicity here as a single number, but the embedding for each token is many, many, many numbers, as we'll see in a bit.

And then instead of the simple arithmetic shown here, we're doing the much more complex math of multi-headed attention and the multilayer perceptron. Multilayer perceptron, just another name for a neural network. And then finally, instead of getting one precise exact answer like you used to get in elementary school, we're going to interpret the result as a probability distribution as to what the next token should be.

So here's our setup. We get input text. We turn that text into tokens. We turn those tokens into numbers. We do some number crunching. And then we reverse the process. We turn the numbers back out into tokens or text. And then you get our next token prediction. So this handy chart shows where each of those actions maps to one or more tabs inside our friendly patient spreadsheet.
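
Here is that same pipeline as a minimal NumPy sketch, just to make the shapes concrete. Everything in it, the function names, the toy sizes, the random weights, and the placeholder token IDs, is an illustrative stand-in for the spreadsheet's tabs, not the real GPT-2 weights or API.

```python
# Toy end-to-end forward pass mirroring the steps above.
# Sizes and weights are stand-ins (real GPT-2 small: vocab 50257, d_model 768, 12 layers).
import numpy as np

vocab_size, d_model, n_layers = 1000, 768, 12
rng = np.random.default_rng(0)

token_embedding = rng.normal(size=(vocab_size, d_model)) * 0.02
position_embedding = rng.normal(size=(1024, d_model)) * 0.02
unembedding = token_embedding.T            # GPT-2 ties these two matrices

def transformer_block(x):
    return x                               # placeholder for attention + perceptron (sketched later)

# 1. Text -> tokens (placeholder IDs standing in for "Mike is quick. He moves")
token_ids = np.array([5, 17, 42, 3, 99])

# 2. Tokens -> numbers
x = token_embedding[token_ids] + position_embedding[:len(token_ids)]

# 3. Number crunching: 12 layers of attention + perceptron
for _ in range(n_layers):
    x = transformer_block(x)

# 4. Numbers -> next-token probabilities, read off the last position
logits = x[-1] @ unembedding
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("predicted next token id:", int(np.argmax(probs)))
```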

Let's take a look. So the first thing we do is get our prompt, right? Here the prompt is, Mike is quick. He moves. And then it will output, after about 30 seconds, since we're running in a spreadsheet-- don't use this in production-- the next predicted token of quickly.

So the first step is to split this into tokens. Now you see that every word here goes into a single token. But that's not always the case. In fact, it's not uncommon for a word to be two or more tokens. Let me give you some examples. So here's another version of the sheet.

Let me zoom this up so you can see it a little better. I've actually put in some fake words. Re-injury is a real word, but funology isn't a real word. But you know what it means, right? Because it's the word fun with ology put together. Those are the morphemes, as linguists like to call them.

And the tokenization algorithm actually is able to recognize that in some cases. Whoa, there we go. Right there. You see funology split into fun and ology. If we zoom that one up. There we go. But it doesn't always work. So notice how re-injury got split up right here.

It got split into rein and jury. And that's because the algorithm is a little dumb. It just picks the most common subword units it finds in its iterations. And it doesn't always map to your native intuition. And so in practice, machine learning experts feel like it's a necessary evil. And then the next step is we have to map each of these tokens to the embeddings.
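
If you want to poke at the same tokenizer outside the spreadsheet, the tiktoken package ships GPT-2's encoding. A small sketch, with the caveat that the exact splits come from the learned merge table, so treat the split in the comment as what I'd expect rather than guaranteed output:

```python
# Inspect how GPT-2's byte-pair-encoding tokenizer splits words.
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for word in ["funology", "reinjury", "quick"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)   # e.g. "funology" should come out as something like ["fun", "ology"]
```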

So let's go back to the original one. And that's in this tab here. So we have each of our tokens in a separate row. And then right here, starting in column three, is where our embeddings begin. So this is the row right here. The second row is all the embeddings for Mike.

Now, in the case of GPT-2 small, the embeddings are 768 numbers. So we're starting in column three. So that means if we go to column 770, we will see the end of them. And so there is the end of our embeddings for Mike. And each one of these rows, again, is the embedding for each token.
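
In code, that whole tab boils down to a row lookup, essentially a VLOOKUP by token ID into one big table. A sketch with random stand-in values in place of the real learned table:

```python
# Tokens -> embeddings is just indexing rows of a (vocab_size x 768) table.
import numpy as np

rng = np.random.default_rng(0)
wte = rng.normal(size=(50257, 768)) * 0.02   # stand-in for GPT-2 small's real embedding table

token_ids = [100, 200, 300, 400, 500]        # placeholder IDs for "Mike is quick. He moves"
embeddings = wte[token_ids]
print(embeddings.shape)                      # (5, 768): 768 numbers per token, columns 3 to 770 in the sheet
```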

OK. Then we get to the layers. This is the heart of the number crunching. So there are two key components. There's attention and then the neural network or multi-layer perceptron. And in the attention phase, basically, the tokens look around at the other tokens next to them to figure out the context in which they sit.

So the token "he" might look at the word "Mike" to find the antecedent for its pronoun. Or "moves" might look at the word "quick" because "quick" actually has multiple meanings. "Quick" can mean movement in physical space. It can mean "smart," as in "quick of wit." It can mean a body part, like the "quick of your fingernail." And in Shakespearean English, it can mean "alive," as in "the quick and the dead." And seeing the word "moves" here helps it disambiguate for the next layer, the perceptron, that, oh, we're talking about moving in physical space.

So maybe it's quickly, or maybe it's fast, or maybe it's around, but it's certainly not something about your fingernail. So let's see where this is all happening. So these are layers. Now, there's 12 of them. So this is block 0 all the way to block 11. Each one's a tab.

And then if you go up here-- we can't go through all of this in the time we have-- but this is one of the attention heads. This is step 7. This is where you can see where each token is paying attention to every other token. And you'll notice that there's a bunch of zeros up at the top right.

And that's because no token is allowed to look forward. They can only look backwards in time. And you'll see here that Mike is looking at Mike 100% of the time. Higher values mean more attention. These are all normalized to one. Here is the word "he," or the token "he," I should say.

And you'll notice 0.48. So about half of its attention is focused on the antecedent of its pronoun. Now, this is just one of many heads. If I scroll to the right, you'll see a lot more. They aren't always as directly interpretable as that. But it gives you a sense of how the attention mechanism works.
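
Here is roughly the computation behind one of those attention-head tabs, as a NumPy sketch with random stand-in weights. The point is the shape of it: scores between every pair of tokens, a mask so nothing attends forward, and a softmax so each row sums to one.

```python
# One attention head: rows of `weights` say how much each token attends to
# every earlier token; the upper-right triangle is zero (no looking ahead).
# Toy values; the real head dimension in GPT-2 small is 64.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 768, 64
x = rng.normal(size=(seq_len, d_model))          # residual stream, one row per token

W_q = rng.normal(size=(d_model, d_head)) * 0.02  # stand-ins for learned weights
W_k = rng.normal(size=(d_model, d_head)) * 0.02
W_v = rng.normal(size=(d_model, d_head)) * 0.02

q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_head)

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                           # no token may attend forward

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row normalizes to 1
print(np.round(weights, 2))

head_output = weights @ v                        # what this head writes back to the stream
```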

And then if we scroll further down, we'll see the multilayer perceptron right here. If you know something about neural nets, you know it's mostly just a big set of multiplications, a matrix multiply. I don't know if you can see this in the back, but there's an MMULT, which is how you do a matrix multiply in Excel.
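
That MMULT is doing the first half of the pattern below. A sketch of the perceptron with random stand-in weights; only the shapes and the multiply-activate-multiply structure are meant to match GPT-2 small:

```python
# The perceptron inside each block: two matrix multiplies (the spreadsheet's
# MMULTs) with a GELU activation in between. Hidden layer is 4x wider.
import numpy as np

def gelu(x):
    # The tanh approximation of GELU that GPT-2 uses.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_hidden, seq_len = 768, 3072, 5

W_in, b_in = rng.normal(size=(d_model, d_hidden)) * 0.02, np.zeros(d_hidden)
W_out, b_out = rng.normal(size=(d_hidden, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))           # output of the attention step (stand-in)
mlp_out = gelu(x @ W_in + b_in) @ W_out + b_out   # multiply, activate, multiply
print(mlp_out.shape)                              # (5, 768): one update per token
```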

And that's basically multiplying it by its weights. And then here we put it through its activation function to get its output. OK. Let's keep going. OK. Next, we have the language head. And this is where we actually reverse the process. So what we do is we take the last token and we unembed it, reversing the embedding process we did before.

And we look at which tokens are closest to the last token's unembedding, and we interpret that as a probability distribution. Now, if you're at temperature zero, like we are in this spreadsheet, then you just take the thing with the highest probability. But if your temperature is higher, then you sample from it, according to some algorithm like beam search.
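
As code, the language head is one layer norm, one more matrix multiply, and then a choice of how to pick from the resulting distribution. A sketch with random stand-in weights and a toy vocabulary; the greedy branch is what the temperature-zero spreadsheet does:

```python
# Unembedding: layer-norm the final block's output, project to vocabulary logits,
# then pick a token. Weights and sizes are stand-ins (real vocab is 50257).
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 768
unembed = rng.normal(size=(d_model, vocab_size)) * 0.02   # tied to the embedding table in real GPT-2

block11_out = rng.normal(size=(5, d_model))     # pretend output of the last block, one row per token
logits = layer_norm(block11_out[-1]) @ unembed  # only the last position predicts the next token

temperature = 0.0
if temperature == 0.0:
    next_token_id = int(np.argmax(logits))      # greedy: the highest logit wins
else:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    next_token_id = int(rng.choice(vocab_size, p=probs))

print("next token id:", next_token_id)          # then looked back up in the vocabulary table
```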

Let's take a look. And we'll go here. So again, I don't know if you can see it in the back, but this function here is basically taking block 11, the output of the very last block, and putting it through a step called layer norm.

Then we multiply it, with another MMULT, by the unembedding matrix. And these are what are known as our logits. And then to predict the next most likely token, we just go to the next tab. And if you can see this function, it's basically taking the max of the column you saw in the previous sheet.

And it's taking the highest probability token just like that. And that's our predicted token. We get a token ID, then we look it up in the matrix, and we know what the next likely token is. OK. So that's the forward pass of how GPT-2 works. But how do all those components work together?

So let's take our patient and put him through a virtual MRI so we can see how he thinks. Before we do that, there's something I forgot to mention. These are called residual connections. Inside every layer, there's an addition operation. And what this lets the model do is it lets it route information around and completely skip any part of these layers, either attention or the perceptron.
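
That addition is the whole trick, so here is the smallest sketch of it I can write. The attention and mlp functions here are zero-output placeholders just to show that a block can pass the stream through untouched; layer norms are omitted.

```python
# A residual block: the stream is only ever *added to*, never replaced.
import numpy as np

def attention(x): return np.zeros_like(x)   # stand-in: a head that writes nothing
def mlp(x):       return np.zeros_like(x)   # stand-in: a perceptron that writes nothing

def block(x):
    x = x + attention(x)   # the residual connection is this "+"
    x = x + mlp(x)
    return x

x = np.ones((5, 768))                 # 5 tokens, 768 numbers each
print(np.allclose(block(x), x))       # True: information skipped the whole layer
```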

And so you can reimagine the model as actually a communication network or a communication stream. So the residual stream here is every one of those tokens. And information is flowing through them like an information superhighway. And what each layer is doing is we've got attention moving information across the lanes of this highway.

And then the perceptron trying to figure out what the likely token is for every single lane of the highway. But there are multiple of these layers. So they're really reading and writing information to each other on this communication bus. What we can do is apply a technique called Logit Lens.

We can take the language head we talked about earlier and stick it in between every single layer of the network, and ask: what was it thinking at that layer? So that's what I've done in this sheet. So I gave it the prompt, "If today is Tuesday, tomorrow is," and the predicted token is Wednesday.
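
Logit Lens is simple enough to sketch in a few lines: reuse the unembedding head on the residual stream after every block instead of only after the last one. Everything here, the toy blocks, sizes, and weights, is a stand-in for the real model:

```python
# Logit Lens in miniature: read out "what the model is thinking" after each block.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_layers = 768, 1000, 12
unembed = rng.normal(size=(d_model, vocab_size)) * 0.02

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def make_block():
    W = rng.normal(size=(d_model, d_model)) * 0.02
    return lambda x: x + x @ W               # toy residual block standing in for attention + MLP

blocks = [make_block() for _ in range(n_layers)]

x = rng.normal(size=(5, d_model))            # stand-in embeddings for "If today is Tuesday, tomorrow is"
for i, block in enumerate(blocks):
    x = block(x)
    interim_logits = layer_norm(x[-1]) @ unembed
    print(f"block {i}: top token id so far = {int(np.argmax(interim_logits))}")
```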

And GPT-2 does this correctly for all seven days. And what you see in this chart is essentially the columns here from three through nine are all those lanes of the information superhighway. And, for example, here at block three, this is the top most predicted token at the last token position.

So it predicted "not." The second most likely word was going to be "still." Then it was going to be "just." These are all wrong. So let's look for what we know is the right answer, Wednesday. So over here at block zero, we see Wednesday. It's at the bottom of the Tuesday stream for some reason on that highway.

Well, it makes sense it would be close to Tuesday. And then it completely disappears. And then, oh, over here towards the last few layers, suddenly we see tomorrow, forever, Tuesday, Friday. It knows we're talking about time. We're talking about days. And it gets Wednesday, but it's still the third most likely token.

And then, finally, it moves it up to the final position, and then it locks it into place. So what's going on here? Well, a series of researchers basically took this logit lens technique, put it on steroids, and isolated the fact that only four components out of the entire network were responsible for doing this correctly over all seven days.

What they found was that all you needed was the perceptron from layer zero; attention from layer nine, and actually only one head of it; the perceptron from layer nine; and then attention from layer ten. And that's kind of what we saw on the sheet, right? At the top, we saw Wednesday, and then it disappeared, until the later layers pulled it back up in probability towards the end of the process.

So it's an example of where you can see each layer acting as a communication bus, trying to jointly figure out and create what they call a circuit to accomplish a task. Okay. We are now out of med school and ready for surgery. So, you may have heard about the pioneering work that Anthropic has done about scaling monosemanticity.

This gave rise to what was known as Golden Gate Claude. It was a version of Claude that was very obsessed with the Golden Gate Bridge. To some, it felt like it thought it was the Golden Gate Bridge. Conceptually, here's how this process worked. You have a large language model, and then you have this residual stream we talked about earlier.

And then you use another AI technique, an autoencoder. This one's a sparse autoencoder. And you ask it to look at the residual stream and separate it out into interpretable features. And you then try and deduce what each feature is. And then you can actually turn up and down each of these features back in the residual stream in order to amplify or suppress certain concepts.
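
Structurally, a sparse autoencoder over the residual stream looks like the sketch below: a wide encoder, a non-negativity constraint, and a decoder whose rows are the feature directions. The weights here are random stand-ins, so the activations won't actually be sparse or interpretable; only trained SAEs have that property.

```python
# Shape of a sparse autoencoder on one residual-stream vector.
# Toy width; real SAEs for GPT-2 small are typically much wider than this.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 768, 2048

W_enc = rng.normal(size=(d_model, n_features)) * 0.02
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model)) * 0.02   # row i = direction for feature i
b_dec = np.zeros(d_model)

def sae(x):
    features = np.maximum(0.0, x @ W_enc + b_enc)   # trained SAEs make this mostly zeros
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

resid = rng.normal(size=d_model)        # one token's slice of the residual stream
features, recon = sae(resid)
print(features.shape, recon.shape)      # (2048,) feature activations, (768,) reconstruction
```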

It turns out a team of researchers, led by Joseph Bloom, Neel Nanda, and others, are building out sparse autoencoder features for open source models like GPT-2 small. So, here, for example, is layer 2's feature 7650. I don't know if you can see it in the back. It's basically everything Jedi.

So, I've gone back to our friendly patient again. And I've taken the vector for that feature while we wait for Excel to wake up. There it is. That first row is essentially what they call the decoder vector corresponding to Jedi. And then I've basically multiplied it by a coefficient. And then I've basically formatted it so that I can inject it right into the residual stream.
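
The steering step itself is just that: decoder vector, times a coefficient, added onto the stream. A sketch with random stand-in numbers; the feature index comes from the talk, but the coefficient and vector values here are purely illustrative:

```python
# Steering: add a scaled copy of one SAE decoder direction into the residual stream.
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 768, 5

jedi_direction = rng.normal(size=d_model)     # stand-in for the layer-2 SAE's decoder row 7650
coefficient = 10.0                            # how hard to push the concept (illustrative value)

resid_before_block2 = rng.normal(size=(seq_len, d_model))    # stream entering block 2
steered = resid_before_block2 + coefficient * jedi_direction # broadcast across every token

# The forward pass then continues from `steered` instead of the original stream.
```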

This is the start of the block. You can see it in steer block 2. It's basically just taking that vector I showed you and adding it into the residual stream. Simple addition. Now we go to our prompt. And normally, if you ask GPT-2, "Mike pulls out his," it says he pulls out his phone. Makes sense.

But if we turn the Jedi steering vector on, I'll give you one guess what he's probably going to pull out. Let's see. Okay. So, now we hit calculate now. And this is where you get to witness the 30 seconds it takes. And while we wait for it to run, a couple notes.

So, first of all, the way Anthropic did their steering was slightly different, but similar in spirit. There are a few other ways to do this kind of steering. One of those is called representation engineering, where the steering vector is deduced via PCA, or principal component analysis. And there's another technique called activation steering, where what you do is you'd take the thing you want to amplify, like Jedi, and you'd run the model on just that token.

And then you'd run it on something you might want to suppress, like in this case, phone. And then you'd create a Jedi minus phone vector and inject that into the residual stream. Okay. There it is. Mike pulls out his lightsaber. There we go. We have done it. Our operation has been a success. We've created the world's first GPT-2 Jedi. Stick that on LMSYS Arena.
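
For completeness, here is what that Jedi-minus-phone recipe looks like in code. The helper that collects activations is hypothetical, named here just for the sketch, and the numbers are random stand-ins:

```python
# Activation steering via a difference of activations ("Jedi" minus "phone").
# `collect_resid` is a hypothetical placeholder, not a real GPT-2 API.
import numpy as np

def collect_resid(prompt, layer):
    # Would run GPT-2 on `prompt` and return the residual-stream vector at `layer`
    # for the last token; random stand-in values here.
    rng = np.random.default_rng(abs(hash((prompt, layer))) % (2**32))
    return rng.normal(size=768)

layer = 2
steering_vector = collect_resid("Jedi", layer) - collect_resid("phone", layer)
scale = 4.0                                   # steering strength (illustrative value)

resid = np.random.default_rng(0).normal(size=(5, 768))   # stream for "Mike pulls out his"
steered = resid + scale * steering_vector                 # amplify Jedi, suppress phone
```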

Okay. Well, hopefully I've given you a little better insight into how large language models work, but also why they work. But the root message I want to leave you with is that to be a better AI engineer, it does help to unlock the black box.

Partly this is about just knowing your tools and their behavior and their limitations better. But also, we're in a very fast-moving field. And if you want to understand the latest research, it helps to know how these work. And then last but not least, when you communicate with non-technical stakeholders, there's very often a perception of magic.

And the more you can clear that up, the more you can clear up misunderstandings. I'll give you just one example of where this bubbles up, where architecture bubbles up to how you use these models. So these are the prompting instructions for RWKV, which is a different type of model. The template for normal transformers is at the top; the template for an RWKV prompt is at the bottom.

And what's interesting is that they recommend you swap the traditional order of instructions and context because the attention mechanism or the pseudo-attention mechanism in RWKV can't look back the same way a regular transformer can. So it's a great example of where model architecture matters all the way up to prompting.

OK. Here are the references for the research we talked about today. And then if you want to learn more, you can go to spreadsheetsareallyouneed.ai. And you can download this spreadsheet and you can run it on your own device. If you want to see me go through every single step of this spreadsheet, I just launched a course on Maven today.

And the link to it is on that website as well. And that's it. Thank you.

We'll see you next time.