Decoding the Decoder LLM without the code: Ishan Anand

Because if you came to this conference, or to AI engineering in general, you're probably interested in how machine learning models actually work under the hood. Because today, we're all going to be AI brain surgeons. And our patient will be none other than GPT-2.

This Excel spreadsheet implements all of GPT-2 small entirely in standard spreadsheet functions. In theory, you can understand GPT-2 just by going tab by tab, function by function, through this spreadsheet. In practice, that would take a while, because there are over 150 tabs and over 124 million cells, one for every single parameter in GPT-2 small.
So we'll do three things today in our little med school. First, we'll study the anatomy of our patient, how he's put together. Then we're going to put him through a virtual MRI to see how he thinks. And then finally, we're going to change his thinking with a little AI brain surgery.
Let's start with the anatomy. You probably already know that large language models are trained to complete sentences, to fill in the blank of phrases like this one. And as a human, you might reasonably guess "quickly." Well, here's a fill in the blank that computers are very good at. In fact, you can make it very complex, and they do it very well.
In order to do that, we take our whole sentence or phrase and break it into subword units called tokens. Then we map each of those tokens onto numbers called embeddings. I've shown it for simplicity here as a single number, but the embedding for each token is many, many numbers. And then, instead of the simple arithmetic shown here, we're doing the much more complex math of multi-headed attention and the multilayer perceptron, which is just another name for a neural network. And finally, instead of getting one precise, exact answer, we're going to interpret the result as a probability distribution, and we turn the numbers back out into tokens, or text.
So this handy chart shows where each of those actions maps to one or more tabs inside our friendly patient spreadsheet.
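Here's a minimal sketch of that same forward pass in Python rather than Excel, using the Hugging Face transformers library. This isn't how the spreadsheet does it, just the same pipeline in code form, predicting greedily at temperature zero:

```python
# Sketch: tokenize a prompt, run GPT-2 small, take the most likely next token.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Mike is quick. He moves"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, seq_len, 50257)

next_id = logits[0, -1].argmax().item()  # highest-probability next token
print(tokenizer.decode([next_id]))
```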
So the first thing we do is we get our prompt, right? And then the spreadsheet will output the next token, after about 30 seconds. So the first step is to split this into tokens. Now, you see that every word here happens to go into a single token, but that's not always the case. In fact, it's not uncommon for a word to become two or more tokens.
Let me zoom this up so you can see it a little better. "Re-injury" is a real word, but "funology" isn't a real word; it's the word "fun" with "ology" put together. Those are the morphemes, as linguists like to call them. So notice how "re-injury" got split up right here. And that's because the algorithm is a little dumb, and it doesn't always map to your native intuition.
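If you want to poke at these splits yourself outside the spreadsheet, here's a small sketch using OpenAI's tiktoken library with the GPT-2 vocabulary:

```python
# Sketch: show how the GPT-2 tokenizer splits words into subword pieces.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["Mike is quick. He moves", "reinjury", "funology"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # each ID back to its text piece
    print(text, "->", pieces)
```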
And then the next step is we have to map each of these tokens onto its embedding. So we have each of our tokens in a separate row. And then right here, starting in column three, are the embedding values; the second row is all the embeddings for "Mike." Now, in the case of GPT-2 small, each embedding is 768 numbers. So that means if we go to column 770, we will see the end of it. And there is the end of our embeddings for "Mike." Each one of these rows, again, is the embedding for one token.
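Here's a sketch of that lookup in NumPy. The shapes are GPT-2 small's real ones, but the weight values and token IDs below are stand-ins, not the actual parameters:

```python
# Sketch: embeddings are just row lookups into two learned weight matrices.
import numpy as np

vocab_size, n_ctx, d_model = 50257, 1024, 768
wte = np.random.randn(vocab_size, d_model)  # token embeddings: one 768-number row per token
wpe = np.random.randn(n_ctx, d_model)       # position embeddings: one row per position

token_ids = [16073, 318, 2068]              # illustrative IDs, not the real ones
x = wte[token_ids] + wpe[: len(token_ids)]  # (3, 768): one row per token,
                                            # like the rows in the sheet
```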
Next come the layers themselves, and each one has two parts: attention, and then the neural network or multilayer perceptron. In the attention phase, basically, the tokens look around at the other tokens next to them. So the token "he" might look at the word "Mike" to find the antecedent for its pronoun. Or "moves" might look at the word "quick," because "quick" actually has multiple meanings. It can mean a body part, like the quick of your fingernail. And in Shakespearean English, it can mean "alive," as in "the quick and the dead." Seeing the word "moves" here helps it disambiguate for the next layer, the perceptron: oh, we're talking about moving in physical space. So maybe the next word is "quickly," or maybe it's "fast," or maybe it's "around," but it's certainly not something about your fingernail.
And then if you go up here (we can't go through all of this in the time we have), this is where you can see how much each token is paying attention to every other token. You'll notice that there's a bunch of zeros up at the top right, and that's because no token is allowed to look forward. And you'll see here that "Mike" is looking at "Mike" 100% of the time. Here is the word "he," or the token "he," I should say: about half of its attention is focused on the antecedent of its pronoun. If I scroll to the right, you'll see a lot more of these patterns. They aren't always as directly interpretable as that, but it gives you a sense of how the attention mechanism works.
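Here's a rough NumPy sketch of one attention head with that causal mask. It's a simplified single-head version, not the spreadsheet's exact formulas; the -inf entries are exactly those zeros in the top right of the sheet:

```python
# Sketch: single-head causal self-attention.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # how much each token "looks at" each other
    mask = np.triu(np.ones_like(scores), k=1)      # strictly upper triangle = future tokens
    scores = np.where(mask == 1, -np.inf, scores)  # no token may look forward
    weights = softmax(scores)                      # rows like the sheet's attention table
    return weights @ v
```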
And then if we scroll further down, we'll see the multilayer perceptron right here. If you know something about neural nets, you know it's just a large combination of multiplications, a matrix multiply. I don't know if you can see this in the back, but there's an MMULT, which is how you do a matrix multiply in Excel. That's basically multiplying the input by its weight matrix. And then here we put it through its activation function on the way to the next prediction.
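Here's a sketch of that perceptron step in NumPy, assuming GPT-2's 768-to-3072-and-back shape and its GELU activation; each matrix multiply below plays the role of one of those MMULTs:

```python
# Sketch: the transformer-block MLP (multilayer perceptron).
import numpy as np

def gelu(x):
    # The tanh approximation of GELU that GPT-2 uses
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_in, b_in, W_out, b_out):  # W_in: (768, 3072), W_out: (3072, 768)
    h = gelu(x @ W_in + b_in)          # expand 4x and apply the activation
    return h @ W_out + b_out           # project back down to 768 numbers
```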
At the very end comes the language head, and this is where we actually reverse the process. We take the last token, unembed it, reversing the embedding process, and look at which vocabulary tokens are closest to that last token's unembedding. We interpret that as a probability distribution. Now, if you're at temperature zero, like we are in this spreadsheet, then you just take the thing with the highest probability. But if your temperature is higher, then you sample from it, possibly with a decoding algorithm like beam search.
So again, I don't know if you can see in the back, but this function here (there we go) is basically taking block 11, the output of the very last block. It's putting it through a step called layer norm. Then we multiply it, another MMULT, by the unembedding matrix. And then to predict the next most likely token, we just go to the next tab. If you can see this function, it's basically taking the MAX of the column you saw in the previous sheet, picking the highest-probability token just like that. We get a token ID, then we look it up in the vocabulary list, and we know what the next likely token is.
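Putting those last steps together, here's a NumPy sketch of the language head. It assumes GPT-2's weight tying, where the unembedding matrix is just the token embedding matrix transposed:

```python
# Sketch: final layer norm, unembed, then greedy pick at temperature zero.
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

def next_token_id(block11_out, g, b, wte, temperature=0.0):
    h = layer_norm(block11_out[-1], g, b)  # only the last token's row matters
    logits = h @ wte.T                     # one score per vocabulary entry
    if temperature == 0.0:
        return int(np.argmax(logits))      # the MAX the sheet is taking
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```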
So that's the forward pass of how GPT-2 works. But how do all those components work together? Let's take our patient and put him through a virtual MRI so we can see how he thinks.
Before we do that, there's something I forgot to mention. Inside every layer, there's an addition operation. What this lets the model do is route information around, completely skipping any part of these layers if it wants to. And so you can reimagine the model as a communication network, a communication stream. The residual stream here has one lane for every one of those tokens, and information is flowing through them like an information superhighway. Attention moves information across the lanes of this highway, and then the perceptron tries to figure out what the likely token is for every single lane. So the layers are really reading and writing information to each other on this communication bus.
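As a sketch, each block reads from and adds back into that stream; attn and mlp here stand in for the attention and perceptron steps we walked through above:

```python
# Sketch: the residual stream. Each block adds its update back in, so
# information can flow straight through if a layer has nothing to say.
def transformer_block(x, attn, mlp, ln1, ln2):
    x = x + attn(ln1(x))  # attention moves info across lanes, then adds back
    x = x + mlp(ln2(x))   # the perceptron refines each lane, then adds back
    return x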
What we can do is a technique called the logit lens. We take the language head we talked about earlier and stick it in between every single layer of the network. So here's an example prompt: "If today is Tuesday, tomorrow is," and the predicted token is "Wednesday." And GPT-2 does this correctly for all seven days. What you see in this chart is that the columns here, from three through nine, are all those lanes of the information superhighway.
And, for example, here at block three, this is the top predicted token at the last token position, and the second most likely word was going to be "still." So let's look for what we know is the right answer, "Wednesday." Over here at block zero, we see "Wednesday." It's at the bottom of the "Tuesday" lane for some reason on that highway. Well, it makes sense it would be close to "Tuesday." And then, over here towards the last few layers, suddenly we see "tomorrow," "forever," "Tuesday," "Friday." It gets "Wednesday," but it's still only the third most likely token. And then, finally, it moves it up to the top position and locks it into place.
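Here's a sketch of the logit lens idea in NumPy pseudocode; residual_after_block stands in for the per-block outputs we just scrolled through in the sheet:

```python
# Sketch: apply the language head after every block, not just the last one,
# to see what the model "believes" at each depth.
import numpy as np

def logit_lens(residual_after_block, ln_f, wte, top_k=5):
    readouts = []
    for resid in residual_after_block:    # one entry per block, 0..11
        logits = ln_f(resid[-1]) @ wte.T  # decode the last token's lane
        top = np.argsort(logits)[::-1][:top_k]
        readouts.append(top)              # top-k token IDs at this depth
    return readouts
```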
Well, a series of researchers basically took this logit lens technique, put it on steroids, and isolated the fact that only four components out of the entire network were responsible for doing this task correctly. What they found was that all you needed was the perceptron from layer zero, attention from layer nine (and actually only one head), the perceptron from layer nine, and then one more attention step. And that's kind of what we saw on the sheet, right? At the top, we saw "Wednesday," and then it disappeared until the later layers pulled it back up in probability towards the end of the process. So it's an example of where you can see the layers acting over a communication bus, jointly forming what they call a circuit to accomplish a task.
We are now out of med school and ready for surgery. So, you may have heard about the pioneering work that Anthropic has done on scaling sparse autoencoders. This gave rise to what was known as Golden Gate Claude. It was a version of Claude that was very obsessed with the Golden Gate Bridge. To some, it felt like it thought it was the Golden Gate Bridge.
Conceptually, here's how this process worked. You have a large language model, and you have this residual stream we talked about earlier. Then you use another AI technique, an autoencoder. You ask it to look at the residual stream and separate it out into interpretable features, and you try to deduce what each feature is. Then you can actually turn each of these features up and down back in the residual stream, in order to amplify or suppress certain concepts.
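In code, that separation step looks something like this sparse-autoencoder sketch; the weights and shapes here are illustrative stand-ins, not a trained SAE:

```python
# Sketch: a sparse autoencoder over the residual stream. It expands each
# activation into many more features, kept mostly zero by the ReLU, then
# reconstructs the original from the decoder rows.
import numpy as np

def sae(x, W_enc, b_enc, W_dec, b_dec):
    features = np.maximum(0, x @ W_enc + b_enc)  # mostly zeros: each nonzero
                                                 # entry is one "feature" firing
    x_hat = features @ W_dec + b_dec             # each decoder row is one
                                                 # feature's direction
    return features, x_hat
```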
It turns out a team of researchers, led by Joseph Bloom, Neel Nanda, and others, are building out sparse autoencoder features for open-source models like GPT-2 small. So here, for example, is layer 2's feature 7650, which corresponds to "Jedi."
And I've taken the vector for that feature (while we wait for Excel to wake up). That first row is essentially what they call the decoder vector corresponding to "Jedi." I've multiplied it by a coefficient and formatted it so that I can inject it right into the residual stream. It's just taking that vector I showed you and adding it into the residual stream.
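The injection itself is about this simple; the coefficient here is a made-up value you'd tune by hand:

```python
# Sketch: scale the feature's decoder vector and add it to every position
# of the residual stream at that layer. Too small does nothing; too large
# breaks the output.
def steer(resid, decoder_vec, coeff=8.0):  # resid: (seq_len, 768)
    return resid + coeff * decoder_vec     # broadcasts across all tokens
```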
Normally, if you ask GPT-2 to complete "Mike pulls out his," you get something like "phone." But if we turn the Jedi steering vector on, I'll give you one guess what he's probably pulling out instead. And this is where you get to witness the 30 seconds it takes to run. While we wait, a couple of notes. First of all, the way Anthropic did their steering was slightly different, but similar in spirit. There are a few other ways to do this kind of steering.
One of those is called representation engineering, where the steering vector is deduced via PCA. And there's another technique called activation steering, where you take the thing you want to amplify, like "Jedi," and run the model on just that token. Then you run it on something you might want to suppress, like, in this case, "phone." And then you create a Jedi-minus-phone vector and inject that into the residual stream.
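Here's a sketch of that contrast idea; get_activation is a hypothetical helper standing in for running the model and reading out one layer's residual stream at that token:

```python
# Sketch: activation steering via a contrast vector. Record the residual
# activation for each word at some layer, subtract, and inject the difference.
def contrast_vector(get_activation, layer=2):
    v_amplify = get_activation("Jedi", layer)    # concept to turn up
    v_suppress = get_activation("phone", layer)  # concept to turn down
    return v_amplify - v_suppress                # add this to the stream

# usage: steered_resid = resid + coeff * contrast_vector(get_activation)
```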
Well, hopefully I've given you a little better insight into how large language models work. But the root message I want to leave you with is that to be a better AI engineer, it helps to understand what's going on under the hood. Partly this is about just knowing your tools, their behavior, and their limitations better. And if you want to understand the latest research, it helps to know how these models work. And then, last but not least, when you communicate with non-technical stakeholders, the more you understand yourself, the more you can clear up misunderstandings.
I'll give you just one example of where this bubbles up, where architecture bubbles up all the way to prompting. This is the prompting guidance for RWKV, which is a different type of model. Compared to the template for normal transformers at the top, the template for an RWKV prompt is different in one interesting way: they recommend you swap the traditional order of instructions and context, because the attention mechanism (or the pseudo-attention mechanism) in RWKV can't look back the same way a regular transformer can. So it's a great example of where model architecture matters all the way up to prompting.
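To make the ordering point concrete, here are two illustrative template skeletons; the wording is hypothetical, not RWKV's official template, and only the ordering is the point:

```python
# Illustrative only: a transformer can attend back to an instruction wherever
# it sits; an RNN-style model like RWKV benefits from seeing the instruction
# before it reads the context, so its state knows what to keep.
transformer_prompt = (
    "{context}\n\n"
    "Question: {instruction}"  # instruction last is fine: attention looks back
)

rwkv_prompt = (
    "Instruction: {instruction}\n\n"
    "{context}"                # instruction first, before the context streams by
)
```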
Here are the references for the research we talked about today. And if you want to learn more, you can go to spreadsheetsareallyouneed.ai, where you can download this spreadsheet and run it on your own device. If you want to see me go through every single step of this spreadsheet, I just launched a course that does exactly that, and the link to it is on that website as well.