Decoding the Decoder LLM without the code: Ishan Anand

Because if you came to this conference, or to AI engineering in general, you're probably interested in how machine learning models actually work under the hood. Because today, we're all going to be AI brain surgeons. And our patient will be none other than GPT-2.

This Excel spreadsheet implements all of GPT-2 small entirely in standard spreadsheet functions. In theory, you can understand GPT-2 just by going tab by tab, function by function, through this spreadsheet. In practice, that would take a while, because there are over 150 tabs and over 124 million cells, one for every single parameter in GPT-2 small.
So we'll do three things today in our little med school. First, we'll study the anatomy of our patient, how he's put together. Then we're going to put him through a virtual MRI to see how he thinks. And then finally, we're going to change his thinking with a little AI brain surgery.
Let's start with the anatomy. You probably already know that large language models are trained to complete sentences, to fill in the blank of phrases like this one. And as a human, you might reasonably guess "quickly." Well, here's a fill in the blank that computers are very good at. In fact, you can make it very complex, and they do it very well.
In order to do that, we take our whole sentence or phrase and break it into subword units called tokens. Then we map each of those tokens onto numbers called embeddings. I've shown it for simplicity here as a single number, but the embedding for each token is many, many numbers. And then, instead of the simple arithmetic shown here, we're doing the much more complex math of multi-headed attention and the multilayer perceptron, which is just another name for a neural network. And finally, instead of getting one precise, exact answer, we're going to interpret the result as a probability distribution, and we turn the numbers back out into tokens, or text.
So this handy chart shows where each of those actions maps to one or more tabs inside our friendly patient spreadsheet.
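Here's a minimal sketch of that same forward pass in Python rather than Excel, using the Hugging Face transformers library. This isn't how the spreadsheet does it, just the same pipeline in code form, predicting greedily at temperature zero:

```python
# Sketch: tokenize a prompt, run GPT-2 small, take the most likely next token.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Mike is quick. He moves"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, seq_len, 50257)

next_id = logits[0, -1].argmax().item()  # highest-probability next token
print(tokenizer.decode([next_id]))
```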
So the first thing we do is we get our prompt, right? And then the spreadsheet will output the next token, after about 30 seconds. So the first step is to split this into tokens. Now, you see that every word here happens to go into a single token, but that's not always the case. In fact, it's not uncommon for a word to become two or more tokens.
Let me zoom this up so you can see it a little better. "Re-injury" is a real word, but "funology" isn't a real word; it's the word "fun" with "ology" put together. Those are the morphemes, as linguists like to call them. So notice how "re-injury" got split up right here. And that's because the algorithm is a little dumb, and it doesn't always map to your native intuition.
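If you want to poke at these splits yourself outside the spreadsheet, here's a small sketch using OpenAI's tiktoken library with the GPT-2 vocabulary:

```python
# Sketch: show how the GPT-2 tokenizer splits words into subword pieces.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["Mike is quick. He moves", "reinjury", "funology"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # each ID back to its text piece
    print(text, "->", pieces)
```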
And then the next step is we have to map each of these tokens onto its embedding. So we have each of our tokens in a separate row. And then right here, starting in column three, are the embedding values; the second row is all the embeddings for "Mike." Now, in the case of GPT-2 small, each embedding is 768 numbers. So that means if we go to column 770, we will see the end of it. And there is the end of our embeddings for "Mike." Each one of these rows, again, is the embedding for one token.
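Here's a sketch of that lookup in NumPy. The shapes are GPT-2 small's real ones, but the weight values and token IDs below are stand-ins, not the actual parameters:

```python
# Sketch: embeddings are just row lookups into two learned weight matrices.
import numpy as np

vocab_size, n_ctx, d_model = 50257, 1024, 768
wte = np.random.randn(vocab_size, d_model)  # token embeddings: one 768-number row per token
wpe = np.random.randn(n_ctx, d_model)       # position embeddings: one row per position

token_ids = [16073, 318, 2068]              # illustrative IDs, not the real ones
x = wte[token_ids] + wpe[: len(token_ids)]  # (3, 768): one row per token,
                                            # like the rows in the sheet
```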
Next come the layers themselves, and each one has two parts: attention, and then the neural network or multilayer perceptron. In the attention phase, basically, the tokens look around at the other tokens next to them. So the token "he" might look at the word "Mike" to find the antecedent for its pronoun. Or "moves" might look at the word "quick," because "quick" actually has multiple meanings. It can mean a body part, like the quick of your fingernail. And in Shakespearean English, it can mean "alive," as in "the quick and the dead." Seeing the word "moves" here helps it disambiguate for the next layer, the perceptron: oh, we're talking about moving in physical space. So maybe the next word is "quickly," or maybe it's "fast," or maybe it's "around," but it's certainly not something about your fingernail.
And then if you go up here (we can't go through all of this in the time we have), this is where you can see how much each token is paying attention to every other token. You'll notice that there's a bunch of zeros up at the top right, and that's because no token is allowed to look forward. And you'll see here that "Mike" is looking at "Mike" 100% of the time. Here is the word "he," or the token "he," I should say: about half of its attention is focused on the antecedent of its pronoun. If I scroll to the right, you'll see a lot more of these patterns. They aren't always as directly interpretable as that, but it gives you a sense of how the attention mechanism works.
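Here's a rough NumPy sketch of one attention head with that causal mask. It's a simplified single-head version, not the spreadsheet's exact formulas; the -inf entries are exactly those zeros in the top right of the sheet:

```python
# Sketch: single-head causal self-attention.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # how much each token "looks at" each other
    mask = np.triu(np.ones_like(scores), k=1)      # strictly upper triangle = future tokens
    scores = np.where(mask == 1, -np.inf, scores)  # no token may look forward
    weights = softmax(scores)                      # rows like the sheet's attention table
    return weights @ v
```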
And then if we scroll further down, we'll see the multilayer perceptron right here. If you know something about neural nets, you know it's just a large combination of multiplications, a matrix multiply. I don't know if you can see this in the back, but there's an MMULT, which is how you do a matrix multiply in Excel. That's basically multiplying the input by its weight matrix. And then here we put it through its activation function on the way to the next prediction.
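Here's a sketch of that perceptron step in NumPy, assuming GPT-2's 768-to-3072-and-back shape and its GELU activation; each matrix multiply below plays the role of one of those MMULTs:

```python
# Sketch: the transformer-block MLP (multilayer perceptron).
import numpy as np

def gelu(x):
    # The tanh approximation of GELU that GPT-2 uses
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_in, b_in, W_out, b_out):  # W_in: (768, 3072), W_out: (3072, 768)
    h = gelu(x @ W_in + b_in)          # expand 4x and apply the activation
    return h @ W_out + b_out           # project back down to 768 numbers
```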
At the very end comes the language head, and this is where we actually reverse the process. We take the last token, unembed it, reversing the embedding process, and look at which vocabulary tokens are closest to that last token's unembedding. We interpret that as a probability distribution. Now, if you're at temperature zero, like we are in this spreadsheet, then you just take the thing with the highest probability. But if your temperature is higher, then you sample from it, possibly with a decoding algorithm like beam search.
So again, I don't know if you can see in the back, but this function here (there we go) is basically taking block 11, the output of the very last block. It's putting it through a step called layer norm. Then we multiply it, another MMULT, by the unembedding matrix. And then to predict the next most likely token, we just go to the next tab. If you can see this function, it's basically taking the MAX of the column you saw in the previous sheet, picking the highest-probability token just like that. We get a token ID, then we look it up in the vocabulary list, and we know what the next likely token is.
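Putting those last steps together, here's a NumPy sketch of the language head. It assumes GPT-2's weight tying, where the unembedding matrix is just the token embedding matrix transposed:

```python
# Sketch: final layer norm, unembed, then greedy pick at temperature zero.
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

def next_token_id(block11_out, g, b, wte, temperature=0.0):
    h = layer_norm(block11_out[-1], g, b)  # only the last token's row matters
    logits = h @ wte.T                     # one score per vocabulary entry
    if temperature == 0.0:
        return int(np.argmax(logits))      # the MAX the sheet is taking
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```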
So that's the forward pass of how GPT-2 works. But how do all those components work together? Let's take our patient and put him through a virtual MRI so we can see how he thinks.
Before we do that, there's something I forgot to mention. Inside every layer, there's an addition operation. What this lets the model do is route information around, completely skipping any part of these layers if it wants to. And so you can reimagine the model as a communication network, a communication stream. The residual stream here has one lane for every one of those tokens, and information is flowing through them like an information superhighway. Attention moves information across the lanes of this highway, and then the perceptron tries to figure out what the likely token is for every single lane. So the layers are really reading and writing information to each other on this communication bus.
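As a sketch, each block reads from and adds back into that stream; attn and mlp here stand in for the attention and perceptron steps we walked through above:

```python
# Sketch: the residual stream. Each block adds its update back in, so
# information can flow straight through if a layer has nothing to say.
def transformer_block(x, attn, mlp, ln1, ln2):
    x = x + attn(ln1(x))  # attention moves info across lanes, then adds back
    x = x + mlp(ln2(x))   # the perceptron refines each lane, then adds back
    return x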
What we can do is a technique called the logit lens. We take the language head we talked about earlier and stick it in between every single layer of the network. So here's an example prompt: "If today is Tuesday, tomorrow is," and the predicted token is "Wednesday." And GPT-2 does this correctly for all seven days. What you see in this chart is that the columns here, from three through nine, are all those lanes of the information superhighway.
And, for example, here at block three, this is the top predicted token at the last token position, and the second most likely word was going to be "still." So let's look for what we know is the right answer, "Wednesday." Over here at block zero, we see "Wednesday." It's at the bottom of the "Tuesday" lane for some reason on that highway. Well, it makes sense it would be close to "Tuesday." And then, over here towards the last few layers, suddenly we see "tomorrow," "forever," "Tuesday," "Friday." It gets "Wednesday," but it's still only the third most likely token. And then, finally, it moves it up to the top position and locks it into place.
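Here's a sketch of the logit lens idea in NumPy pseudocode; residual_after_block stands in for the per-block outputs we just scrolled through in the sheet:

```python
# Sketch: apply the language head after every block, not just the last one,
# to see what the model "believes" at each depth.
import numpy as np

def logit_lens(residual_after_block, ln_f, wte, top_k=5):
    readouts = []
    for resid in residual_after_block:    # one entry per block, 0..11
        logits = ln_f(resid[-1]) @ wte.T  # decode the last token's lane
        top = np.argsort(logits)[::-1][:top_k]
        readouts.append(top)              # top-k token IDs at this depth
    return readouts
```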
Well, a series of researchers basically took this logit lens technique, put it on steroids, and isolated the fact that only four components out of the entire network were responsible for doing this task correctly. What they found was that all you needed was the perceptron from layer zero, attention from layer nine (and actually only one head), the perceptron from layer nine, and then one more attention step. And that's kind of what we saw on the sheet, right? At the top, we saw "Wednesday," and then it disappeared until the later layers pulled it back up in probability towards the end of the process. So it's an example of where you can see the layers acting over a communication bus, jointly forming what they call a circuit to accomplish a task.
We are now out of med school and ready for surgery. So, you may have heard about the pioneering work that Anthropic has done on scaling sparse autoencoders. This gave rise to what was known as Golden Gate Claude. It was a version of Claude that was very obsessed with the Golden Gate Bridge. To some, it felt like it thought it was the Golden Gate Bridge.
Conceptually, here's how this process worked. You have a large language model, and you have this residual stream we talked about earlier. Then you use another AI technique, an autoencoder. You ask it to look at the residual stream and separate it out into interpretable features, and you try to deduce what each feature is. Then you can actually turn each of these features up and down back in the residual stream, in order to amplify or suppress certain concepts.
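In code, that separation step looks something like this sparse-autoencoder sketch; the weights and shapes here are illustrative stand-ins, not a trained SAE:

```python
# Sketch: a sparse autoencoder over the residual stream. It expands each
# activation into many more features, kept mostly zero by the ReLU, then
# reconstructs the original from the decoder rows.
import numpy as np

def sae(x, W_enc, b_enc, W_dec, b_dec):
    features = np.maximum(0, x @ W_enc + b_enc)  # mostly zeros: each nonzero
                                                 # entry is one "feature" firing
    x_hat = features @ W_dec + b_dec             # each decoder row is one
                                                 # feature's direction
    return features, x_hat
```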
It turns out a team of researchers, led by Joseph Bloom, Neel Nanda, and others, are building out sparse autoencoder features for open-source models like GPT-2 small. So here, for example, is layer 2's feature 7650, which corresponds to "Jedi."
And I've taken the vector for that feature (while we wait for Excel to wake up). That first row is essentially what they call the decoder vector corresponding to "Jedi." I've multiplied it by a coefficient and formatted it so that I can inject it right into the residual stream. It's just taking that vector I showed you and adding it into the residual stream.
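The injection itself is about this simple; the coefficient here is a made-up value you'd tune by hand:

```python
# Sketch: scale the feature's decoder vector and add it to every position
# of the residual stream at that layer. Too small does nothing; too large
# breaks the output.
def steer(resid, decoder_vec, coeff=8.0):  # resid: (seq_len, 768)
    return resid + coeff * decoder_vec     # broadcasts across all tokens
```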
Normally, if you ask GPT-2 to complete "Mike pulls out his," you get something like "phone." But if we turn the Jedi steering vector on, I'll give you one guess what he's probably pulling out instead. And this is where you get to witness the 30 seconds it takes to run. While we wait, a couple of notes. First of all, the way Anthropic did their steering was slightly different, but similar in spirit. There are a few other ways to do this kind of steering.
One of those is called representation engineering, where the steering vector is deduced via PCA. And there's another technique called activation steering, where you take the thing you want to amplify, like "Jedi," and run the model on just that token. Then you run it on something you might want to suppress, like, in this case, "phone." And then you create a Jedi-minus-phone vector and inject that into the residual stream.
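Here's a sketch of that contrast idea; get_activation is a hypothetical helper standing in for running the model and reading out one layer's residual stream at that token:

```python
# Sketch: activation steering via a contrast vector. Record the residual
# activation for each word at some layer, subtract, and inject the difference.
def contrast_vector(get_activation, layer=2):
    v_amplify = get_activation("Jedi", layer)    # concept to turn up
    v_suppress = get_activation("phone", layer)  # concept to turn down
    return v_amplify - v_suppress                # add this to the stream

# usage: steered_resid = resid + coeff * contrast_vector(get_activation)
```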
Well, hopefully I've given you a little better insight into how large language models work. But the root message I want to leave you with is that to be a better AI engineer, it helps to understand what's going on under the hood. Partly this is about just knowing your tools, their behavior, and their limitations better. And if you want to understand the latest research, it helps to know how these models work. And then, last but not least, when you communicate with non-technical stakeholders, the more you understand yourself, the more you can clear up misunderstandings.
I'll give you just one example of where this bubbles up, where architecture bubbles up all the way to prompting. This is the prompting guidance for RWKV, which is a different type of model. Compared to the template for normal transformers at the top, the template for an RWKV prompt is different in one interesting way: they recommend you swap the traditional order of instructions and context, because the attention mechanism (or the pseudo-attention mechanism) in RWKV can't look back the same way a regular transformer can. So it's a great example of where model architecture matters all the way up to prompting.
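To make the ordering point concrete, here are two illustrative template skeletons; the wording is hypothetical, not RWKV's official template, and only the ordering is the point:

```python
# Illustrative only: a transformer can attend back to an instruction wherever
# it sits; an RNN-style model like RWKV benefits from seeing the instruction
# before it reads the context, so its state knows what to keep.
transformer_prompt = (
    "{context}\n\n"
    "Question: {instruction}"  # instruction last is fine: attention looks back
)

rwkv_prompt = (
    "Instruction: {instruction}\n\n"
    "{context}"                # instruction first, before the context streams by
)
```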
Here are the references for the research we talked about today. And if you want to learn more, you can go to spreadsheetsareallyouneed.ai, where you can download this spreadsheet and run it on your own device. If you want to see me go through every single step of this spreadsheet, I just launched a course that does exactly that, and the link to it is on that website as well.