
Decoding the Decoder LLM without the code: Ishan Anand



00:00:00.000 | I hope you're all having a good conference.
00:00:15.680 | And I hope you're ready.
00:00:18.680 | Because if you came to this conference or the AI engineering
00:00:24.520 | field without a machine learning degree,
00:00:27.320 | then this is going to be your crash course
00:00:30.200 | in how machine learning models actually work under the hood.
00:00:35.400 | Let's bring up the slides.
00:00:40.920 | There we go.
00:00:41.840 | Thank you.
00:00:42.780 | OK, so I'm Ishan, and I'm dressed in scrubs.
00:00:46.180 | Because today, we're all going to be AI brain surgeons.
00:00:50.220 | And our patient will be none other than GPT-2,
00:00:54.820 | an early precursor to ChatGPT.
00:00:58.660 | And our operating table will be a table,
00:01:01.540 | but it will be a table of numbers.
00:01:03.920 | It will be an Excel spreadsheet.
00:01:06.520 | This Excel spreadsheet implements all of GPT-2 small entirely
00:01:12.540 | in pure Excel functions.
00:01:15.040 | No API calls.
00:01:16.720 | No Python.
00:01:18.380 | In theory, you can understand GPT-2 just by going tab by tab,
00:01:22.840 | function by function, through this spreadsheet.
00:01:25.440 | But you want to hold on to those VLOOKUPs,
00:01:27.520 | because there are over 150 tabs and over 124 million cells,
00:01:32.480 | one for every single parameter in GPT-2 small.
00:01:36.060 | I will give you the abbreviated tour.
00:01:39.980 | So we'll do three things today in our little med school.
00:01:42.780 | First, we'll study the anatomy of our patient, how he's put together.
00:01:46.780 | Then we're going to put him through a virtual MRI to see how he thinks.
00:01:51.580 | And then finally, we're going to change his thinking with a little AI brain surgery.
00:01:57.460 | OK, let's start with anatomy.
00:01:59.620 | You're probably familiar with the concept
00:02:01.780 | that large language models are trained to complete sentences,
00:02:04.640 | to fill in the blank of phrases like this one.
00:02:07.480 | Mike is quick.
00:02:08.220 | He moves.
00:02:09.200 | And as a human, you might reasonably guess "quickly."
00:02:12.280 | But how do we get a computer to do that?
00:02:14.500 | Well, here's a fill in the blank that computers are very good at.
00:02:17.640 | 2 plus 2 equals 4, right?
00:02:19.460 | They're really good at math.
00:02:20.460 | In fact, you can make it very complex, and they do it very well.
00:02:23.940 | So what we're going to do, in essence,
00:02:25.900 | is we're going to take a word problem
00:02:27.860 | and turn it into a math problem.
00:02:29.960 | In order to do that, we take our whole sentence or phrases,
00:02:33.520 | and we break them into subword units called tokens.
00:02:36.760 | And then we map each of those tokens onto numbers called embeddings.
00:02:40.100 | And I've shown it for simplicity here as a single number,
00:02:42.700 | but the embedding for each token is many, many, many numbers,
00:02:45.340 | as we'll see in a bit.
00:02:46.680 | And then instead of the simple arithmetic shown here,
00:02:50.080 | we're doing the much more complex math of multi-headed attention
00:02:53.620 | and the multilayer perceptron.
00:02:55.280 | Multilayer perceptron, just another name for a neural network.
00:02:58.760 | And then finally, instead of getting one precise exact answer
00:03:02.040 | like you used to get in elementary school,
00:03:03.780 | we're going to interpret the result as a probability distribution
00:03:07.180 | as to what the next token should be.
00:03:10.160 | So here's our setup.
00:03:12.500 | We get input text.
00:03:13.640 | We turn that text into tokens.
00:03:15.740 | We turn those tokens into numbers.
00:03:17.880 | We do some number crunching.
00:03:19.780 | And then we reverse the process.
00:03:21.100 | We turn the numbers back out into tokens or text.
00:03:23.580 | And then you get our next token prediction.
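
If it helps to see those five steps outside of Excel, here's a minimal Python sketch of the same pipeline, using the Hugging Face transformers library (my substitution; the talk itself uses no Python at all):

```python
# Sketch of the text -> tokens -> numbers -> crunching -> text loop.
# Assumes the `transformers` and `torch` packages are installed.
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Mike is quick. He moves"
ids = tokenizer(text, return_tensors="pt").input_ids  # text -> tokens -> numbers
with torch.no_grad():
    logits = model(ids).logits                        # the number crunching
next_id = int(logits[0, -1].argmax())                 # most likely next token
print(tokenizer.decode([next_id]))                    # numbers -> token -> text
```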
00:03:26.320 | So this handy chart shows where each of those actions maps to one
00:03:30.880 | or more tabs inside our friendly patient spreadsheet.
00:03:34.320 | Let's take a look.
00:03:35.560 | So the first thing we do is get our prompt, right?
00:03:38.220 | Here the prompt is, Mike is quick.
00:03:40.300 | He moves.
00:03:40.960 | And then it will output, after about 30 seconds,
00:03:43.560 | since we're running in a spreadsheet--
00:03:44.740 | don't use this in production--
00:03:46.240 | the next predicted token: "quickly."
00:03:48.960 | So the first step is to split this into tokens.
00:03:52.520 | Now you see that every word here goes into a single token.
00:03:55.780 | But that's not always the case.
00:03:57.240 | In fact, it's not uncommon for a word to be two or more tokens.
00:04:00.120 | Let me give you some examples.
00:04:02.140 | So here's another version of the sheet.
00:04:03.700 | Let me zoom this up so you can see it a little better.
00:04:06.940 | I've actually put in some fake words.
00:04:08.380 | Re-injury is a real word, but funology isn't a real word.
00:04:11.820 | But you know what it means, right?
00:04:13.120 | Because it's the word "fun" with "ology" put together.
00:04:15.380 | Those are the morphemes, as linguists like to call them.
00:04:18.440 | And the tokenization algorithm actually
00:04:20.640 | is able to recognize that in some cases.
00:04:23.200 | Whoa, there we go.
00:04:24.800 | Right there.
00:04:25.640 | You see "funology" split into "fun" and "ology."
00:04:32.020 | If we zoom that one up.
00:04:34.420 | There we go.
00:04:35.540 | But it doesn't always work.
00:04:36.800 | So notice how re-injury got split up right here.
00:04:39.580 | It's "rein" and "jury."
00:04:41.240 | And that's because the algorithm is a little dumb.
00:04:43.580 | It just picks the most common subword units
00:04:45.500 | it finds in its iterations.
00:04:47.280 | And it doesn't always map to your native intuition.
00:04:49.920 | And so in practice, machine learning experts
00:04:52.220 | feel like it's a necessary evil.
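
A quick way to poke at these splits yourself is the tiktoken library, which ships GPT-2's exact tokenizer (the library is my addition; the splits it prints are whatever GPT-2's vocabulary dictates):

```python
# Sketch: inspect how GPT-2's BPE tokenizer splits words into subword units.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for word in ["funology", "reinjury", "Mike is quick"]:
    ids = enc.encode(word)
    print(word, "->", [enc.decode([i]) for i in ids])
```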
00:04:55.580 | And then the next step is we have to map each of these tokens
00:04:58.600 | to the embeddings.
00:04:59.900 | So let's go back to the original one.
00:05:01.900 | And that's in this tab here.
00:05:03.520 | So we have each of our tokens in a separate row.
00:05:06.680 | And then right here, starting in column three,
00:05:08.840 | is where our embeddings begin.
00:05:09.980 | So this is the row right here.
00:05:11.120 | The second row is all the embeddings for Mike.
00:05:13.960 | Now, in the case of GPT-2 small, the embeddings are 768 numbers.
00:05:18.960 | So we're starting in column three.
00:05:20.580 | So that means if we go to column 770, we will see the end of them.
00:05:24.740 | And so there is the end of our embeddings for Mike.
00:05:31.420 | And each one of these, again, is the embedding for each token.
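
In code terms, the VLOOKUP-style step amounts to indexing a 50,257 x 768 table by token ID -- here's a sketch with random stand-in weights (the real table is learned):

```python
# Sketch of the embedding lookup: one row of 768 numbers per token.
import numpy as np

vocab_size, d_model = 50257, 768              # GPT-2 small's real dimensions
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, d_model))  # stand-in for the learned table

token_ids = [11518, 318, 2068]                # hypothetical IDs; real ones come from the tokenizer
embeddings = wte[token_ids]                   # the "VLOOKUP" step
print(embeddings.shape)                       # (3, 768)
```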
00:05:36.840 | Then we get to the layers.
00:05:38.200 | This is the heart of the number crunching.
00:05:40.360 | So there are two key components.
00:05:41.980 | There's attention and then the neural network or multi-layer perceptron.
00:05:45.140 | And in the attention phase, basically, the tokens look around at the other tokens next to them
00:05:49.920 | to figure out the context in which they sit.
00:05:52.300 | So the token "he" might look at the word "Mike" to find the antecedent for its pronoun.
00:05:58.140 | Or "moves" might look at the word "quick" because "quick" actually has multiple meanings.
00:06:03.400 | "Quick" can mean movement in physical space.
00:06:05.960 | It can mean "smart" as in "quick of wit."
00:06:08.540 | It can mean a body part, like the "quick of your fingernail."
00:06:11.040 | And in Shakespearean English, it can mean "alive," as in "the quick and the dead."
00:06:15.620 | And seeing the word "moves" here helps it disambiguate for the next layer, the perceptron,
00:06:21.620 | that, oh, we're talking about moving in physical space.
00:06:23.620 | So maybe it's quickly, or maybe it's fast, or maybe it's around,
00:06:27.700 | but it's certainly not something about your fingernail.
00:06:29.920 | So let's see where this is all happening.
00:06:32.080 | So these are layers.
00:06:33.080 | Now, there's 12 of them.
00:06:34.080 | So this is block 0 all the way to block 11.
00:06:36.120 | Each one's a tab.
00:06:37.360 | And then if you go up here-- we can't go through all of this in the time we have--
00:06:40.700 | but this is one of the attention heads.
00:06:42.480 | This is step 7.
00:06:43.760 | This is where you can see where each token is paying attention to every other token.
00:06:48.180 | And you'll notice that there's a bunch of zeros up at the top right.
00:06:50.940 | And that's because no token is allowed to look forward.
00:06:53.940 | They can only look backwards in time.
00:06:56.940 | And you'll see here that Mike is looking at Mike 100% of the time.
00:06:59.660 | Higher values mean more attention.
00:07:01.440 | These are all normalized to sum to one.
00:07:03.560 | Here is the word "he," or the token "he," I should say.
00:07:05.940 | And you'll notice 0.48.
00:07:06.940 | So about half of its attention is focused on the antecedent of its pronoun.
00:07:11.940 | Now, this is just one of many heads.
00:07:13.940 | If I scroll to the right, you'll see a lot more.
00:07:15.940 | They aren't always as directly interpretable as that.
00:07:18.940 | But it gives you a sense of how the attention mechanism works.
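
For the curious, here's a minimal sketch of what one of those attention-head tabs computes: scaled dot-product scores with a causal mask, softmaxed so each row sums to one (toy random inputs, not the spreadsheet's actual weights):

```python
# Sketch of one causal attention head's weight table.
import numpy as np

def causal_attention_weights(Q, K):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # how strongly each token queries each other
    mask = np.triu(np.ones_like(scores), k=1)      # 1s above the diagonal mark future tokens
    scores = np.where(mask == 1, -np.inf, scores)  # no token may look forward
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)       # each row normalized to sum to one

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(5, 64)), rng.normal(size=(5, 64))  # 5 tokens, toy head size
W = causal_attention_weights(Q, K)
print(W[0])  # the first token can only attend to itself: [1. 0. 0. 0. 0.]
```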
00:07:20.940 | And then if we scroll further down, we'll see the multilayer perceptron right here.
00:07:25.940 | If you know something about neural nets, you know there's just a large combination of multiplications
00:07:30.940 | or a matrix multiply, and so I don't know if you can see this in the back.
00:07:34.940 | There's an MMULT, which is how you do a matrix multiply in Excel.
00:07:38.940 | And that's basically multiplying the input by its weights.
00:07:40.940 | And then here we put it through its activation function to get the next prediction.
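
Those MMULTs map onto two matrix multiplies with an activation in between -- a sketch with GPT-2 small's real layer shapes but random stand-in weights:

```python
# Sketch of the multilayer perceptron step inside each block.
import numpy as np

def gelu(x):  # GPT-2's activation function (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_hidden = 768, 3072                        # GPT-2 small's real widths
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d_model, d_hidden)) * 0.02   # stand-in weights
W_out = rng.normal(size=(d_hidden, d_model)) * 0.02

h = rng.normal(size=(5, d_model))                    # 5 token positions entering the MLP
out = gelu(h @ W_in) @ W_out                         # MMULT, activation, MMULT
print(out.shape)                                     # (5, 768): back to the residual width
```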
00:07:45.940 | Let's keep going.
00:07:48.940 | Next, we have the language head.
00:07:50.940 | And this is where we actually reverse the process.
00:07:53.940 | So what we do is we take the last token and we unembed it and reverse the embedding process
00:08:01.940 | we did before.
00:08:02.940 | And we look at which tokens are closest to that final token's unembedding.
00:08:10.940 | And we interpret that as a probability distribution.
00:08:12.940 | Now, if you're at temperature zero, like we are in this spreadsheet, then you just take the thing with the highest probability.
00:08:18.940 | But if your temperature is higher, then you sample from that distribution, or use a decoding algorithm like beam search.
00:08:24.940 | Let's take a look.
00:08:26.940 | And we'll go here.
00:08:29.940 | So again, I don't know if you can see in the back, but this function here is basically-- there we go.
00:08:39.940 | This function in the back basically is taking block 11, the output of the very last block.
00:08:43.940 | It's putting it through a step called layer norm.
00:08:45.940 | Then we multiply it, another MMULT, times the unembedding matrix.
00:08:51.940 | And these are what are known as our logits.
00:08:53.940 | And then to predict the next most likely token, we just go to the next tab.
00:08:59.940 | And if you can see this function, it's basically taking the max of the column you saw in the previous sheet.
00:09:05.940 | And it's taking the highest probability token just like that.
00:09:09.940 | And that's our predicted token.
00:09:11.940 | We get a token ID, then we look it up in the matrix, and we know what the next likely token is.
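
Put together, the language head at temperature zero is just: layer-norm the last position, multiply by the embedding table's transpose, take the argmax. A sketch with stand-in values:

```python
# Sketch of the temperature-zero language head.
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

vocab_size, d_model = 50257, 768
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, d_model))  # stand-in for the embedding table
h_last = rng.normal(size=d_model)             # stand-in for the last token's final state

logits = layer_norm(h_last) @ wte.T           # one logit per vocabulary entry
next_token_id = int(np.argmax(logits))        # temperature zero: just take the max
print(next_token_id)                          # look this ID up to get the token text
```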
00:09:16.940 | So that's the forward pass of how GPT-2 works.
00:09:20.940 | But how do all those components work together?
00:09:22.940 | So let's take our patient and put them through a virtual MRI so we can see how he thinks.
00:09:26.940 | Before we do that, there's something I forgot to mention.
00:09:29.940 | These are called residual connections.
00:09:31.940 | Inside every layer, there's an addition operation.
00:09:34.940 | And what this lets the model do is it lets it route information around and completely skip any part of these layers,
00:09:41.940 | either attention or the perceptron.
00:09:43.940 | And so you can reimagine the model as actually a communication network or a communication stream.
00:09:49.940 | So the residual stream here is every one of those tokens.
00:09:52.940 | And information is flowing through them like an information superhighway.
00:09:55.940 | And what each layer is doing is we've got attention moving information across the lanes of this highway.
00:10:01.940 | And then the perceptron trying to figure out what the likely token is for every single lane of the highway.
00:10:07.940 | But there are multiple of these layers.
00:10:08.940 | So they're really reading and writing to each other information in this communication bus.
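
In pseudocode, the residual connections make every block look like this (the sub-blocks are placeholders; the point is the two additions):

```python
# Sketch of one transformer block built around the residual stream.
def transformer_block(x, attention, mlp, ln1, ln2):
    x = x + attention(ln1(x))  # attention moves information across token "lanes"
    x = x + mlp(ln2(x))        # the perceptron rewrites each lane in place
    return x                   # the stream is only ever added to, never overwritten
```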
00:10:13.940 | What we can do is we can do a technique called Logit Lens.
00:10:16.940 | We can take the language head we talked about earlier and stick it in between every single layer of the network.
00:10:21.940 | And ask: what was it thinking at that layer?
00:10:23.940 | So that's what I've done in this sheet.
00:10:28.940 | So I gave it the prompt.
00:10:30.940 | "If today is Tuesday, tomorrow is," and the predicted token is "Wednesday."
00:10:33.940 | And GPT-2 does this correctly for all seven days.
00:10:36.940 | And what you see in this chart is essentially the columns here from three through nine are all those lanes of the information superhighway.
00:10:43.940 | And, for example, here at block three, this is the top most predicted token at the last token position.
00:10:52.940 | So it predicted "not."
00:10:53.940 | The second most likely word was going to be "still."
00:10:56.940 | Then it was going to be "just."
00:10:57.940 | These are all wrong.
00:10:58.940 | So let's look for what we know is the right answer: "Wednesday."
00:11:01.940 | So over here at block zero, we see Wednesday.
00:11:04.940 | It's at the bottom of the Tuesday stream for some reason on that highway.
00:11:07.940 | Well, it makes sense it would be close to Tuesday.
00:11:09.940 | And then it completely disappears.
00:11:11.940 | And then, oh, over here towards the last few layers, suddenly we see "tomorrow," "forever," "Tuesday," "Friday."
00:11:18.940 | It knows we're talking about time.
00:11:19.940 | We're talking about days.
00:11:20.940 | And it gets Wednesday, but it's still the third most likely token.
00:11:23.940 | And then, finally, it moves it up to the final position, and then it locks it into place.
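
The logit lens itself is a few lines once you have the pieces: reuse the final layer norm and unembedding after every block instead of only the last one (all the arguments here are stand-ins for the real model's parts):

```python
# Sketch of the logit lens: decode the residual stream at every depth.
def logit_lens(x, blocks, ln_f, wte, decode):
    for i, block in enumerate(blocks):
        x = block(x)
        logits = ln_f(x[-1]) @ wte.T  # language head applied mid-network
        print(f"after block {i}: top guess = {decode(logits.argmax())}")
    return x
```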
00:11:27.940 | So what's going on here?
00:11:28.940 | Well, a series of researchers basically took this logit lens technique on steroids and found
00:11:36.940 | that only four components out of the entire network were responsible for doing this correctly
00:11:40.940 | over all seven days.
00:11:41.940 | What they showed was that all you needed was the perceptron from layer zero, attention from
00:11:48.940 | layer nine (and actually only one head), the perceptron from layer nine, and then attention
00:11:53.940 | from layer 10.
00:11:54.940 | And that's kind of what we saw on the sheet, right?
00:11:56.940 | At the top, we saw Wednesday, and then it disappeared until the later layers pulled it back up and
00:12:01.940 | up in probability towards the end of the process.
00:12:04.940 | So it's an example of where you can see each layer acting as a communication bus, trying to
00:12:09.940 | jointly figure out and create what they call a circuit to accomplish a task.
00:12:14.940 | Okay.
00:12:15.940 | We are now out of med school and ready for surgery.
00:12:18.940 | So, you may have heard about the pioneering work that Anthropic has done on scaling
00:12:22.940 | monosemanticity.
00:12:23.940 | This gave rise to what was known as Golden Gate Claude.
00:12:26.940 | It was a version of Claude that was very obsessed with the Golden Gate Bridge.
00:12:30.940 | To some, it felt like it thought it was the Golden Gate Bridge.
00:12:34.940 | Conceptually, here's how this process worked.
00:12:36.940 | You have a large language model, and then you have this residual stream we talked about
00:12:41.940 | earlier.
00:12:42.940 | And then you use another AI technique, an autoencoder.
00:12:44.940 | This one's a sparse autoencoder.
00:12:46.940 | And you ask it to look at the residual stream and separate it out into interpretable features.
00:12:51.940 | And you then try and deduce what each feature is.
00:12:55.940 | And then you can actually turn up and down each of these features back in the residual stream
00:12:59.940 | in order to amplify or suppress certain concepts.
00:13:03.940 | It turns out a team of researchers, led by Joseph Bloom, Neel Nanda, and others, is building
00:13:09.940 | out sparse autoencoder features for open source models like GPT-2 small.
00:13:14.940 | So, here, for example, is layer 2's feature 7650.
00:13:19.940 | I don't know if you can see it in the back.
00:13:21.940 | It's basically everything Jedi.
00:13:24.940 | So, we've gone to our friendly patient again.
00:13:28.940 | And I've taken the vector for that feature while we wait for Excel to wake up.
00:13:37.940 | There it is.
00:13:38.940 | That first row is essentially what they call the decoder vector corresponding to Jedi.
00:13:43.940 | And then I've basically multiplied it by a coefficient.
00:13:46.940 | And then I've basically formatted it so that I can inject it right into the residual stream.
00:13:50.940 | This is the start of the block.
00:13:51.940 | You can see that steer block 2.
00:13:54.940 | It's basically just taking that vector I showed you and adding it into the residual stream.
00:13:58.940 | Simple addition.
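
The whole surgical instrument, in other words, is one line of arithmetic -- a sketch, where decoder_vector stands in for the sparse autoencoder's Jedi feature:

```python
# Sketch of the steering edit at the start of one block.
def steer(residual_stream, decoder_vector, coefficient):
    # residual_stream: (num_tokens, 768); decoder_vector: (768,)
    return residual_stream + coefficient * decoder_vector  # simple addition
```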
00:13:59.940 | Now we go to our prompt.
00:14:01.940 | And normally, you ask GPT-2 to complete "Mike pulls out his."
00:14:05.940 | Makes sense.
00:14:06.940 | He pulls out his phone.
00:14:07.940 | But if we turn the Jedi steering vector on, I'll give you one guess what he's probably
00:14:12.940 | going to pull out.
00:14:13.940 | Let's see.
00:14:14.940 | Okay.
00:14:15.940 | So, now we hit calculate now.
00:14:17.940 | And this is where you get to witness the 30 seconds it takes.
00:14:21.940 | And while we wait for it to run, a couple notes.
00:14:24.940 | So, first of all, the way Anthropic did their steering was slightly different, but similar
00:14:28.940 | in spirit.
00:14:29.940 | There's a few other ways to do this kind of steering.
00:14:31.940 | One of those is called representation engineering, where the steering vector is deduced via PCA,
00:14:37.940 | or principal component analysis.
00:14:38.940 | And there's another technique called activation steering, where what you do is you'd take the
00:14:43.940 | thing you want to amplify, like Jedi, and you'd run the model through just on that token.
00:14:48.940 | And then you'd run on something you might want to suppress, like in this case, phone.
00:14:51.940 | And then you'd create a Jedi minus phone vector and inject that into the residual stream.
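
As a sketch, that contrastive variant looks like this (activations_at is a hypothetical hook that returns the residual activations for a prompt at a given layer):

```python
# Sketch of activation steering via a contrast vector.
def contrast_steering_vector(activations_at, layer):
    # amplify "Jedi", suppress "phone", inject the difference into the stream
    return activations_at("Jedi", layer) - activations_at("phone", layer)
```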
00:14:56.940 | Okay.
00:14:57.940 | There it is.
00:14:58.940 | There it is.
00:14:59.940 | Mike pulls out his lightsaber.
00:15:01.940 | There we go.
00:15:02.940 | We have done it.
00:15:04.940 | Our operation has been a success.
00:15:09.940 | We've created the world's first GPT-2 Jedi.
00:15:13.940 | Stick that on LMSYS Arena.
00:15:15.940 | Okay.
00:15:16.940 | Well, hopefully I've given you a little better insight into how large language models work,
00:15:21.940 | but also why they work.
00:15:23.940 | But the root message I want to leave you with is that to be a better AI engineer,
00:15:27.940 | it does help to unlock the black box.
00:15:30.940 | Partly this is about just knowing your tools and their behavior and their limitations better.
00:15:33.940 | But also, we're in a very fast-moving field.
00:15:35.940 | And if you want to understand the latest research, it helps to know how these work.
00:15:38.940 | And then last but not least, when you communicate with non-technical stakeholders,
00:15:43.940 | there's very often a perception of magic.
00:15:45.940 | And the more you can clear that up, the more you can clear up misunderstandings.
00:15:48.940 | I'll give you just one example of where this bubbles up, where architecture bubbles up to
00:15:52.940 | how you use them.
00:15:53.940 | So this is the instructions for RWKV, which is a different type of model.
00:15:58.940 | But the template for normal transformers at the top, the template for an RWKV prompt is
00:16:03.940 | at the bottom.
00:16:04.940 | And what's interesting is that they recommend you swap the traditional order of instructions
00:16:09.940 | and context because the attention mechanism or the pseudo-attention mechanism in RWKV can't
00:16:14.940 | look back the same way a regular transformer can.
00:16:16.940 | So it's a great example of where model architecture matters all the way up to prompting.
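
To make that concrete, here's a hypothetical illustration of the two prompt orders (paraphrased, not quoted from the RWKV docs):

```python
# Sketch: instruction placement for attention vs. RNN-style models.
instruction = "Summarize the passage in one sentence."
context = "(the document you want summarized)"

# Transformer-style: context first works fine, because the instruction
# tokens can attend backwards over the whole context.
transformer_prompt = f"{context}\n\n{instruction}"

# RNN-style (RWKV): put the instruction first, so the model already knows
# what it's looking for while it streams through the context.
rwkv_prompt = f"{instruction}\n\n{context}"
```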
00:16:22.940 | Here are the references for the research we talked about today.
00:16:25.940 | And then if you want to learn more, you can go to spreadsheetsareallyouneed.ai.
00:16:30.940 | And you can download this spreadsheet and you can run it on your own device.
00:16:34.940 | If you want to see me go through every single step of this spreadsheet, I just launched a course
00:16:41.940 | on Maven today.
00:16:42.940 | And the link to it is on that website as well.
00:16:45.940 | And that's it.
00:16:46.940 | Thank you.
00:17:02.940 | We'll see you next time.