Llama 1/2/3/4 by Hand: Prof Tom Yeh

I'm going to show you how we can build deep learning 00:00:23.800 |
architecture, from the Transformer to a lot of newer stuff. 00:00:27.820 |
Or do you also want me to stop to take questions? 00:00:30.320 |
How interactive is this session supposed to be? 00:00:45.540 |
We don't really look that deep into benchmarks. 00:00:52.260 |
We'll try to be interactive as if I'm teaching. 00:01:05.260 |
So you have a lot of things you know better than I do. 00:01:08.400 |
So what I want to share is my live Excel spreadsheet. 00:01:12.140 |
So I've been using this a lot to teach deep learning architecture. 00:01:16.800 |
And so this is sort of my plan: to take you from the vision transformer to Llama 1, 2, 3, 4. 00:01:23.480 |
And as I said, a few of the things being introduced you might have heard of or read about in the papers. 00:01:30.860 |
So we'd like to talk about RoPE, RMSNorm, grouped query attention for Llama 2, flash attention, interleaved attention for Llama 4, and mixture of experts. 00:01:43.040 |
A bit more like high-level, well, I wouldn't say high-level, it's an Excel-level overview of what is happening, some live coding. 00:01:51.720 |
And so I would like you to go to... I actually prepared a link for you, to help you access the spreadsheet very easily. 00:02:02.820 |
I should be a bit more prepared, but I don't know why. 00:02:07.240 |
So can you give me a... some of you might have access to the document already; maybe you can give a thumbs up so I know how many of you do. 00:02:17.240 |
I just shared it in the Zoom chat, by the way. 00:02:20.240 |
I'm going to share it on the Zoom chat right now, so you have direct access. Wait, where's my Zoom chat? 00:02:29.340 |
Okay, here is my Zoom chat. I have another version, but that one has a mailing list subscription gate. 00:02:39.000 |
So this is the non-gated version of the link. You can go there, and you will see the top link is a live version of the same Excel sheet. 00:02:49.760 |
There's also a baseline version that I'm not going to touch today, so you can compare how things evolve. 00:02:56.860 |
So that's my plan. Now, a quick overview of my transformer, the vision transformer architecture, as expressed in Excel. 00:03:05.960 |
If you zoom out, you can see this whole stack: the input stack, normalization, the self-attention stack, the feed-forward, and the output layer with the softmax, linear layer, and loss gradient. 00:03:19.960 |
We're going to go through and explain this, and the reason I picked it is that the transformer stack part is pretty common across modalities. 00:03:29.060 |
So I picked the vision transformer just to give the idea that anything that can convert your input into tokens, you can put into a transformer stack. 00:03:41.160 |
I picked the vision transformer stack; from this point on is where a transformer encoder starts, or a decoder. 00:03:46.260 |
In this case, it's an encoder, but for GPT, it's a decoder. 00:03:49.260 |
Anyway, let's take the first challenge. I hope you can get it right away, because it's not technical; you have seen transformers before, you just might not have seen this format. 00:04:01.260 |
But you'll get it very quickly. So I want to bring your attention to the attention layer here. Let's go back to my plan: from the transformer to Llama 1, 2, 3, 4, a lot of things scale up, just increases in dimensionality. 00:04:23.360 |
So we're going to focus on query and key: if we increase the query and key dimensions, what happens? In this case, let me read this a little bit. If you zoom out, these are tokens, we have 10 tokens here, and this is the embedding dimension, currently five. So this is the input, and to start the attention stack, we multiply it with all these weight 00:04:53.340 |
matrices, and I get query and key here, and then we get attention. It's scaled dot-product attention, and we softmax; that's how you can read it. Per column, the softmax gives a distribution, and then you multiply with your value. That's how attention works, the way I visualize it. So what happens if I want to add a dimension? This is three, the key 00:05:23.320 |
dimension here, you can visualize three, over five tokens. Okay, so far so good. Now, if I want to add one dimension, what I would do live is shift this down here, so this is kind of broken now, and I can add some more weights, like this. Okay, and now it becomes a problem: if I'd like to get 00:05:53.300 |
my keys over here... maybe I'll use my keyboard instead, I hope you are following me fast enough. Okay, so I have my query over here, which will have four dimensions. Let me remove this, push it down a little bit more, so I have four dimensions for my query. Now the scaled dot product is broken, because my query is four dimensions while my keys are three, so I do something 00:06:19.260 |
similar for my key: I'm going to move down, introduce a few more weights, initialize some random weights, maybe zero here, and then update my equation. So now I've managed to add one dimension, and if you see this inductively, you can see how it scales, 00:06:37.260 |
and the takeaway is that the key and query should have the same dimension, four and four, but the value doesn't have to. Typically we keep the value the same dimension for convenience, but theoretically, this is the way you can visualize the key and query increasing in size. 00:06:56.260 |
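A minimal numpy sketch of what the spreadsheet is doing at this point, with the toy sizes from the demo (10 tokens, embedding 5, key/query dimension grown to 4, value dimension 5). The variable names and the row-wise softmax orientation are assumptions; the sheet lays out the same math column-wise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, d_v = 10, 5, 4, 5

X = rng.normal(size=(n_tokens, d_model))   # 10 input tokens, embedding 5
W_q = rng.normal(size=(d_model, d_k))      # query weights
W_k = rng.normal(size=(d_model, d_k))      # key weights: must match W_q's d_k
W_v = rng.normal(size=(d_model, d_v))      # value weights: d_v is free

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)            # scaled dot product, shape (10, 10)
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)         # softmax: each distribution sums to 1
out = A @ V                                # attention-weighted values, (10, d_v)
```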
And what happens now? Let me see... I'll check this off; maybe use this highlighter to say I'm done with this. Okay, and when I add key and query dimensions, you realize nothing else changes, 00:07:12.260 |
other than the computational complexity per cell: now I have four things to take the dot product with. Okay, here, matrix multiplication, 00:07:21.900 |
softmax, this is the raw implementation of softmax, and we have another matrix multiplication here. All right, so what happens if you want to add a value dimension, 00:07:31.700 |
say you want to go to five, just to prove my point that it doesn't need to match my key and query dimension? 00:07:41.260 |
Okay, so I have five of these, and then I have five over here. Now, this is ugly, let's just add some more space here, 00:07:51.760 |
and insert here, and now I have five over here. All of a sudden, my attention-weighted values from this attention head are five-dimensional, 00:08:01.500 |
and I'm going to zoom out a little bit to see what happens when we count the attention heads: now we have the first attention head, second attention head, 00:08:10.500 |
and third attention head, and we have something that collects them, so this is where concatenation happens. Now we have two extra dimensions for my values, 00:08:19.000 |
and now it no longer fits. What I can do is increase the dimensionality of the attention-weighted values from the first head 00:08:28.700 |
and move up here. Now you can see that it all fits: I have five from the first head, three from the second head, 00:08:37.000 |
and three from the third head. This is three-head, multi-head attention. But this is kind of important: because my embedding dimension is five, 00:08:46.500 |
it has to be five consistently throughout your stack, so everything can be put together. But now I have one, two, three, four, five, 00:08:55.000 |
six, seven, eight, nine, ten, eleven, twelve. Twelve to five: that's what we need to project, so I have to introduce more weights. 00:09:06.000 |
I want to introduce three more weights, so I just copy them over here, and then I update my matrix multiplication, the linear projection, 00:09:15.000 |
and all of a sudden, things are working again. Actually, I had too many; I should have added two instead of three, okay, 00:09:23.500 |
two. Okay, so I'll check off my task. All right, now it matches, it's working again. So I just added two dimensions 00:09:34.000 |
of value for that particular attention head, and I showed you how everything else changes. Okay, anything else? 00:09:42.200 |
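A hedged sketch of the multi-head step just demonstrated: three heads whose value dimensions differ (5, 3, 3), concatenated and projected back to the embedding dimension with an output weight. The sizes mirror the demo; the `head` helper is an assumption for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model = 10, 5
X = rng.normal(size=(n_tokens, d_model))

def head(d_k, d_v):
    """One attention head; d_v need not match d_k."""
    W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = np.exp(Q @ K.T / np.sqrt(d_k))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

# First head grown to d_v = 5; the other two keep d_v = 3, as in the demo.
concat = np.concatenate([head(4, 5), head(3, 3), head(3, 3)], axis=-1)  # (10, 11)
W_o = rng.normal(size=(concat.shape[-1], d_model))   # linear projection back to 5
out = concat @ W_o                                   # (10, 5), fits the stack again
```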
How does the focus cell make the green cross along the row here? View, Show, Focus Cell, and select your green color. 00:09:56.200 |
I found green is pretty good for my class, to help you focus attention. Any questions? 00:10:03.200 |
What do I do next? Okay, where was I? So I checked off the value size; I just changed it. Going back: 00:10:15.200 |
vocabulary size. Let's talk about vocabulary size. At the end, we want this model to output a word, 00:10:26.200 |
or a probability distribution across the words. So at the end, remember, 00:10:33.200 |
this is the final output from the encoder: still ten tokens, each token five dimensions, right? 00:10:40.200 |
But suppose there's a vocabulary of 20 words, or in this case, for the original vision transformer, 00:10:48.200 |
a 20-class problem. Then we take the first token, which is the class token, 00:10:54.200 |
so we need to project from five to 20, so now you can visualize this linear projection right here, 00:11:01.200 |
so this is the linear layer, the last thing in your transformer stack, 00:11:06.200 |
and you do a softmax. Let's zoom in a little bit so you can see these values. 00:11:12.200 |
These values could be arbitrary numbers, but we want each to become a number between zero and one for a probability, 00:11:22.200 |
and all these numbers have to add up to one for a probability distribution. That's why we need softmax, 00:11:29.200 |
so this is linear and softmax. As you can see, for Llama 1, 2, 3, 4, 00:11:37.200 |
the vocabulary size goes from 32k, to 32k, to 128k for the multilingual models, to even more, 250k, for the multilingual, multimodal models. 00:11:49.200 |
They want to have more things they can predict. And that progression is mostly reflected in the last layer. 00:11:58.200 |
so instead of 5 to 20, for instance, we want to go from 20 to 30, what do I have to do? 00:12:05.200 |
Where do we have to grow this? Maybe that's a bit too much; let's just grow by 5. What I would do is count 1, 2, 3, 4, 5, 00:12:13.200 |
select 5 rows, and insert. Then I have to fill in these rows, initialized to 0; maybe I just copy all my 0s over here, and these are all the biases, 00:12:26.200 |
initialized to 0, 0, 0, 0. Now let's randomize this a little bit by adding some random ones here. So now I've updated the scores here, and then I also have to update the softmax layer, 00:12:40.200 |
adding the five rows in the middle, and now these equations get updated. So I just increased the vocabulary size from 20 to 25. 00:12:49.200 |
That is where you increase it. But interestingly, as you can tell, the only thing I touched is the very last layer; I didn't have to touch any internals of the transformer stack, okay? 00:13:03.200 |
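A hedged sketch of this output head in numpy: project the class token's 5 features to 20 logits, softmax them, then grow the vocabulary to 25 by appending rows of weights and biases. Only this last layer changes; the sizes and the small random initialization are assumptions matching the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab = 5, 20
h_cls = rng.normal(size=(d_model,))           # class-token output of the encoder

W_out = rng.normal(size=(vocab, d_model))     # linear head: 5 -> 20
b_out = np.zeros(vocab)                       # biases

logits = W_out @ h_cls + b_out
probs = np.exp(logits) / np.exp(logits).sum() # softmax: probs sum to 1

# Growing the vocabulary from 20 to 25 touches only this layer.
extra = 5
W_out = np.vstack([W_out, 0.01 * rng.normal(size=(extra, d_model))])
b_out = np.concatenate([b_out, np.zeros(extra)])
logits = W_out @ h_cls + b_out                # now 25 logits
```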
So if you're looking for this sheet, I can use this link; there's no newsletter subscription requirement, just go there and grab it so you can follow along. All right, so I've checked off vocabulary size now. The embedding dimension is a bit difficult to change, so I'm not going to change it. 00:13:30.200 |
I'll just say: if I had to change it, I'd have to change it here. This is from image patch to token, from 9 to 5. The 9 is because of a 3x3 window here, which gets flattened into a column vector of 9. 00:13:52.200 |
For a language application, you can think of it as going from a word in the vocabulary to an embedding, and we have an embedding space of 5. If I want to add one more dimension, I'd have to add one more row here, 00:14:07.200 |
but the problem is that everything else with dimension 5 has to be extended too. There are a lot of things that would have to change, so I'm not going to do that; I'll just restore it. 00:14:20.200 |
Actually, maybe I do want to do part of it. Let me just do it, but I'm not going to change the whole thing. 00:14:31.200 |
So now I just add one dimension. Where else do I have to change things? This goes to the norm, so this has to change as well. And this is the most important part: with one more embedding dimension, 00:14:44.200 |
I have to move this weight and add another column of weights in order to multiply them correctly. So you can see the impact of adding more embedding dimensions: your weights have to grow in this direction, like here, 00:15:06.200 |
and like here. This is too much change; I don't want to change all of this. Maybe I'll put a gray highlight, just to say that I sort of talked about it, but I'm not really implementing the embedding dimension change in my spreadsheet example. Okay, then, let's talk about grouped query attention, 00:15:31.200 |
or some attention stuff. Let's zoom out a little bit. Right now there are 3 attention heads; what does it take to add another attention head? What I would do is just create some more space here, 00:15:44.200 |
okay, and I could, actually no, let's do this, 00:15:50.200 |
Let's do this: I want to select here, come on, select this, and this will be empty space for me to add some stuff. Okay, so, do I have enough space? Let's just make a little more space, 00:16:17.200 |
so I can get a new head in here. What I can do is just copy this, and copy this. All right, now I have another set of weights. 00:16:29.200 |
In my presentations, I usually like to use red to highlight trainable weights, so these are all trainable parameters. When you hear "7 billion parameter model", you are referring to these 00:16:44.200 |
red-shaded, trainable parameters. The things I do not shade are not trainable parameters; they do not count toward the parameter count, but they still matter in terms of calculation. 00:16:56.200 |
They all need to be calculated, they all take up runtime memory, and you have to figure out how to fit it all into your GPU to optimize. But you can visualize 00:17:06.200 |
the computational complexity versus the parameter size of the model; when we talk about the size of a model, we tend to refer to these parameters. So here, okay, let's try to fix this. Where does it come from? 00:17:18.200 |
I had to fix this, so let me re-implement it. I'm going to zoom in a little bit. What do I do? I want to take the input and do a 00:17:30.200 |
matrix multiplication with this set of weights, and then go up to select the input. The heads all share the same input; they're all multiplying with the output from the previous layer, which is a norm. 00:17:46.200 |
Okay, so this is how I implement it. You can use F2 to examine it; just to show you, I can scroll up to see, okay, this is where it's selected, 00:17:58.200 |
and when I select this one, this is where it's selected: a different set of weights. Because I copied and pasted, all the weights are the same right now, but you would purposely change the values so the heads have two different sets of weights. Okay, so now 00:18:14.200 |
I've just manually implemented a new head, and the rest is the same. You take the first third and move it over to the query, take the second 00:18:26.200 |
third and put it on the key, but I have to transpose it. Maybe I can use my pen to draw the data flow, like here and here, and then you do a Q times K-transposed 00:18:40.200 |
multiplication to get the scaled dot-product attention, and then softmax, softmax, softmax. Okay, it's hard to write with my mouse. So I have three different heads, 00:18:54.200 |
and what do I get? I get more attention-weighted values here. We'll have to bring these three back together again, so let me just move up three spaces 00:19:06.200 |
to create some space here. I'm going to select here to concatenate; what I'm implementing is concatenation. When you concatenate tensors, that's what's happening here. 00:19:16.200 |
If I do that, this whole thing gets taller, so I have to extend my weight matrix sideways by the same length. I'm going to copy three sets of weights here, and once I do that, 00:19:30.200 |
I'll be able to update my matrix multiplication, the linear projection over here, to match. If you read the papers, they usually say W_out, or the down projection, W_down; that refers to this matrix. 00:19:46.200 |
This matrix can be pretty big too, but sometimes, if you divide your attention dimensions the right way, you might not even need it. Suppose instead of five, the embedding is, say, 100, and we have 20 heads, 00:20:12.200 |
and each head has five dimensions. When you concatenate all of that, you get 100 back, and in that case you can skip this dimensionality change with this matrix. Anyway, that's how things are impacted when you add one more head. So, did I add more heads here? Where is it... heads, okay. When we add more heads, that's what happens: 00:20:38.200 |
adding heads from Llama 2 to Llama 3, over here. I think I've implemented this, so I'm going to highlight it, okay, and come back. All right, let me see, I'm just monitoring the chat. Maybe I should take one question before I move on; I want to talk about grouped query attention next. 00:21:04.200 |
Yeah, there's a question in the chat about vocabulary: how much does it cost in vocab size to add native understanding and generation of images and audio? 00:21:13.200 |
Text-only Llama had a 128k vocab, but multimodal Llama had 256k. I wonder if you want to address that before you go on, yeah. 00:21:22.200 |
So, if I zoom out a little bit: the vocabulary size only has an impact on the output here, so you grow this for the output vocabulary, and if you have three modalities, this just gets longer. Then you also need to figure out a way to convert your input into the embedding here, so this gets a lot longer over here too. But internally, once you pick the model size, this is going to be the same. 00:25:54.200 |
We have not significantly introduced a lot more weights. 00:28:17.200 |
So I have an extra one to take care of the bias. 00:28:25.200 |
So surprisingly, this is the only thing I had to change. 00:28:29.200 |
That said, all these weights are going to be different, 00:28:30.200 |
even though I copied and pasted the same weights. 00:28:39.200 |
So you can visualize that the weights just increased by two times. 00:29:06.200 |
I'm going to color this into a different color. 00:29:37.200 |
For instance, I could totally just remove this. 00:29:44.200 |
What makes this work is the fact that they all have a fixed dimension of five. 00:29:48.200 |
That's how you can stack them up very easily. 00:29:51.200 |
There aren't a lot of dimension changes between the blocks, 00:29:55.200 |
but there are dimension changes internally, in the attention layer. 00:29:59.200 |
For instance, you can see one go from five to three, three, four, 00:30:02.200 |
and one of the other heads from five to three, three, three, 00:30:11.200 |
because you'd like them to all come back to the same five dimensions. 00:30:23.200 |
You can see the one attention layer over here. 00:30:26.200 |
Do you need skip connections for the residual stream? 00:30:35.200 |
That's what allows information to flow all the way down. 00:30:46.200 |
So I'm adding the blue thing with this orange, 00:30:49.200 |
the red thing coming all the way from above here. 00:35:18.200 |
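A hedged sketch of the skip connection being drawn here: the output of each block is added to the block's input, so the "red thing from above" flows straight down the stack. The `block` stand-in and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(10, 5))        # 10 tokens, embedding 5

def block(h):
    """Stand-in for norm + attention (or the MLP): any (10, 5) -> (10, 5) map."""
    W = rng.normal(size=(5, 5))
    return np.tanh(h @ W)

h = x + block(x)    # skip connection: block output plus the residual stream
h = h + block(h)    # the same pattern repeats at every block in the stack
```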
where we were. Oops. Too much. Just highlight this. I did this. Now I did this. Okay. All right. 00:35:35.060 |
Oh, another thing is that you want to highly optimize. If you've got a way to fit this 00:35:41.440 |
fixed-dimension block most nicely on the GPU, if you get an option for that, you want to be able to use 00:35:49.800 |
it for all your blocks here. So you optimize only once, and it can be used for the entire stack. Does 00:35:57.860 |
it make sense? I hope so. Oh: if a new model architecture is designed, how is the number 00:36:05.860 |
of layers determined? Trial and error. I think it's just kind of trial and error as well, 00:36:09.720 |
and also how many GPUs you have to train them. Only a few companies in this world have 00:36:16.900 |
gigantic GPU farms; most universities don't. So if you look at it, it's non-trivial. 00:36:24.440 |
I mean, what was I looking at? Layers. Where are the layers? 32 layers. That seems like not that many, 00:36:33.600 |
right? But it's huge. It's a lot of layers to train, and there are 32 steps to propagate through. 00:36:41.100 |
Then you have a huge batch that has to go through for your gradient descent. It's very expensive. 00:36:47.840 |
Okay. But now I want to talk about, maybe this one, the feed-forward network. I haven't spent any time 00:36:54.840 |
talking about the feed-forward network, so let's talk about mixture of experts. I think around this time 00:36:58.920 |
is when MoE became popular. There were some variants in the Llama 3 days that used mixture of experts, 00:37:04.020 |
maybe not obviously, but clearly Llama 4 uses MoE now. That's for sure. So let me just make a copy 00:37:15.020 |
and work on this copy. Okay. MoE: the MoE layer happens at the MLP feed-forward network here. 00:37:26.200 |
Okay. So what does it mean to have two experts? Basically, this is my MLP block, and I just copy it. 00:37:38.920 |
And this is it. Now I have two experts. Well, not quite yet; I have to fix this a little bit. 00:37:44.100 |
This one is a linear projection from the input, the output from the previous layer, 00:37:51.020 |
with this weight. So now I just have to modify this to select the right input. 00:37:54.920 |
It should select the same input... actually, what happened here? 00:37:59.060 |
Now I move this. Let's move. Doesn't let me. I cannot select this. 00:38:09.060 |
All right. Now I'm done, and I have to add these together. 00:38:14.800 |
So I would take this, this, add this, and this. 00:38:19.160 |
The easiest implementation is that I just place a plus and add them. 00:38:25.380 |
All right. Okay. So now I have the most basic mixture of experts, 00:38:31.000 |
with constant, equal weights. Okay. 00:38:35.960 |
I can repeat the same process to do another one, my third expert: 00:38:41.620 |
I'll repeat the same process and I'll update my input to share the same input from before. 00:38:51.860 |
And then I'll come down here. I also just add the new output. 00:38:57.120 |
Now they are equally weighted. So this is the equally weighted scenario. 00:39:01.780 |
And then the question in your mind is: how can we 00:39:06.840 |
have a mechanism so that they are linearly weighted, 00:39:11.680 |
weighted differently? Sometimes this expert gets more weight, 00:39:15.580 |
sometimes that expert gets more weight. 00:39:19.000 |
Also, we have 10 tokens, and maybe for a particular token, we'd like one expert to receive more weight than the others. 00:39:28.560 |
So what we can do is that we could create another network over here. 00:39:32.320 |
I'm going to build another one here, called a router network, 00:39:35.720 |
and we need at least three different sets of weights, so we can calculate one gate value for each expert. 00:39:43.560 |
So to map it out, what I want is three experts, 00:39:55.560 |
so I want a three by ten matrix of gate values. 00:40:10.200 |
Now, how can I get a three by ten matrix from here? 00:40:23.740 |
If I shift everything to the right, I can have some more space to work with. So I'm going to use insert, insert, 00:40:31.180 |
shift cells to the right. Will that work? I don't like this color, let's just remove it. 00:40:38.540 |
Okay, so now I map this out. I want one, two, three, four, five; this is my weight matrix, 00:40:46.140 |
this is how many weights I want, so I can fill in zeros: zero, zero, zero, zero... 00:40:56.860 |
Okay, so these are the weights, just initialized to zero. And if I matrix multiply my weights here 00:41:04.940 |
with my tokens here, all zeros just gives me zeros, so let me put in random weights instead, 00:41:13.740 |
so those are my random weights for computing the gating values. So now I have all this, okay, and then 00:41:20.860 |
I would like to do a softmax, so these are my relative weights, the terms learned by this network. 00:41:29.420 |
Let me just shade this, following the convention, so these are the trainable weights. And to implement 00:41:35.100 |
softmax, I'll do the exponent of this by column, and then divide by the sum of the same 00:41:42.860 |
exponents. So this is my softmax. You can notice this thing: even though it shows .414 here, 00:41:47.900 |
but if I show you more, let's zoom in more, you can see, 00:41:51.660 |
these three numbers will add up to one; that's what softmax gives you. So I'll copy the softmax over here, 00:42:00.620 |
and that's all my softmax. Each column will be the gate values, across like this; this is what the tensor 00:42:05.580 |
looks like for your gate. And once we have that... I just lost track, okay. So I will bring these 00:42:14.300 |
gate values over here. Maybe just create some spaces here; I'll create some spaces here, and now put the 00:42:21.420 |
gate values close by, here. So here is my gate value: the first expert will use this row, 00:42:31.180 |
and per token, that weight is slightly different. My second expert will, 00:42:36.700 |
coming back here, use this row, and my third expert will use this row as its predicted gate value; 00:42:47.500 |
it's all kind of high, and then, let me just create some more space, I need some more space again, 00:42:55.260 |
okay, and then I will do an element-wise multiplication, to take the output from the first... 00:43:02.940 |
one more time: the output from the first expert, multiplied by the gate value, per token, and I'll have the 00:43:12.700 |
gated output for that expert, okay. And then this logic is the same for the other experts; I can just copy the 00:43:20.220 |
formula, so now instead of adding these three rows as is, as an equally weighted combination, 00:43:28.220 |
now I have a gated combination, so I go back to modify my equation here, so I just drag this down here, 00:43:36.860 |
so now I finish my... come on, come on, here, where is my handle, 00:43:48.540 |
it doesn't... oh here, okay, here, one more, one more, 00:43:58.220 |
it doesn't let me... one more, move a little bit, oh here, okay, drag here, down here. Okay, now I've finished. 00:44:04.460 |
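A hedged sketch of the dense mixture of experts built so far: three MLP experts, a router whose softmax produces per-token gate values, and a gated sum of the expert outputs. The toy sizes (10 tokens, embedding 5, hidden 4) and the `make_expert` helper are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_tokens, d_model, d_hidden, n_experts = 10, 5, 4, 3
X = rng.normal(size=(n_tokens, d_model))

def make_expert():
    """One MLP expert: linear -> ReLU -> linear, back to d_model."""
    W1 = rng.normal(size=(d_model, d_hidden))
    W2 = rng.normal(size=(d_hidden, d_model))
    return lambda h: np.maximum(h @ W1, 0) @ W2

experts = [make_expert() for _ in range(n_experts)]

W_gate = rng.normal(size=(d_model, n_experts))  # trainable router weights
logits = X @ W_gate                             # (10, 3) router scores
gates = np.exp(logits)
gates /= gates.sum(axis=-1, keepdims=True)      # softmax: per-token gates sum to 1

# Gated combination instead of the equally weighted sum from before.
out = sum(gates[:, [i]] * experts[i](X) for i in range(n_experts))
```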
And for sparse MoE, you pick, say, the top two: you have a mechanism that just picks the two 00:44:13.180 |
highest values and sets the other ones to zero, for instance: zero, four, one, zero. So you could have an 00:44:20.380 |
equation to do something like this, and then use that as the gate value instead. 00:44:25.820 |
That would be a sparse mixture of experts, but in Excel, I find there's no easy way to do 00:44:31.980 |
this, so I'm not going to bother today. Okay, so now I think I've finished this as well. 00:44:40.700 |
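A hedged sketch of the top-2 sparse gating step that's awkward in Excel: keep the two largest router scores per token, force the rest to zero, and renormalize so the surviving gates still sum to one. Self-contained, with made-up router scores.

```python
import numpy as np

rng = np.random.default_rng(5)
logits = rng.normal(size=(10, 3))                 # router scores: 10 tokens, 3 experts

k = 2
top = np.argsort(logits, axis=-1)[:, -k:]         # indices of the top-2 experts
masked = np.full_like(logits, -np.inf)            # -inf becomes a gate of exactly 0
np.put_along_axis(masked, top, np.take_along_axis(logits, top, axis=-1), axis=-1)
gates = np.exp(masked - masked.max(axis=-1, keepdims=True))
gates /= gates.sum(axis=-1, keepdims=True)        # sparse gates: rows sum to 1
```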
So, did I do mixture of experts? Okay. Anyway, I think I could cover flash attention, 00:44:49.820 |
or just have a conversation right now. "Is there room for improvement within the domain of attention 00:44:59.260 |
mechanisms, NSA being the latest innovation from DeepSeek, or is there a natural boundary, in your opinion?" 00:45:04.460 |
I think NSA is native sparse attention; do you want me to do that? 00:45:11.580 |
"For sparse MoE, do you normalize after selecting the top K?" I guess, probably, if you happen to 00:45:26.140 |
have RMSNorm along the way, then normalization is probably not as necessary, 00:45:35.980 |
because that will learn to normalize your values across your experts, 00:45:42.700 |
and they'll normalize in some way. So there's no theoretical justification 00:45:50.140 |
one way or the other, but I guess it's just, empirically, if you commit to add a normalization 00:45:56.060 |
layer, and you've committed three months to train your model, you've already committed to it; 00:46:02.700 |
it's too late for you to change to something else. But maybe the benefit is marginal, 00:46:07.180 |
maybe there's no benefit, you don't know, but it also doesn't hurt for you to try, um, okay, so, 00:46:18.060 |
So: "Does the input and output dimension for each expert match the model dimension?" 00:46:22.860 |
Yes, although it doesn't have to. For instance, what if I want one of the experts to 00:46:32.620 |
output more? How can we have, say, expert three output a longer token? We could add another 00:46:42.620 |
row of weights here. Let me show you: I just move down two spaces, and maybe add two 00:46:49.740 |
sets of weights. This is similar to adding two nodes in this MLP right here, 00:46:57.100 |
so now, all of a sudden, I have seven instead of four. I update my ReLU here; the ReLU is, I think, 00:47:04.300 |
automatically updated here. Okay, so I've got seven, now I have seven. The problem is that now you 00:47:10.140 |
cannot add them together. So what do we do? Eventually, most likely, you have to 00:47:17.580 |
project this down to five anyway, to be able to add them together. Or you can concatenate, but 00:47:24.540 |
with concatenation, eventually you still have to project it to a set dimension to match, 00:47:31.340 |
to be able to work with the other layers. They're kind of Lego pieces: all these things 00:47:38.140 |
have to have the same dimension for them to stack up together; you cannot have arbitrary ones. But in 00:47:43.420 |
theory, you can, why not: you can just add a particular layer over here, and then 00:47:49.500 |
add another linear projection to project back. So what I'll do is count seven, one, two, three, four, five, six, seven, by five; this is how much I need. 00:48:00.700 |
Then I need to change... it's kind of ugly now; let's borrow the space over here, 00:48:07.740 |
and then, if I do a matrix multiplication, we're back to five dimensions per token, 00:48:18.220 |
and when I go over here and fix it, move my green one down here, all of a sudden, the equation 00:48:26.460 |
works again. Okay, so now it's fixed here; let me just select it, okay, here, okay. 00:48:37.580 |
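A hedged sketch of the point just made: an expert can use a wider hidden layer (seven instead of four here), as long as a final linear projection brings its output back to the shared five dimensions so it can be added to the other experts. The sizes are the demo's; the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n_tokens, d_model, d_wide = 10, 5, 7
X = rng.normal(size=(n_tokens, d_model))

W1 = rng.normal(size=(d_model, d_wide))       # widened hidden layer: 5 -> 7
W_proj = rng.normal(size=(d_wide, d_model))   # extra projection: 7 -> back to 5

expert3_out = np.maximum(X @ W1, 0) @ W_proj  # (10, 5): stacks like a Lego piece again
```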
What else? Should I talk about NSA? We have 10 minutes. Do we want to hear about flash attention, or NSA, 00:48:52.220 |
or any of these over here: RoPE, RMSNorm, flash attention? 00:48:59.740 |
NSA, native sparse attention, so, sparse attention. You can see that this is dense attention: 00:49:08.460 |
we derive the entire query over here, okay. So, my understanding of sparse 00:49:18.380 |
attention is... let me just move a copy here, create a copy here. So, this is my 00:49:30.780 |
attempt to explain NSA, maybe, based on my memory, I hope I'm doing it correctly, so what we can do, 00:49:39.420 |
is this: now I have 10 tokens. What if I take three tokens here, okay, three tokens here, and try to merge them 00:49:47.980 |
into one token, so it becomes more sparse? It sounds like it makes sense. Maybe this is too big; 00:49:54.780 |
let's use a smaller one, maybe here. I want to take these three tokens and somehow 00:50:02.220 |
condense all the information I need into one. Let's just draw some space here; this is my goal: 00:50:08.300 |
I want to take these three tokens, let's highlight them green, into this blue one, okay. 00:50:16.620 |
so what do we need to do, so first, we need to flatten this, so we have nine elements here, and from nine 00:50:23.020 |
to three. How do we go from nine to three? We need a linear projection here, so that weight is three by 00:50:28.300 |
nine. Let me make sure it's right here. I'm lazy enough right now that I can use RANDARRAY, and we get 00:50:35.100 |
three rows and nine columns, so I immediately get the matrix I want, 00:50:41.740 |
and then I will just do a matrix multiplication. Matrix multiplication... come on. 00:50:49.260 |
You have to do this, and take this, okay. And then I take this... I cannot do this yet; what I have 00:50:57.580 |
to do is convert it into a column. It's too small, 00:51:10.140 |
I cannot see it, but I'll just stretch it out. Okay, so then I get this; now I'm done, 00:51:17.740 |
okay. So let's just review this, all right. And if I repeat the same 00:51:23.660 |
thing for these three, so I have this one, and I select the same weight, they share the same weights, 00:51:32.460 |
okay. And then the last one, another block here: I repeat this, copy it over here, 00:51:39.500 |
and just select the same weights. Now I'm working with only three queries, do you see this? Only 00:51:48.220 |
three query vectors. So then, for the attention weight matrix here, let's re-implement it. I'm going to 00:51:55.740 |
create some space for me, ah, come on, this is ugly, but, well, how do I fix it real quick, 00:52:04.220 |
let me just erase it; sometimes it's trying to be smart, trying to figure out a format for me. 00:52:13.820 |
Okay, so now I just redo this: I have my first query, and then my second 00:52:24.060 |
query, and my third query. And then we do the same thing for my keys; I think copy and 00:52:32.060 |
paste might just work. Let me do it... is it right? Oh yeah, actually it works. And then you 00:52:37.740 |
can notice that you have a different set of weights, okay, and now I can do the same thing here, I want to 00:52:51.660 |
and then one, two, three, and actually transpose another key, 00:52:57.580 |
and transpose another key here, all right, and then, all of a sudden, 00:53:06.380 |
do you see the attention weight matrix? The dot products I need to do are a lot fewer; it's only here. 00:53:16.540 |
I'm going to take a comma and select... let me redo it one more time: matrix multiplication, select my 00:53:27.660 |
key, and select my query here, okay. And then usually I also divide by the square root of the dimension of the key, which is three; that's typical. All right, so this is what we have, and then we take the softmax here. 00:53:34.540 |
I can copy the format here so it looks pretty. Okay, so now I can zoom out, and you can see the difference in terms of the computational complexity. The original is not sparse; you can see 00:53:50.540 |
here: instead of all this, now I have just this, okay. And then the "native" part: the native part is that this compression is part of the training. 00:54:12.540 |
So I'm going to do this shading action, just for emphasis, to show you: this is what "native" refers to, training this as part of your sparse attention mechanism, okay. 00:54:36.540 |
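A hedged, from-memory sketch (matching the speaker's own caveat) of the compression just built: each block of three tokens is flattened to nine numbers and projected down to one three-dimensional token by a shared, trainable three-by-nine weight, and attention then runs over the compressed tokens. Real NSA differs in details; this follows the demo, where the queries are compressed too.

```python
import numpy as np

rng = np.random.default_rng(7)
d, block = 3, 3
tokens = rng.normal(size=(9, d))           # 9 tokens, 3 dimensions each

W_cmp = rng.normal(size=(block * d, d))    # trainable compression: 9 -> 3
compressed = np.stack([
    tokens[i:i + block].reshape(-1) @ W_cmp    # flatten a block, project down
    for i in range(0, len(tokens), block)
])                                         # (3, d): one token per block

W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Q, K = compressed @ W_q, compressed @ W_k  # only three queries and keys now
scores = Q @ K.T / np.sqrt(d)              # 3x3 instead of 9x9 dot products
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)         # softmax over far fewer entries
```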
Now let's look at a sparse attention mechanism without the learned part: how does that work? How can I convert this one? Maybe just do a copy... actually, maybe not, let's just change it. So I have this matrix multiplication for the whole thing, and what I can do is just this: 00:54:50.540 |
in this case, each query is only going to compare with the keys in its neighborhood. So I can repeat this, copy and paste over here, and it slides; 00:55:06.540 |
select here, and here, okay, and then repeat this again, but I'll make sure I select the right keys. Maybe this time I just do four of them, since ten isn't a multiple of three; the last one gets four. Okay, so now it's a lot sparser. You see, this is sparse, and this is dense, 00:55:27.540 |
sparse versus dense. But the difference is that here I just said, okay, match with your neighbors; there are no extra 00:55:35.540 |
learnable parameters involved, no extra network you learn to do this. In the other case, it's native sparse attention, and it somehow works. 00:55:52.540 |
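A hedged sketch of this non-learned alternative: each query attends only to the keys in its own block of three neighbors (the stride-three, no-overlap pattern a listener notes just below), with no extra trainable weights. Sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 3
tokens = rng.normal(size=(9, d))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Q, K = tokens @ W_q, tokens @ W_k

blocks = []
for start in range(0, len(tokens), 3):        # stride 3, no overlap
    q, k = Q[start:start + 3], K[start:start + 3]
    s = q @ k.T / np.sqrt(d)                  # 3x3 local scores per block
    a = np.exp(s)
    a /= a.sum(axis=-1, keepdims=True)        # softmax within the neighborhood
    blocks.append(a)                          # far fewer dot products than dense
```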
Well, you can visually see the efficiency, in terms of the computational complexity: just fewer matrix products to compute. But you have to give DeepSeek credit, because theoretically it sounds fine, but you're talking about tens of millions of dollars to even just experiment, to see whether it works. What if it didn't work? It could be the case that they tried things that didn't work, but they tried anyway, 00:56:14.540 |
and they were lucky this one worked, so they wrote a report about it. But I bet they probably tried five or six other things that didn't work, and they never talked about those. 00:56:21.540 |
"So, this is like a convolution with stride three, with no overlap?" That's correct. 00:56:35.540 |
I think we're running out of time, two minutes. 00:56:39.540 |
I enjoyed it. Although... I can show you my practice sheet. 00:56:50.540 |
Real quick: if I stop sharing and share again... this is my internal practice sheet, not as pretty, but if you are wondering, these are the things I did not talk about today. 00:57:17.540 |
So this is going from LayerNorm to RMSNorm, and then what else did I talk about... oh, here, so this is RoPE. It's kind of complicated, yeah, this is RoPE. 00:57:50.540 |
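A hedged aside, since RMSNorm is only named in passing here: LayerNorm subtracts the mean and divides by the standard deviation, while RMSNorm skips the mean-centering and just divides by the root mean square, with a learned gain. A minimal sketch:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    """RMSNorm: scale by the root mean square, no mean subtraction."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps) * g

x = np.array([1.0, -2.0, 3.0, 0.5, -1.5])     # one 5-dimensional token
y = rms_norm(x, g=np.ones_like(x))            # g is the trainable gain
```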
But I think I can tell you the high-level stuff. You can see that I put in this block of actual computation. 00:57:58.540 |
Number one, it's not shaded, so it's not trainable; it's all pre-computed. All the rotation matrices can be pre-computed. 00:58:05.540 |
Number two, it's really close to the attention head right here, whereas the original position encoding is only injected right in the beginning. 00:58:15.540 |
All right, and then you hope all these skip connections bring the information down. 00:58:20.540 |
Now you have 32 layers to do so, and you're lucky if the position encoding has any impact down the stack. 00:58:27.540 |
But with RoPE, you add this computation at each head, at each head level, in every single stack, getting the position really close to where the attention matters. 00:58:38.540 |
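A hedged sketch of RoPE as described: pre-computed, non-trainable rotations applied to the query and key vectors at each head, instead of a position encoding added once at the bottom of the stack. Pairs of dimensions are rotated by a position-dependent angle; the base and sizes follow the common convention and are used here as assumptions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate pairs of dims of x (n_tokens, even d) by position-based angles."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                 # token positions 0..n-1
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    ang = pos * freqs                           # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)         # all pre-computable, not trained
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(9)
Q = rng.normal(size=(10, 4))                    # one head's queries
Q_rot = rope(Q)                                 # applied per head, at every layer
```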
So that is probably the takeaway, that being able to visualize the difference between the two. 00:58:46.540 |
There's a way to visualize this: RoPE is right here, 00:58:50.540 |
whereas the original position encoding, right, is at the top. 00:58:55.540 |
Okay. So hopefully I still talked about everything, not at the same level of depth, a bit unstructured, and hopefully you guys had fun. 00:59:11.540 |
It's always good to see a different perspective on how this stuff works. 00:59:14.540 |
I'm sure people learned a lot from you on that. 00:59:17.540 |
Yeah, this reminds me a lot of like Ishan's walkthrough, but like actually, higher level, like you did a lot of work in like reducing the dimension so that we can actually hold it in our heads, which I think is very important. 00:59:33.540 |
And if you get the three-dimensional version right, then in the future, I can see you just vibe coding 00:59:39.540 |
with a model to expand this to a higher-dimensional space, with the same underlying math. 00:59:46.540 |
The math is easier to work out in the lower-dimensional space. 00:59:52.540 |
Well, you know, just to be respectful of your time. 00:59:55.540 |
If people want more, you know, where's the best place to find the rest of your work? 01:00:03.540 |
Honestly, I don't often get to talk to an audience like this, 01:00:09.540 |
so I was curious how it would be received, because with my own students, I cannot go at this pace. 01:00:21.540 |
I got to talk about one semester's worth of stuff in one hour, 01:00:26.540 |
and I expected that you could follow most of it or all of it. 01:00:31.540 |
So I enjoyed it; thank you for giving me the opportunity to go nerd out. 01:00:39.540 |
I mean, we, we've been covering a paper here every week for the last two years. 01:00:43.540 |
So, um, there's, there's been a lot of interesting, uh, lectures and papers and, um, yeah, definitely 01:00:52.540 |
Some of this is new, and a new perspective on the same thing is actually always useful. 01:00:55.540 |
So did the Excel online version keep up fast enough for you to follow? 01:01:03.540 |
I'm curious from your side of the experience. 01:01:10.540 |
And you were able to also check the equations for me live, and yeah. 01:01:21.540 |
So I spent quite a lot of time to think about how I should cover this. 01:01:25.540 |
And I'm glad most of you stayed, and I hope to do this again sometime. 01:01:34.540 |
The live spreadsheet was very, very useful. 01:01:38.540 |
I feel like I still have a lot to dig back into, but yeah, the formulas showing what 01:01:44.540 |
multi-headed attention is transposing and stuff. 01:01:51.540 |
Thank you for the invitation for this, this opportunity. 01:01:53.540 |
I hope that I'll have opportunity to come back. 01:01:55.540 |
Because then I can geek out with you. 01:02:00.540 |
Um, I think for next week, I'm probably going to invite someone from Prime Intellect to cover 01:02:23.540 |
You know, I think the really interesting 01:02:26.540 |
thing is: does RL and long chain-of-thought training actually introduce 01:02:35.540 |
new training paradigms where the hardware requirements are actually 01:02:40.540 |
different, and they actually don't benefit from the normal centralization factors? 01:02:53.540 |
Uh, so Sean, is it possible that you can share with me the chat history?