Llama 1/2/3/4 by Hand: Prof Tom Yeh

I'm going to show you how we can build deep learning 00:00:23.800 |
architecture, from the Transformer to a lot of newer stuff. 00:00:27.820 |
Or do you also want me to stop to take questions? 00:00:30.320 |
How interactive is this session supposed to be? 00:00:45.540 |
We don't really look that deep into benchmarks. 00:00:52.260 |
We'll try to be interactive as if I'm teaching. 00:01:05.260 |
So you have a lot of things you know better than I do. 00:01:08.400 |
So what I want to share is my live Excel spreadsheet. 00:01:12.140 |
So I've been using this a lot to teach deep learning architecture. 00:01:16.800 |
And so this is sort of my plan: to take you from the vision transformer to Llama 1, 2, 3, 4. 00:01:23.480 |
And as I said, a few of the things being introduced you might have heard of or read about in the papers. 00:01:30.860 |
So we'd like to talk about RoPE, RMSNorm, grouped query attention for Llama 2, flash attention, interleaved attention for Llama 4, and mixture of experts. 00:01:43.040 |
A bit more like high-level, well, I wouldn't say high-level, it's an Excel-level overview of what is happening, some live coding. 00:01:51.720 |
And so I would like you to go to... I actually prepared a link for you, to help you access the spreadsheet very easily. 00:02:02.820 |
I should be a bit more prepared, but I don't know why. 00:02:07.240 |
So can you give me a... some of you might have access to the document already; maybe you can give a thumbs up so I know how many of you do. 00:02:17.240 |
I just shared it in the Zoom chat, by the way. 00:02:20.240 |
I'm going to share it on the Zoom chat right now, so you have direct access. Wait, where's my Zoom chat? 00:02:29.340 |
Okay, here is my Zoom chat. I have another version, but that one has a mailing list subscription gate. 00:02:39.000 |
So this is the non-gated version of the link. You can go there, and you will see the top link is a live version of the same Excel sheet. 00:02:49.760 |
There's also a baseline version that I'm not going to touch today, so you can compare how things evolve. 00:02:56.860 |
So that's my plan. Now, a quick overview of my transformer, the vision transformer architecture, as expressed in Excel. 00:03:05.960 |
If you zoom out, you can see this whole stack: the input stack, normalization, the self-attention stack, the feed-forward, and the output layer with the softmax, linear layer, and loss gradient. 00:03:19.960 |
We're going to go through and explain this, and the reason I picked it is that the transformer stack part is pretty common across modalities. 00:03:29.060 |
So I picked the vision transformer just to give the idea that anything that can convert your input into tokens, you can put into a transformer stack. 00:03:41.160 |
I picked the vision transformer stack; from this point on is where a transformer encoder starts, or a decoder. 00:03:46.260 |
In this case, it's an encoder, but for GPT, it's a decoder. 00:03:49.260 |
Anyway, let's take the first challenge. I hope you can get it right away, because it's not technical; you have seen transformers before, you just might not have seen this format. 00:04:01.260 |
But you'll get it very quickly. So I want to bring your attention to the attention layer here. Let's go back to my plan: from the transformer to Llama 1, 2, 3, 4, a lot of things scale up, just increases in dimensionality. 00:04:23.360 |
So we're going to focus on query and key: if we increase the query and key dimensions, what happens? In this case, let me read this a little bit. If you zoom out, these are tokens, we have 10 tokens here, and this is the embedding dimension, currently five. So this is the input, and to start the attention stack, we multiply it with all these weight 00:04:53.340 |
matrices, and I get query and key here, and then we get attention. It's scaled dot-product attention, and we softmax; that's how you can read it. Per column, the softmax gives a distribution, and then you multiply with your value. That's how attention works, the way I visualize it. So what happens if I want to add a dimension? This is three, the key 00:05:23.320 |
dimension here, you can visualize three, over five tokens. Okay, so far so good. Now, if I want to add one dimension, what I would do live is shift this down here, so this is kind of broken now, and I can add some more weights, like this. Okay, and now it becomes a problem: if I'd like to get 00:05:53.300 |
my keys over here... maybe I'll use my keyboard instead, I hope you are following me fast enough. Okay, so I have my query over here, which will have four dimensions. Let me remove this, push it down a little bit more, so I have four dimensions for my query. Now the scaled dot product is broken, because my query is four dimensions while my keys are three, so I do something 00:06:19.260 |
similar for my key: I'm going to move down, introduce a few more weights, initialize some random weights, maybe zero here, and then update my equation. So now I've managed to add one dimension, and if you see this inductively, you can see how it scales, 00:06:37.260 |
and the takeaway is that the key and query should have the same dimension, four and four, but the value doesn't have to. Typically we keep the value the same dimension for convenience, but theoretically, this is the way you can visualize the key and query increasing in size. 00:06:56.260 |
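A minimal numpy sketch of what the spreadsheet is doing at this point, with the toy sizes from the demo (10 tokens, embedding 5, key/query dimension grown to 4, value dimension 5). The variable names and the row-wise softmax orientation are assumptions; the sheet lays out the same math column-wise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, d_v = 10, 5, 4, 5

X = rng.normal(size=(n_tokens, d_model))   # 10 input tokens, embedding 5
W_q = rng.normal(size=(d_model, d_k))      # query weights
W_k = rng.normal(size=(d_model, d_k))      # key weights: must match W_q's d_k
W_v = rng.normal(size=(d_model, d_v))      # value weights: d_v is free

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)            # scaled dot product, shape (10, 10)
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)         # softmax: each distribution sums to 1
out = A @ V                                # attention-weighted values, (10, d_v)
```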
And what happens now? Let me see... I'll check this off; maybe use this highlighter to say I'm done with this. Okay, and when I add key and query dimensions, you realize nothing else changes, 00:07:12.260 |
other than the computational complexity per cell: now I have four things to take the dot product with. Okay, here, matrix multiplication, 00:07:21.900 |
softmax, this is the raw implementation of softmax, and we have another matrix multiplication here. All right, so what happens if you want to add a value dimension, 00:07:31.700 |
say you want to go to five, just to prove my point that it doesn't need to match my key and query dimension? 00:07:41.260 |
Okay, so I have five of these, and then I have five over here. Now, this is ugly, let's just add some more space here, 00:07:51.760 |
and insert here, and now I have five over here. All of a sudden, my attention-weighted values from this attention head are five-dimensional, 00:08:01.500 |
and I'm going to zoom out a little bit to see what happens when we count the attention heads: now we have the first attention head, second attention head, 00:08:10.500 |
and third attention head, and we have something that collects them, so this is where concatenation happens. Now we have two extra dimensions for my values, 00:08:19.000 |
and now it no longer fits. What I can do is increase the dimensionality of the attention-weighted values from the first head 00:08:28.700 |
and move up here. Now you can see that it all fits: I have five from the first head, three from the second head, 00:08:37.000 |
and three from the third head. This is three-head, multi-head attention. But this is kind of important: because my embedding dimension is five, 00:08:46.500 |
it has to be five consistently throughout your stack, so everything can be put together. But now I have one, two, three, four, five, 00:08:55.000 |
six, seven, eight, nine, ten, eleven, twelve. Twelve to five: that's what we need to project, so I have to introduce more weights. 00:09:06.000 |
I want to introduce three more weights, so I just copy them over here, and then I update my matrix multiplication, the linear projection, 00:09:15.000 |
and all of a sudden, things are working again. Actually, I had too many; I should have added two instead of three, okay, 00:09:23.500 |
two. Okay, so I'll check off my task. All right, now it matches, it's working again. So I just added two dimensions 00:09:34.000 |
of value for that particular attention head, and I showed you how everything else changes. Okay, anything else? 00:09:42.200 |
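A hedged sketch of the multi-head step just demonstrated: three heads whose value dimensions differ (5, 3, 3), concatenated and projected back to the embedding dimension with an output weight. The sizes mirror the demo; the `head` helper is an assumption for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model = 10, 5
X = rng.normal(size=(n_tokens, d_model))

def head(d_k, d_v):
    """One attention head; d_v need not match d_k."""
    W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = np.exp(Q @ K.T / np.sqrt(d_k))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

# First head grown to d_v = 5; the other two keep d_v = 3, as in the demo.
concat = np.concatenate([head(4, 5), head(3, 3), head(3, 3)], axis=-1)  # (10, 11)
W_o = rng.normal(size=(concat.shape[-1], d_model))   # linear projection back to 5
out = concat @ W_o                                   # (10, 5), fits the stack again
```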
How does the focus cell make the green cross along the row here? View, Show, Focus Cell, and select your green color. 00:09:56.200 |
I found green is pretty good for my class, to help you focus attention. Any questions? 00:10:03.200 |
What do I do next? Okay, where was I? So I checked off the value size; I just changed it. Going back: 00:10:15.200 |
vocabulary size. Let's talk about vocabulary size. At the end, we want this model to output a word, 00:10:26.200 |
or a probability distribution across the words. So at the end, remember, 00:10:33.200 |
this is the final output from the encoder: still ten tokens, each token five dimensions, right? 00:10:40.200 |
But suppose there's a vocabulary of 20 words, or in this case, for the original vision transformer, 00:10:48.200 |
a 20-class problem. Then we take the first token, which is the class token, 00:10:54.200 |
so we need to project from five to 20, so now you can visualize this linear projection right here, 00:11:01.200 |
so this is the linear layer, the last thing in your transformer stack, 00:11:06.200 |
and you do a softmax. Let's zoom in a little bit so you can see these values. 00:11:12.200 |
These values could be arbitrary numbers, but we want each to become a number between zero and one for a probability, 00:11:22.200 |
and all these numbers have to add up to one for a probability distribution. That's why we need softmax, 00:11:29.200 |
so this is linear and softmax. As you can see, for Llama 1, 2, 3, 4, 00:11:37.200 |
the vocabulary size goes from 32k, to 32k, to 128k for the multilingual models, to even more, 250k, for the multilingual, multimodal models. 00:11:49.200 |
They want to have more things they can predict. And that progression is mostly reflected in the last layer. 00:11:58.200 |
so instead of 5 to 20, for instance, we want to go from 20 to 30, what do I have to do? 00:12:05.200 |
Where do we have to grow this? Maybe that's a bit too much; let's just grow by 5. What I would do is count 1, 2, 3, 4, 5, 00:12:13.200 |
select 5 rows, and insert. Then I have to fill in these rows, initialized to 0; maybe I just copy all my 0s over here, and these are all the biases, 00:12:26.200 |
initialized to 0, 0, 0, 0. Now let's randomize this a little bit by adding some random ones here. So now I've updated the scores here, and then I also have to update the softmax layer, 00:12:40.200 |
adding the five rows in the middle, and now these equations get updated. So I just increased the vocabulary size from 20 to 25. 00:12:49.200 |
That is where you increase it. But interestingly, as you can tell, the only thing I touched is the very last layer; I didn't have to touch any internals of the transformer stack, okay? 00:13:03.200 |
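A hedged sketch of this output head in numpy: project the class token's 5 features to 20 logits, softmax them, then grow the vocabulary to 25 by appending rows of weights and biases. Only this last layer changes; the sizes and the small random initialization are assumptions matching the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab = 5, 20
h_cls = rng.normal(size=(d_model,))           # class-token output of the encoder

W_out = rng.normal(size=(vocab, d_model))     # linear head: 5 -> 20
b_out = np.zeros(vocab)                       # biases

logits = W_out @ h_cls + b_out
probs = np.exp(logits) / np.exp(logits).sum() # softmax: probs sum to 1

# Growing the vocabulary from 20 to 25 touches only this layer.
extra = 5
W_out = np.vstack([W_out, 0.01 * rng.normal(size=(extra, d_model))])
b_out = np.concatenate([b_out, np.zeros(extra)])
logits = W_out @ h_cls + b_out                # now 25 logits
```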
So if you're looking for this sheet, I can use this link; there's no newsletter subscription requirement, just go there and grab it so you can follow along. All right, so I've checked off vocabulary size now. The embedding dimension is a bit difficult to change, so I'm not going to change it. 00:13:30.200 |
I'll just say: if I had to change it, I'd have to change it here. This is from image patch to token, from 9 to 5. The 9 is because of a 3x3 window here, which gets flattened into a column vector of 9. 00:13:52.200 |
For a language application, you can think of it as going from a word in the vocabulary to an embedding, and we have an embedding space of 5. If I want to add one more dimension, I'd have to add one more row here, 00:14:07.200 |
but the problem is that everything else with dimension 5 has to be extended too. There are a lot of things that would have to change, so I'm not going to do that; I'll just restore it. 00:14:20.200 |
Actually, maybe I do want to do part of it. Let me just do it, but I'm not going to change the whole thing. 00:14:31.200 |
So now I just add one dimension. Where else do I have to change things? This goes to the norm, so this has to change as well. And this is the most important part: with one more embedding dimension, 00:14:44.200 |
I have to move this weight and add another column of weights in order to multiply them correctly. So you can see the impact of adding more embedding dimensions: your weights have to grow in this direction, like here, 00:15:06.200 |
and like here. This is too much change; I don't want to change all of this. Maybe I'll put a gray highlight, just to say that I sort of talked about it, but I'm not really implementing the embedding dimension change in my spreadsheet example. Okay, then, let's talk about grouped query attention, 00:15:31.200 |
or some attention stuff. Let's zoom out a little bit. Right now there are 3 attention heads; what does it take to add another attention head? What I would do is just create some more space here, 00:15:44.200 |
okay, and I could, actually no, let's do this, 00:15:50.200 |
Let's do this: I want to select here, come on, select this, and this will be empty space for me to add some stuff. Okay, so, do I have enough space? Let's just make a little more space, 00:16:17.200 |
so I can get a new head in here. What I can do is just copy this, and copy this. All right, now I have another set of weights. 00:16:29.200 |
In my presentations, I usually like to use red to highlight trainable weights, so these are all trainable parameters. When you hear "7 billion parameter model", you are referring to these 00:16:44.200 |
red-shaded, trainable parameters. The things I do not shade are not trainable parameters; they do not count toward the parameter count, but they still matter in terms of calculation. 00:16:56.200 |
They all need to be calculated, they all take up runtime memory, and you have to figure out how to fit it all into your GPU to optimize. But you can visualize 00:17:06.200 |
the computational complexity versus the parameter size of the model; when we talk about the size of a model, we tend to refer to these parameters. So here, okay, let's try to fix this. Where does it come from? 00:17:18.200 |
I had to fix this, so let me re-implement it. I'm going to zoom in a little bit. What do I do? I want to take the input and do a 00:17:30.200 |
matrix multiplication with this set of weights, and then go up to select the input. The heads all share the same input; they're all multiplying with the output from the previous layer, which is a norm. 00:17:46.200 |
Okay, so this is how I implement it. You can use F2 to examine it; just to show you, I can scroll up to see, okay, this is where it's selected, 00:17:58.200 |
and when I select this one, this is where it's selected: a different set of weights. Because I copied and pasted, all the weights are the same right now, but you would purposely change the values so the heads have two different sets of weights. Okay, so now 00:18:14.200 |
I've just manually implemented a new head, and the rest is the same. You take the first third and move it over to the query, take the second 00:18:26.200 |
third and put it on the key, but I have to transpose it. Maybe I can use my pen to draw the data flow, like here and here, and then you do a Q times K-transposed 00:18:40.200 |
multiplication to get the scaled dot-product attention, and then softmax, softmax, softmax. Okay, it's hard to write with my mouse. So I have three different heads, 00:18:54.200 |
and what do I get? I get more attention-weighted values here. We'll have to bring these three back together again, so let me just move up three spaces 00:19:06.200 |
to create some space here. I'm going to select here to concatenate; what I'm implementing is concatenation. When you concatenate tensors, that's what's happening here. 00:19:16.200 |
If I do that, this whole thing gets taller, so I have to extend my weight matrix sideways by the same length. I'm going to copy three sets of weights here, and once I do that, 00:19:30.200 |
I'll be able to update my matrix multiplication, the linear projection over here, to match. If you read the papers, they usually say W_out, or the down projection, W_down; that refers to this matrix. 00:19:46.200 |
This matrix can be pretty big too, but sometimes, if you divide your attention dimensions the right way, you might not even need it. Suppose instead of five, the embedding is, say, 100, and we have 20 heads, 00:20:12.200 |
and each head has five dimensions. When you concatenate all of that, you get 100 back, and in that case you can skip this dimensionality change with this matrix. Anyway, that's how things are impacted when you add one more head. So, did I add more heads here? Where is it... heads, okay. When we add more heads, that's what happens: 00:20:38.200 |
adding heads from Llama 2 to Llama 3, over here. I think I've implemented this, so I'm going to highlight it, okay, and come back. All right, let me see, I'm just monitoring the chat. Maybe I should take one question before I move on; I want to talk about grouped query attention next. 00:21:04.200 |
Yeah, there's a question in the chat about vocabulary: how much does it cost in vocab size to add native understanding and generation of images and audio? 00:21:13.200 |
Text-only Llama had a 128k vocab, but multimodal Llama had 256k. I wonder if you want to address that before you go on, yeah. 00:21:22.200 |
So, if I zoom out a little bit: the vocabulary size only has an impact on the output here, so you grow this for the output vocabulary, and if you have three modalities, this just gets longer. Then you also need to figure out a way to convert your input into the embedding here, so this gets a lot longer over here too. But internally, once you pick the model size, this is going to be the same. 00:25:54.200 |
We have not significantly introduced a lot more weights. 00:28:17.200 |
So I have an extra one to take care of the bias. 00:28:25.200 |
So surprisingly, this is the only thing I had to change. 00:28:29.200 |
That said, all these weights are going to be different, 00:28:30.200 |
even though I copied and pasted the same weights. 00:28:39.200 |
So you can visualize that the weights just increased by two times. 00:29:06.200 |
I'm going to color this into a different color. 00:29:37.200 |
For instance, I could totally just remove this. 00:29:44.200 |
What makes this work is the fact that they all have a fixed dimension of five. 00:29:48.200 |
That's how you can stack them up very easily. 00:29:51.200 |
There aren't a lot of dimension changes between the blocks, 00:29:55.200 |
but there are dimension changes internally, in the attention layer. 00:29:59.200 |
For instance, you can see one go from five to three, three, four, 00:30:02.200 |
and one of the other heads from five to three, three, three, 00:30:11.200 |
because you'd like them to all come back to the same five dimensions. 00:30:23.200 |
You can see the one attention layer over here. 00:30:26.200 |
Do you need skip connections for the residual stream? 00:30:35.200 |
That's what allows information to flow all the way down. 00:30:46.200 |
So I'm adding the blue thing with this orange, 00:30:49.200 |
the red thing coming all the way from above here. 00:35:18.200 |
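A hedged sketch of the skip connection being drawn here: the output of each block is added to the block's input, so the "red thing from above" flows straight down the stack. The `block` stand-in and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(10, 5))        # 10 tokens, embedding 5

def block(h):
    """Stand-in for norm + attention (or the MLP): any (10, 5) -> (10, 5) map."""
    W = rng.normal(size=(5, 5))
    return np.tanh(h @ W)

h = x + block(x)    # skip connection: block output plus the residual stream
h = h + block(h)    # the same pattern repeats at every block in the stack
```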
where we were. Oops. Too much. Just highlight this. I did this. Now I did this. Okay. All right. 00:35:35.060 |
Oh, another thing is that you want to highly optimize. If you've got a way to fit this 00:35:41.440 |
fixed-dimension block most nicely on the GPU, if you get an option for that, you want to be able to use 00:35:49.800 |
it for all your blocks here. So you optimize only once, and it can be used for the entire stack. Does 00:35:57.860 |
it make sense? I hope so. Oh: if a new model architecture is designed, how is the number 00:36:05.860 |
of layers determined? Trial and error. I think it's just kind of trial and error as well, 00:36:09.720 |
and also how many GPUs you have to train them. Only a few companies in this world have 00:36:16.900 |
gigantic GPU farms; most universities don't. So if you look at it, it's non-trivial. 00:36:24.440 |
I mean, what was I looking at? Layers. Where are the layers? 32 layers. That seems like not that many, 00:36:33.600 |
right? But it's huge. It's a lot of layers to train, and there are 32 steps to propagate through. 00:36:41.100 |
Then you have a huge batch that has to go through for your gradient descent. It's very expensive. 00:36:47.840 |
Okay. But now I want to talk about, maybe this one, the feed-forward network. I haven't spent any time 00:36:54.840 |
talking about the feed-forward network, so let's talk about mixture of experts. I think around this time 00:36:58.920 |
is when MoE became popular. There were some variants in the Llama 3 days that used mixture of experts, 00:37:04.020 |
maybe not obviously, but clearly Llama 4 uses MoE now. That's for sure. So let me just make a copy 00:37:15.020 |
and work on this copy. Okay. MoE: the MoE layer happens at the MLP feed-forward network here. 00:37:26.200 |
Okay. So what does it mean to have two experts? Basically, this is my MLP block, and I just copy it. 00:37:38.920 |
And this is it. Now I have two experts. Well, not quite yet; I have to fix this a little bit. 00:37:44.100 |
This one is a linear projection from the input, the output from the previous layer, 00:37:51.020 |
with this weight. So now I just have to modify this to select the right input. 00:37:54.920 |
It should select the same input... actually, what happened here? 00:37:59.060 |
Now I move this. Let's move. Doesn't let me. I cannot select this. 00:38:09.060 |
All right. Now I'm done, and I have to add these together. 00:38:14.800 |
So I would take this, this, add this, and this. 00:38:19.160 |
The easiest implementation is that I just place a plus and add them. 00:38:25.380 |
All right. Okay. So now I have the most basic mixture of experts, 00:38:31.000 |
with constant, equal weights. Okay. 00:38:35.960 |
I can repeat the same process to do another one, my third expert: 00:38:41.620 |
I'll repeat the same process and I'll update my input to share the same input from before. 00:38:51.860 |
And then I'll come down here. I also just add the new output. 00:38:57.120 |
Now they are equally weighted. So this is the equally weighted scenario. 00:39:01.780 |
And then the question in your mind is: how can we 00:39:06.840 |
have a mechanism so that they are linearly weighted, 00:39:11.680 |
weighted differently? Sometimes this expert gets more weight, 00:39:15.580 |
sometimes that expert gets more weight. 00:39:19.000 |
Also, we have 10 tokens, and maybe for a particular token, we'd like one expert to receive more weight than the others. 00:39:28.560 |
So what we can do is that we could create another network over here. 00:39:32.320 |
I'm going to build another one here, called a router network, 00:39:35.720 |
and we need at least three different sets of weights, so we can calculate one gate value for each expert. 00:39:43.560 |
So to map it out, what I want is three experts, 00:39:55.560 |
so I want a three by ten matrix of gate values. 00:40:10.200 |
Now, how can I get a three by ten matrix from here? 00:40:23.740 |
If I shift everything to the right, I can have some more space to work with. So I'm going to use insert, insert, 00:40:31.180 |
shift cells to the right. Will that work? I don't like this color, let's just remove it. 00:40:38.540 |
Okay, so now I map this out. I want one, two, three, four, five; this is my weight matrix, 00:40:46.140 |
this is how many weights I want, so I can fill in zeros: zero, zero, zero, zero... 00:40:56.860 |
Okay, so these are the weights, just initialized to zero. And if I matrix multiply my weights here 00:41:04.940 |
with my tokens here, all zeros just gives me zeros, so let me put in random weights instead, 00:41:13.740 |
so those are my random weights for computing the gating values. So now I have all this, okay, and then 00:41:20.860 |
I would like to do a softmax, so these are my relative weights, the terms learned by this network. 00:41:29.420 |
Let me just shade this, following the convention, so these are the trainable weights. And to implement 00:41:35.100 |
softmax, I'll do the exponent of this by column, and then divide by the sum of the same 00:41:42.860 |
exponents. So this is my softmax. You can notice this thing: even though it shows .414 here, 00:41:47.900 |
but if I show you more, let's zoom in more, you can see, 00:41:51.660 |
these three numbers will add up to one; that's what softmax gives you. So I'll copy the softmax over here, 00:42:00.620 |
and that's all my softmax. Each column will be the gate values, across like this; this is what the tensor 00:42:05.580 |
looks like for your gate. And once we have that... I just lost track, okay. So I will bring these 00:42:14.300 |
gate values over here. Maybe just create some spaces here; I'll create some spaces here, and now put the 00:42:21.420 |
gate values close by, here. So here is my gate value: the first expert will use this row, 00:42:31.180 |
and per token, that weight is slightly different. My second expert will, 00:42:36.700 |
coming back here, use this row, and my third expert will use this row as its predicted gate value; 00:42:47.500 |
it's all kind of high, and then, let me just create some more space, I need some more space again, 00:42:55.260 |
okay, and then I will do an element-wise multiplication, to take the output from the first... 00:43:02.940 |
one more time: the output from the first expert, multiplied by the gate value, per token, and I'll have the 00:43:12.700 |
gated output for that expert, okay. And then this logic is the same for the other experts; I can just copy the 00:43:20.220 |
formula, so now instead of adding these three rows as is, as an equally weighted combination, 00:43:28.220 |
now I have a gated combination, so I go back to modify my equation here, so I just drag this down here, 00:43:36.860 |
so now I finish my... come on, come on, here, where is my handle, 00:43:48.540 |
it doesn't... oh here, okay, here, one more, one more, 00:43:58.220 |
it doesn't let me... one more, move a little bit, oh here, okay, drag here, down here. Okay, now I've finished. 00:44:04.460 |
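A hedged sketch of the dense mixture of experts built so far: three MLP experts, a router whose softmax produces per-token gate values, and a gated sum of the expert outputs. The toy sizes (10 tokens, embedding 5, hidden 4) and the `make_expert` helper are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_tokens, d_model, d_hidden, n_experts = 10, 5, 4, 3
X = rng.normal(size=(n_tokens, d_model))

def make_expert():
    """One MLP expert: linear -> ReLU -> linear, back to d_model."""
    W1 = rng.normal(size=(d_model, d_hidden))
    W2 = rng.normal(size=(d_hidden, d_model))
    return lambda h: np.maximum(h @ W1, 0) @ W2

experts = [make_expert() for _ in range(n_experts)]

W_gate = rng.normal(size=(d_model, n_experts))  # trainable router weights
logits = X @ W_gate                             # (10, 3) router scores
gates = np.exp(logits)
gates /= gates.sum(axis=-1, keepdims=True)      # softmax: per-token gates sum to 1

# Gated combination instead of the equally weighted sum from before.
out = sum(gates[:, [i]] * experts[i](X) for i in range(n_experts))
```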
And for sparse MoE, you pick, say, the top two: you have a mechanism that just picks the two 00:44:13.180 |
highest values and sets the other ones to zero, for instance: zero, four, one, zero. So you could have an 00:44:20.380 |
equation to do something like this, and then use that as the gate value instead. 00:44:25.820 |
That would be a sparse mixture of experts, but in Excel, I find there's no easy way to do 00:44:31.980 |
this, so I'm not going to bother today. Okay, so now I think I've finished this as well. 00:44:40.700 |
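A hedged sketch of the top-2 sparse gating step that's awkward in Excel: keep the two largest router scores per token, force the rest to zero, and renormalize so the surviving gates still sum to one. Self-contained, with made-up router scores.

```python
import numpy as np

rng = np.random.default_rng(5)
logits = rng.normal(size=(10, 3))                 # router scores: 10 tokens, 3 experts

k = 2
top = np.argsort(logits, axis=-1)[:, -k:]         # indices of the top-2 experts
masked = np.full_like(logits, -np.inf)            # -inf becomes a gate of exactly 0
np.put_along_axis(masked, top, np.take_along_axis(logits, top, axis=-1), axis=-1)
gates = np.exp(masked - masked.max(axis=-1, keepdims=True))
gates /= gates.sum(axis=-1, keepdims=True)        # sparse gates: rows sum to 1
```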
So, did I do mixture of experts? Okay. Anyway, I think I could cover flash attention, 00:44:49.820 |
or just have a conversation right now. "Is there room for improvement within the domain of attention 00:44:59.260 |
mechanisms, NSA being the latest innovation from DeepSeek, or is there a natural boundary, in your opinion?" 00:45:04.460 |
I think NSA is native sparse attention; do you want me to do that? 00:45:11.580 |
"For sparse MoE, do you normalize after selecting the top K?" I guess, probably, if you happen to 00:45:26.140 |
have RMSNorm along the way, then normalization is probably not as necessary, 00:45:35.980 |
because that will learn to normalize your values across your experts, 00:45:42.700 |
and they'll normalize in some way. So there's no theoretical justification 00:45:50.140 |
one way or the other, but I guess it's just, empirically, if you commit to add a normalization 00:45:56.060 |
layer, and you've committed three months to train your model, you've already committed to it; 00:46:02.700 |
it's too late for you to change to something else. But maybe the benefit is marginal, 00:46:07.180 |
maybe there's no benefit, you don't know, but it also doesn't hurt for you to try, um, okay, so, 00:46:18.060 |
So: "Does the input and output dimension for each expert match the model dimension?" 00:46:22.860 |
Yes, although it doesn't have to. For instance, what if I want one of the experts to 00:46:32.620 |
output more? How can we have, say, expert three output a longer token? We could add another 00:46:42.620 |
row of weights here. Let me show you: I just move down two spaces, and maybe add two 00:46:49.740 |
sets of weights. This is similar to adding two nodes in this MLP right here, 00:46:57.100 |
so now, all of a sudden, I have seven instead of four. I update my ReLU here; the ReLU is, I think, 00:47:04.300 |
automatically updated here. Okay, so I've got seven, now I have seven. The problem is that now you 00:47:10.140 |
cannot add them together. So what do we do? Eventually, most likely, you have to 00:47:17.580 |
project this down to five anyway, to be able to add them together. Or you can concatenate, but 00:47:24.540 |
with concatenation, eventually you still have to project it to a set dimension to match, 00:47:31.340 |
to be able to work with the other layers. They're kind of Lego pieces: all these things 00:47:38.140 |
have to have the same dimension for them to stack up together; you cannot have arbitrary ones. But in 00:47:43.420 |
theory, you can, why not: you can just add a particular layer over here, and then 00:47:49.500 |
add another linear projection to project back. So what I'll do is count seven, one, two, three, four, five, six, seven, by five; this is how much I need. 00:48:00.700 |
Then I need to change... it's kind of ugly now; let's borrow the space over here, 00:48:07.740 |
and then, if I do a matrix multiplication, we're back to five dimensions per token, 00:48:18.220 |
and when I go over here and fix it, move my green one down here, all of a sudden, the equation 00:48:26.460 |
works again. Okay, so now it's fixed here; let me just select it, okay, here, okay. 00:48:37.580 |
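A hedged sketch of the point just made: an expert can use a wider hidden layer (seven instead of four here), as long as a final linear projection brings its output back to the shared five dimensions so it can be added to the other experts. The sizes are the demo's; the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n_tokens, d_model, d_wide = 10, 5, 7
X = rng.normal(size=(n_tokens, d_model))

W1 = rng.normal(size=(d_model, d_wide))       # widened hidden layer: 5 -> 7
W_proj = rng.normal(size=(d_wide, d_model))   # extra projection: 7 -> back to 5

expert3_out = np.maximum(X @ W1, 0) @ W_proj  # (10, 5): stacks like a Lego piece again
```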
What else? Should I talk about NSA? We have 10 minutes. Do we want to hear about flash attention, or NSA, 00:48:52.220 |
or any of these over here: RoPE, RMSNorm, flash attention? 00:48:59.740 |
NSA, native sparse attention, so, sparse attention. You can see that this is dense attention: 00:49:08.460 |
we derive the entire query over here, okay. So, my understanding of sparse 00:49:18.380 |
attention is... let me just move a copy here, create a copy here. So, this is my 00:49:30.780 |
attempt to explain NSA, maybe, based on my memory, I hope I'm doing it correctly, so what we can do, 00:49:39.420 |
is this: now I have 10 tokens. What if I take three tokens here, okay, three tokens here, and try to merge them 00:49:47.980 |
into one token, so it becomes more sparse? It sounds like it makes sense. Maybe this is too big; 00:49:54.780 |
let's use a smaller one, maybe here. I want to take these three tokens and somehow 00:50:02.220 |
condense all the information I need into one. Let's just draw some space here; this is my goal: 00:50:08.300 |
I want to take these three tokens, let's highlight them green, into this blue one, okay. 00:50:16.620 |
so what do we need to do, so first, we need to flatten this, so we have nine elements here, and from nine 00:50:23.020 |
to three. How do we go from nine to three? We need a linear projection here, so that weight is three by 00:50:28.300 |
nine. Let me make sure it's right here. I'm lazy enough right now that I can use RANDARRAY, and we get 00:50:35.100 |
three rows and nine columns, so I immediately get the matrix I want, 00:50:41.740 |
and then I will just do a matrix multiplication. Matrix multiplication... come on. 00:50:49.260 |
You have to do this, and take this, okay. And then I take this... I cannot do this yet; what I have 00:50:57.580 |
to do is convert it into a column. It's too small, 00:51:10.140 |
I cannot see it, but I'll just stretch it out. Okay, so then I get this; now I'm done, 00:51:17.740 |
okay. So let's just review this, all right. And if I repeat the same 00:51:23.660 |
thing for these three, so I have this one, and I select the same weight, they share the same weights, 00:51:32.460 |
okay. And then the last one, another block here: I repeat this, copy it over here, 00:51:39.500 |
and just select the same weights. Now I'm working with only three queries, do you see this? Only 00:51:48.220 |
three query vectors. So then, for the attention weight matrix here, let's re-implement it. I'm going to 00:51:55.740 |
create some space for me, ah, come on, this is ugly, but, well, how do I fix it real quick, 00:52:04.220 |
let me just erase it; sometimes it's trying to be smart, trying to figure out a format for me. 00:52:13.820 |
Okay, so now I just redo this: I have my first query, and then my second 00:52:24.060 |
query, and my third query. And then we do the same thing for my keys; I think copy and 00:52:32.060 |
paste might just work. Let me do it... is it right? Oh yeah, actually it works. And then you 00:52:37.740 |
can notice that you have a different set of weights, okay, and now I can do the same thing here, I want to 00:52:51.660 |
and then one, two, three, and actually transpose another key, 00:52:57.580 |
and transpose another key here, all right, and then, all of a sudden, 00:53:06.380 |
do you see the attention weight matrix? The dot products I need to do are a lot fewer; it's only here. 00:53:16.540 |
I'm going to take a comma and select... let me redo it one more time: matrix multiplication, select my 00:53:27.660 |
key, and select my query here, okay. And then usually I also divide by the square root of the dimension of the key, which is three; that's typical. All right, so this is what we have, and then we take the softmax here. 00:53:34.540 |
I can copy the format here so it looks pretty. Okay, so now I can zoom out, and you can see the difference in terms of the computational complexity. The original is not sparse; you can see 00:53:50.540 |
here: instead of all this, now I have just this, okay. And then the "native" part: the native part is that this compression is part of the training. 00:54:12.540 |
So I'm going to do this shading action, just for emphasis, to show you: this is what "native" refers to, training this as part of your sparse attention mechanism, okay. 00:54:36.540 |
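A hedged, from-memory sketch (matching the speaker's own caveat) of the compression just built: each block of three tokens is flattened to nine numbers and projected down to one three-dimensional token by a shared, trainable three-by-nine weight, and attention then runs over the compressed tokens. Real NSA differs in details; this follows the demo, where the queries are compressed too.

```python
import numpy as np

rng = np.random.default_rng(7)
d, block = 3, 3
tokens = rng.normal(size=(9, d))           # 9 tokens, 3 dimensions each

W_cmp = rng.normal(size=(block * d, d))    # trainable compression: 9 -> 3
compressed = np.stack([
    tokens[i:i + block].reshape(-1) @ W_cmp    # flatten a block, project down
    for i in range(0, len(tokens), block)
])                                         # (3, d): one token per block

W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Q, K = compressed @ W_q, compressed @ W_k  # only three queries and keys now
scores = Q @ K.T / np.sqrt(d)              # 3x3 instead of 9x9 dot products
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)         # softmax over far fewer entries
```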
Now let's look at a sparse attention mechanism without the learned part: how does that work? How can I convert this one? Maybe just do a copy... actually, maybe not, let's just change it. So I have this matrix multiplication for the whole thing, and what I can do is just this: 00:54:50.540 |
in this case, each query is only going to compare with the keys in its neighborhood. So I can repeat this, copy and paste over here, and it slides; 00:55:06.540 |
select here, and here, okay, and then repeat this again, but I'll make sure I select the right keys. Maybe this time I just do four of them, since ten isn't a multiple of three; the last one gets four. Okay, so now it's a lot sparser. You see, this is sparse, and this is dense, 00:55:27.540 |
sparse versus dense. But the difference is that here I just said, okay, match with your neighbors; there are no extra 00:55:35.540 |
learnable parameters involved, no extra network you learn to do this. In the other case, it's native sparse attention, and it somehow works. 00:55:52.540 |
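A hedged sketch of this non-learned alternative: each query attends only to the keys in its own block of three neighbors (the stride-three, no-overlap pattern a listener notes just below), with no extra trainable weights. Sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 3
tokens = rng.normal(size=(9, d))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Q, K = tokens @ W_q, tokens @ W_k

blocks = []
for start in range(0, len(tokens), 3):        # stride 3, no overlap
    q, k = Q[start:start + 3], K[start:start + 3]
    s = q @ k.T / np.sqrt(d)                  # 3x3 local scores per block
    a = np.exp(s)
    a /= a.sum(axis=-1, keepdims=True)        # softmax within the neighborhood
    blocks.append(a)                          # far fewer dot products than dense
```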
Well, you can visually see the efficiency, in terms of the computational complexity: just fewer matrix products to compute. But you have to give DeepSeek credit, because theoretically it sounds fine, but you're talking about tens of millions of dollars to even just experiment, to see whether it works. What if it didn't work? It could be the case that they tried things that didn't work, but they tried anyway, 00:56:14.540 |
and they were lucky this one worked, so they wrote a report about it. But I bet they probably tried five or six other things that didn't work, and they never talked about those. 00:56:21.540 |
"So, this is like a convolution with stride three, with no overlap?" That's correct. 00:56:35.540 |
I think we're running out of time, two minutes. 00:56:39.540 |
I enjoyed it. Although... I can show you my practice sheet. 00:56:50.540 |
Real quick: if I stop sharing and share again... this is my internal practice sheet, not as pretty, but if you are wondering, these are the things I did not talk about today. 00:57:17.540 |
So this is going from LayerNorm to RMSNorm, and then what else did I talk about... oh, here, so this is RoPE. It's kind of complicated, yeah, this is RoPE. 00:57:50.540 |
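A hedged aside, since RMSNorm is only named in passing here: LayerNorm subtracts the mean and divides by the standard deviation, while RMSNorm skips the mean-centering and just divides by the root mean square, with a learned gain. A minimal sketch:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    """RMSNorm: scale by the root mean square, no mean subtraction."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps) * g

x = np.array([1.0, -2.0, 3.0, 0.5, -1.5])     # one 5-dimensional token
y = rms_norm(x, g=np.ones_like(x))            # g is the trainable gain
```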
But I think I can tell you the high-level stuff. You can see that I put in this block of actual computation. 00:57:58.540 |
Number one, it's not shaded, so it's not trainable; it's all pre-computed. All the rotation matrices can be pre-computed. 00:58:05.540 |
Number two, it's really close to the attention head right here, whereas the original position encoding is only injected right in the beginning. 00:58:15.540 |
All right, and then you hope all these skip connections bring the information down. 00:58:20.540 |
Now you have 32 layers to do so, and you're lucky if the position encoding has any impact down the stack. 00:58:27.540 |
But with RoPE, you add this computation at each head, at each head level, in every single stack, getting the position really close to where the attention matters. 00:58:38.540 |
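A hedged sketch of RoPE as described: pre-computed, non-trainable rotations applied to the query and key vectors at each head, instead of a position encoding added once at the bottom of the stack. Pairs of dimensions are rotated by a position-dependent angle; the base and sizes follow the common convention and are used here as assumptions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate pairs of dims of x (n_tokens, even d) by position-based angles."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                 # token positions 0..n-1
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    ang = pos * freqs                           # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)         # all pre-computable, not trained
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(9)
Q = rng.normal(size=(10, 4))                    # one head's queries
Q_rot = rope(Q)                                 # applied per head, at every layer
```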
So that is probably the takeaway, that being able to visualize the difference between the two. 00:58:46.540 |
There's a way to visualize this: RoPE is right here, 00:58:50.540 |
whereas the original position encoding, right, is at the top. 00:58:55.540 |
Okay. So hopefully I still talked about everything, not at the same level of depth, a bit unstructured, and hopefully you guys had fun. 00:59:11.540 |
It's always good to see a different perspective on how this stuff works. 00:59:14.540 |
I'm sure people learned a lot from you on that. 00:59:17.540 |
Yeah, this reminds me a lot of like Ishan's walkthrough, but like actually, higher level, like you did a lot of work in like reducing the dimension so that we can actually hold it in our heads, which I think is very important. 00:59:33.540 |
And if you get the three-dimensional version right, then in the future, I can see you just vibe coding 00:59:39.540 |
with a model to expand this to a higher-dimensional space, with the same underlying math. 00:59:46.540 |
The math is easier to work out in the lower-dimensional space. 00:59:52.540 |
Well, you know, just to be respectful of your time. 00:59:55.540 |
If people want more, you know, where's the best place to find the rest of your work? 01:00:03.540 |
Honestly, I don't often get to talk to an audience like this, 01:00:09.540 |
so I was curious how it would be received, because with my own students, I cannot go at this pace. 01:00:21.540 |
I got to talk about one semester's worth of stuff in one hour, 01:00:26.540 |
and I expected that you could follow most of it or all of it. 01:00:31.540 |
So I enjoyed it; thank you for giving me the opportunity to go nerd out. 01:00:39.540 |
I mean, we, we've been covering a paper here every week for the last two years. 01:00:43.540 |
So, um, there's, there's been a lot of interesting, uh, lectures and papers and, um, yeah, definitely 01:00:52.540 |
Some of this is new, and a new perspective on the same thing is actually always useful. 01:00:55.540 |
So did the Excel online version keep up fast enough for you to follow? 01:01:03.540 |
I'm curious from your side of the experience. 01:01:10.540 |
And you were able to also check the equations for me live, and yeah. 01:01:21.540 |
So I spent quite a lot of time to think about how I should cover this. 01:01:25.540 |
And I'm glad most of you stayed, and I hope to do this again sometime. 01:01:34.540 |
The live spreadsheet was very, very useful. 01:01:38.540 |
I feel like I still have a lot to dig back into, but yeah, the formulas showing what 01:01:44.540 |
multi-headed attention is transposing and stuff. 01:01:51.540 |
Thank you for the invitation for this, this opportunity. 01:01:53.540 |
I hope that I'll have opportunity to come back. 01:01:55.540 |
Because then I can geek out with you. 01:02:00.540 |
Um, I think for next week, I'm probably going to invite someone from Prime Intellect to cover 01:02:23.540 |
You know, I think the really interesting 01:02:26.540 |
thing is: does RL and long chain-of-thought training actually introduce 01:02:35.540 |
new training paradigms where the hardware requirements are actually 01:02:40.540 |
different, and they actually don't benefit from the normal centralization factors? 01:02:53.540 |
Uh, so Sean, is it possible that you can share with me the chat history?