Back to Index

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer


Chapters

0:00 Introduction
2:09 Transformer vs Mistral
5:35 Mistral 7B vs Mistral 8x7B
8:25 Sliding Window Attention
33:44 KV-Cache with Rolling Buffer Cache
49:27 Pre-Fill and Chunking
57:00 Sparse Mixture of Experts (SMoE)
64:22 Model Sharding
66:14 Pipeline Parallelism
71:11 xformers (block attention)
84:07 Conclusion

Transcript

Hello guys, welcome back to my channel. Today we are gonna talk about Mistral. As you know, Mistral is a new language model that came out a few months ago from Mistral AI, which is one of the hottest startups in Europe right now for language models; it also became a unicorn recently. We will be exploring both of the models they released: one is the 7 billion and one is the 8x7 billion model. So let's review the topics of today. The first thing I will introduce is the architectural differences between the vanilla transformer and the architecture of Mistral. Later we will see what the sliding window attention is and how it is related to the concept of the receptive field, a concept that we usually find in convolutional neural networks. I will briefly review the KV cache, because I want to introduce the concept of the rolling buffer cache and also how it is done with pre-filling and chunking. We will see what a sparse mixture of experts is, then model sharding, with a very brief introduction to pipeline parallelism. Last but not least, we will also go through the code of Mistral, because there is a lot of innovation in the code, especially where they use the xformers library with block attention, so I want to guide you into understanding the code, because it can be really hard for beginners to find their way around it. There are some topics that are related to Mistral but will not be covered in this video because I already covered them in my previous video on Llama: in particular, I will not be talking about RMS normalization, rotary positional encodings and grouped query attention, because I already teach them in depth in my previous video on Llama. So if you want to know about them, please watch my previous video on Llama. The only prerequisite I hope you have before watching this video, because the topics we are going to touch are quite advanced, is that you are familiar with the transformer model. So if you are not familiar with the transformer model and the attention mechanism, in particular the self-attention mechanism, please go watch my video on the transformer, in which I teach all these concepts very thoroughly and in detail. They are really a prerequisite for watching this video, because the topics here are quite advanced. Okay, let's proceed further. Let's look at the differences between the vanilla transformer and Mistral at the architecture level, as you can see from the image here, which I built myself using the code, because they didn't release any architecture picture in the paper. First of all, let's talk about some terminology. When you have a model like this, made up of many encoder layers plus a linear layer and a softmax, we are talking about a decoder-only model, because this model here looks like the decoder of the vanilla transformer you can see here, except for the cross-attention, because as you can see there is no cross-attention. When we have a model without the linear and the softmax, we call it an encoder-only model. For example, BERT is an encoder-only model: BERT has some heads at the end, which are one or more linear layers depending on the application, but by itself BERT doesn't need a head, because it can be used for multiple downstream tasks. It's called an encoder-only model because it resembles the encoder side of the transformer, where, as you can see, there is no linear and softmax. So Mistral is a decoder-only model, and it's very
similar, if not equal, to Llama. The differences between Llama and Mistral are highlighted here in red. The first difference is that in the self-attention we use the sliding window attention, and we still use grouped query attention and the KV cache for inference, but it is a rolling buffer KV cache, which is actually related to the fact that we are using sliding window attention; later we will see all these concepts. Another difference is the feed-forward layer: instead of using the ReLU function that we use in the vanilla transformer, or the SwiGLU function that we use in Llama, here in Mistral we use the SiLU function. And the feed-forward block is a single one in the case of Mistral 7B, the first model they released, while it can be eight feed-forward networks in parallel with each other, which are the experts of the mixture of experts, in the case of Mistral 8x7B. We will see later how that works. For now you just need to understand that Mistral is made up of the input, which is converted into embeddings, then this block, which is repeated N times; we will see that in the case of Mistral it is repeated 32 times, one after another, such that the output of each layer is fed to the next layer as input, and the output of the last layer is then sent to this RMS norm, to the linear layer and to the softmax to produce the output of the model. This is exactly the same as what we do with any other transformer model: usually we have many of these blocks here.

Now, in the code of Mistral this part here is known as the transformer block, but it's also known as the encoder block or the decoder block depending on the contents of the block. I will refer to it as an encoder block because, if you look at it, it looks exactly like the block on the encoder side: it has a multi-head attention, an Add & Norm, a feed-forward and another Add & Norm. The only difference is that the normalization here comes before the feed-forward block and the self-attention. Okay, let's move forward. Now let's compare the two models: one is Mistral 7B and one is Mistral 8x7B. The parameter dim indicates the dimensions of the embedding vector. So how big is the embedding vector?

Each token is represented by an embedding vector of size 4096 dimensions. We have 32 encoder layers, so this block here is repeated 32 times. The head dimension indicates, as you remember from multi-head attention, that each head watches the entire sentence but only a part of the embedding of each token; this parameter is how many dimensions each head will attend to in the multi-head attention. The hidden dimension here indicates the hidden dimension of the feed-forward layer: in the feed-forward layer we have two linear layers, one that converts the dimension of the embedding vector into the hidden size, then another one that converts the hidden size back into the embedding vector dimensions. In the case of Mistral they are using 14336 as the hidden size; usually this is a multiple of the dimension, and here it looks like it's 3.5 times the dimension. The number of attention heads for the query is 32, while the number of heads for the K and V, so the keys and values, is 8, and they are not equal because of grouped query attention. If you remember from my previous video on Llama, in which we talked about grouped query attention: in the simplest case we have multi-query attention, which means that only the queries have multiple heads while the keys and values don't, so you may have eight heads for the query and only one head for the K and V. Grouped query attention means that each group of query heads has one attention head for the K and V, so in this case every four query heads share one attention head for the keys and values. If this concept is not clear, I describe it very thoroughly in my previous video on Llama. The window size is the size of the sliding window that we use in the calculation of the attention, and we will see later how it works.

The context length is the context size the model was trained on, and it's much bigger for the 8x7B. The vocabulary size is the same for both, and then the last two parameters you can see here are related to the sparse mixture of experts, which we will see later: for now just remember that we have eight experts and that for each token we use two of them. Later I will clarify how it works. Let's proceed further.
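To keep these numbers straight, here is a rough side-by-side of the two configurations as plain Python dictionaries. The dim, layer count, hidden size, head counts and expert counts are the ones discussed above; the head dimension, window size, context length and vocabulary size are values I recall from the released params files, so treat them as assumptions and check the official configs.

```python
# Rough comparison of the two configurations. Values marked "assumed" are not
# stated in this transcript and should be checked against the official params.json.
mistral_7b = {
    "dim": 4096,          # size of the embedding vector of each token
    "n_layers": 32,       # number of stacked encoder (transformer) blocks
    "head_dim": 128,      # dimensions each attention head attends to (assumed)
    "hidden_dim": 14336,  # inner size of the feed-forward layer (about 3.5x dim)
    "n_heads": 32,        # attention heads for the queries
    "n_kv_heads": 8,      # heads for keys/values (grouped query attention)
    "sliding_window": 4096,  # sliding window size (assumed)
    "context_len": 8192,     # training context length (assumed)
    "vocab_size": 32000,     # same vocabulary for both models (assumed)
}

mixtral_8x7b = {
    **mistral_7b,
    "context_len": 32768,  # much bigger context for the 8x7B model (assumed)
    "moe": {
        "num_experts": 8,          # eight feed-forward "experts" per layer
        "num_experts_per_tok": 2,  # only two of them are used for each token
    },
}
```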

So let's talk about the sliding window attention. But before that, I need to review a little bit of the self-attention mechanism. So what is self-attention? Self-attention is a mechanism that allows the model to relate tokens to each other, so tokens that are in the same sentence are related to each other through the self-attention mechanism. This is why it's called self-attention: each token is watching other tokens of the same sentence, and it basically means that the query, key and values are the same matrix. So imagine we have the following sentence: "the cat is on a chair". We have our query, which is a matrix made up of six tokens, each token represented by 4096 dimensions, which is the dim parameter that we saw before. This is multiplied by the transpose of the keys, which is 4096 by 6, but it's just the query matrix transposed, because the query, key and values are the same matrix in the case of self-attention. This will produce a matrix that is 6 by 6, because the inner two dimensions cancel out and the outer dimensions indicate the dimensions of the output matrix. Now, what are the values in this matrix representing?

The first value here indicates the dot product of the first token with the first The first row of the query with the first column of the keys So basically the dot product of the embedding of the first token with itself The second value here indicates the dot product of the first row of the query matrix With the second column of the key matrix here the transpose of the keys matrix here Which basically means that it's the dot product of the embedding of the first token So the with the embedding of the second token, which is cat and etc, etc for all the other values Don't concentrate too much on the values because all the values I put here are random And also the fact that these numbers are less than one It's not necessary because the dot product can be bigger than one.

It's not a condition of the dot product. Usually in the formula we also normalize: we divide by the square root of d_k, where d_k is basically the size of the part of the embedding to which this particular attention head attends. But let's pretend that we only have one head, so d_k is equal to d_model.
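For reference, the operation being described is the standard scaled dot-product attention from the original Transformer paper, with the scores divided by the square root of d_k before the softmax:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
\]

where d_k is the portion of the embedding that a single head attends to.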

So basically this head will watch the full embedding of each token. Okay, usually we train autoregressive models, and a language model is an autoregressive model: it means that the next token depends only on the previous tokens, and this is why we apply a causal mask.

A causal mask basically means that in the attention mechanism we don't want to relate a word with future words, so words that come after it, but only with the words that come before it. For example, we don't want the word "the" to be related to the word "cat", because the word "cat" comes after the word "the"; but on the other hand, we want the word "cat" to be related to the word "the", because it comes before it. For this reason we apply this causal mask, and it works because the attention mechanism uses the softmax function we can see here.

The softmax function basically will transform all this minus infinity into zero because the formula of the softmax has at numerator an e to the power of X and When X goes to minus infinity e to the power of minus infinity will go to zero So this is why we apply a mask in which we put all the values that we don't want, all the interactions that we don't Want between tokens.

We just mask them out by replacing them with minus infinity So that when we apply the softmax, the softmax will take care of transforming them into zeros Okay, also the softmax will do another thing because it will not only convert this minus infinities to zero But it will also modify the other value for each row such that they sum up to one So as you can see now these values here They don't sum up to one for each row, right?

Because this is 0.2, 0.1 and 0.2 They don't sum up to one but the softmax will convert the minus infinities into zero and the remaining values for each row such that they sum up to one Now let's talk about sliding window attention So we applied the causal mask to hide the interactions between The words, a word and all the future words, but with the sliding window attention We also don't want the word to watch Other words that are outside its local context What do I mean by this?

In the previous case, when we only applied the causal mask, the word "chair" for example was related to all the previous tokens: as you can see, the token "chair" here is related to itself but also to "a", "on", "is", "cat" and "the", so it could watch basically the whole sentence. But in the case of sliding window attention we don't want the word "chair" to watch words that are further than the sliding window size from itself. The sliding window size in this case is three, so tokens whose distance is more than three from the word we are considering are masked: the word "chair" should not be related to the word "is", because the distance is four, and the word "a" should not be related to the word "cat", because the distance is four. And of course we still want the mask to be causal, because we don't want each token to watch future words, since we are training an autoregressive model. So the sliding window attention basically reduces the number of dot products that we are performing, and this will improve the performance during training and inference, because as you can see, when we only apply the causal mask we are performing all these dot products you see here, but with the sliding window attention we are performing fewer dot products, because all the others will be masked out. Sliding window attention, however, may lead to a degradation of the performance of the model, because as you can see here the word "chair" and the word "the" are not related to each other anymore, right?
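As a concrete illustration of the mask being described here, below is a small sketch (a toy example written for this explanation, not the Mistral code) that builds the combined causal plus sliding-window mask for the six-token sentence with a window of three and applies the softmax to it.

```python
import torch

tokens = ["the", "cat", "is", "on", "a", "chair"]
seq_len, window = len(tokens), 3

# mask[i, j] = 0 if token i may attend to token j, -inf otherwise.
# Causal: j <= i.  Sliding window: i - j < window, so each token sees at most
# the previous `window` tokens, itself included.
mask = torch.full((seq_len, seq_len), float("-inf"))
for i in range(seq_len):
    for j in range(max(0, i - window + 1), i + 1):
        mask[i, j] = 0.0

scores = torch.randn(seq_len, seq_len)          # stand-in for Q @ K.T / sqrt(d_k)
weights = torch.softmax(scores + mask, dim=-1)  # masked positions become exactly 0
print(weights)  # each row sums to 1; "chair" only attends to "on", "a", "chair"
```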

So the information Will not be conveyed from the word the and the word chair the word chair will only be related to other tokens that are belonging to the local context of this particular token so only the Tokens that are in the same in the inside this is sliding window This may be if this window is too small it may reduce the performance of the model But it may also be beneficial because for example imagine you are reading a book you don't care about Relating the word in chapter 5 with the words in chapter 1 because most of the books They could be talking about totally different things and you don't even care about relating these two tokens But for sure you want to relate the tokens In the chapter 5 with other tokens in the chapter 5 because the local context matters But I want to introduce you the concept of receptive field because when we use sliding window attention Even if the word chair and the are not related to each other actually Because in Mistral and in all transformer models we use multiple layers of encoders We will see that the information so the the word the chair and the the will still be kind of related to To each other not directly, but indirectly In a concept that is very similar to the receptive field of the convolutional neural networks So let's talk about the receptive field as you remember in convolutional neural networks We have a mask a kernel that we run through an image.

So imagine this is our original image This one here and we run a mask that is a kernel that is 3x3 this one here That when we run a kernel it will produce an output feature. So for example this feature here This is the output produced by applying the kernel to the first 3x3 grid here This value here the second value here in yellow It will be produced when we will move our kernel to the next group of 3x3 pixels.

So let me draw Let's use the pen. So this value here will be produced when we will move our kernel in this grid here and This value here is also an output feature of a convolutional kernel that is a 3x3 Applied to this layer 2. So this is a 3x3 kernel that is applied to this layer 2 So apparently there is no connection between this one this pixel here and this one But because he this this output feature depends on a kernel applied in this grid and this grid Includes this feature here which depends on this pixel here.

We can safely say that this feature here Depends indirectly also on this feature here Even if they are not directly related to each other and this is the concept of the receptive field. So basically One feature of the convolutional neural networks can watch a much bigger receptive field down upward in the layers because of this Sequential application of kernels in the convolutional kernels Let's see how this concept is related to the sliding window attention now now After we apply the softmax to the mask that we have seen before as I told you before all the minus infinities are Converted into zero and all the other values are changed in such a way that they sum up to one So let's go back as you remember here.

We have the minus infinities here here here here here and here So now we apply the softmax and it will become zeros zeros here. Also, let me Okay, all the zeros here all the zeros here and all the other values are changed in such a way that they sum up to One what is the next operation that we do in the self-attention?

We then take the output of the softmax and multiply it by the V matrix. So let's do it The V matrix is basically the same as the initial sequence because I told you this is self-attention So the query key and values are the same matrix So this means let's analyze what happens by hand when we do this multiplication So let me change to the pen.

Okay, the V matrix here is is a sequence of tokens where each token is a vector represented by 4096 dimensions so we can say that it's the output of the self-attention if you watch the Dimensions of these two matrices. So it's a 6 by 6 and the 6 by 4096 the output will be another matrix that is 6 by 4096 so it will have the same dimension as the V matrix and also as the Q and the Query matrix because they have the same dimensions.

So it will be six tokens as output Okay, let's analyze. What is the first dimension of the output to this one here? So this first value of the output so the value on the row 1, column 1 of the output matrix will be the dot Product of the first row of this matrix here.

So this row here With the first column of this matrix, so the first column we can see here and As you can see most of the values here are 0 which means that all the rows from the 1 to 5 Sorry from 2 to 6 will not be used but only the first row here fully the values of the first row will be used because if you remember the dot product is the first dimension with the first dimension of this column and The second dimension of this row with the second dimension of this column The third dimension of this row with the third dimension of this column and then we sum up all these values So this first value of the output will only depend on the first token of the V matrix You can see here.

Let's check the second one the second dimension of the the first dimension of the second row of the output matrix will be the dot product of the first row of this matrix here With the first column of the V matrix but most of the values are 0 which means that this dimension here and all the dimensions in this row will depend only on the first two tokens of the V matrix and We can say the same for the third.

Let's analyze the sixth one here So the first dimension of the sixth row of the output matrix So this value here comes from the dot product of this row And the first column of the V matrix but most of the values at the beginning are 0 which means that it will only depend on the 4, 5 and 6th token of the V matrix and so will be all the dimensions here because in each column Whatever the column we use from the V matrix the first values will always be multiplied by 0, 0, 0 So it will only use the values in these three rows here So we can safely say that the 6th token of the output matrix of this self-attention mechanism will be a vector that will only depend on the last three tokens of the V matrix and Because we are talking about self-attention the V matrix is equal to query matrix So we can say that the output of the self-attention is a matrix that has the same shape as the input sequence But where each token now captures some more information about other tokens Which tokens depending on the mask we have applied So our mask says that the first token can only watch itself So the first output token will be an embedding that will only depend on itself The second token will only depend on the first two tokens The third output token will only depend on the first three tokens The fourth will depend on the token number two because the first token is not used The token number two, the token number three and the token number four, etc, etc Until the last here The last token will depend only on the last three tokens because the first three tokens are masked out And this is the importance of the mask that we apply in the self-attention mechanism This concept that I show you now is very important to understand the rest of the video So please if you didn't understand it, you can take a little pause You can try to do it by yourself because it's really important that you understand How the self-attention mechanism works with the mask Okay, now that we have seen this concept I want to introduce you to the next one So as we saw before the output of the self-attention mechanism is another Matrix with the same shape as the query matrix in which each token is represented by an embedding of size 4096 but each embedding now captures information also about other tokens According to the mask and if we check the this mask here, so the output here We can safely say that the input of our sliding window attention was the initial Sequence dcat is on a chair But after applying the self-attention the first token is now related to itself The second token is related to itself and the token before it The third is related to the token before it and the one also before it The last one only depends on the previous two tokens, etc.

According to the mask, right? Now what happens if we feed this one because as you know in the transformer world and also in Mistral and also in Lama we have many layers of encoders one after another which are also called the transformer block in the code and The output of each layer is fed to the next one.

So this is the first layer of the transformer So we take the input sequence and we feed it to the first layer which will produce a list of tokens where each token now captures Information about other tokens, but this will become the input of the next layer where we it will produce an output This output I will prove you that will capture information about even more tokens Even if the sliding window attention says that they should only be able to watch the previous two tokens Because the sliding window size we chose three as a sliding window size I want to prove it.

So Imagine this is the output of the first layer So it's a list of tokens that capture information about other tokens and it's the the matrix that we built in the previous slide Let's use it as an input for another layer of the encoder. So we multiply the query We multiply the query and the transposed of the keys which will produce a matrix like this one in which each token is not only One token, but it's capturing already information about multiple tokens, right according to the mask So I'm taking this one and this one will become query key and values So if we multiply the query by the key, it will return a matrix like this So the first token only depends on itself.

The second one depends on himself and the previous one So the embedding of this token captures information about two tokens and the embedding of this token capture information about three tokens, etc Let's try to do the multiplication again so We have that our V matrix is again a list of tokens and the output Will also be a list of tokens, but each one will capture information about other tokens Okay, let's analyze the dot product here So the first value of the first row So the first dimension of the first row of the output matrix will be the dot product of the first row of this Matrix here.

So this row here with the first column of this matrix here, so this column here. But because of the causal mask with the sliding window attention that we can see here, the output will only depend on the first row of the V matrix; and because the V matrix is made of these tokens here, it will only depend on the word "the". So as we can see here, the first output of the second layer only depends on the word "the", and so will this second one.

So let's check the fourth token. For example here this one Let's check this fourth token here So this value here will be the product of the fourth row of this matrix dot product of This row with the first column of the V matrix. So this column here But the first token will not be used because it's we are multiplying it with zero whatever value we have here We will not be using it we are using the second token the third token and the fourth token and Each token actually they are aggregating this This values here.

This token here is already Aggregating the value of two tokens, which is D and cat. So this embedding here is already about talking about D and cat and this token here is talking about is aggregating the information about the D cat and is so D cat and is and the fourth token is Aggregating the information of the cat is on so cat is and On because as we saw before the fourth token here cat is on which is the result of the previous self-attention that we done So this output value here will depend on three tokens that already include information about other tokens So this value here will aggregate Information about the union of all these tokens So it will for sure depend on the word D because it's included in the second token.

We are multiplying it with It for sure will include information about the word the cat because it's included in this token as well for sure It will include information about is because it's included in the second value We are multiplying it with and for sure it will include about the token on because it's present in the the fourth token of the V matrix for with which we are Multiplying it because this value is not zero So as you can see after applying another layer of the encoder The fourth token now includes another token in its information before it was only including these three tokens but now it also depends on a new token which is the word V and We can prove the same for the fifth token and the sixth token so at every application of the encoder layer one after another we keep increasing the number of tokens that get Accumulated in these dot products and I made a notebook in Python to visualize this So if you look at my github repository You will see this Notebook called the sliding window attention in which I help you visualize this process and I also share the code on how I do this Self-attention basically I represent each token as a set so each Each token instead of being represented as an embedding as a set of all the words upon which that token depends depends then I apply the cell sliding window attention, which basically means that I take the two tokens that from the sequence and I Accumulate I make the union of the two sets they contain Because I am multiplying two vectors that already include information about multiple tokens So what is the output is the union of the two sets when I multiply by V I do the same thing and I can visualize it So after we apply the first layer, we will see that the input of the first layer is just our normal sequence So the cat is on a chair the output of the first layer will be another sequence There in which each position includes information about multiple tokens Depending on the mask that we have applied and I also show the mask that we apply After we apply the second layer, we can see that the information increases So this last token now is not watching only the previous three tokens, but the previous four tokens Sorry, not only the previous two tokens, but the previous four tokens so every step we do with the sliding window size of three we include two tokens at every layer and Here I Show it for five layers, but it's not necessary because after a while the sequence will reach the maximum length If you want you can increase the sequence's length here by including more tokens So this is the concept of the Receptive field applied to the self window attention.
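The notebook mentioned here is not reproduced on this page, but the idea behind it can be sketched in a few lines: represent each position as the set of words it has "seen" so far, and apply the causal sliding-window rule layer after layer. This is a minimal re-implementation of that idea written for this explanation, not the author's notebook.

```python
WINDOW = 3
tokens = ["the", "cat", "is", "on", "a", "chair"]

# Each position starts out "knowing" only its own word.
state = [{tok} for tok in tokens]

def apply_layer(state, window):
    """One encoder layer: position i takes the union of every position j
    it is allowed to attend to under the causal sliding-window mask."""
    new_state = []
    for i in range(len(state)):
        seen = set()
        for j in range(max(0, i - window + 1), i + 1):
            seen |= state[j]
        new_state.append(seen)
    return new_state

for layer in range(1, 5):
    state = apply_layer(state, WINDOW)
    print(f"after layer {layer}: last token depends on {sorted(state[-1])}")
# With a window of 3 the last position sees 3 words after one layer and gains
# 2 more with every extra layer, until the whole sentence is covered.
```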

So basically with the sliding window attention, we are not Directly connecting two tokens with each other But if we apply multiple layers after one after another this information will get Will get captured by the embedding in successive applications of the layers such that the last layer basically will be able to watch all the sentence even if it's very long and this is actually shown by the Mistral paper in this picture you can see here.

So basically this is our input sequence. So let me write So this is our input which is a The original sentence so the cat is on a chair The fourth token of the first layer. So this is the output of the first layer. So layer one We have seen that the fourth token here depend with a sliding window size of four this will depend on the itself On the previous token on the one before and also this token here and it will produce This this embedding here in the fourth position which includes information about the previous token as well But then this will become the input of the next layer, which is the layer number two And this will produce an embedding at this position for example that will depend for sure on the previous four tokens because the sliding window size is Four but because for example this token here is already the aggregation of the previous four tokens It will actually multiply the visibility of its sliding window So this token here is not related directly to the first one We can see here, but indirectly through the this intermediate token.

We can see here. I hope this I hope this concept is clear. If it's not clear, I try I I Recommend using my notebook so that you can experiment by playing with the multiple Sequences and you can see how the information flow will go through all the layers All right, let's talk about our next topic Which is the KV cache because I want to introduce the KV cache which I already explained in my previous video on llama But I want to introduce it again and review it because I want to introduce later the rolling buffer cache So let's start by talking about first of all how we train language models because this is needed to understand the KV cache So the language models are trained using what is known as the next token prediction task so given a prompt the goal of the language model is to predict what is the next token that makes sense with the prompt that We have given and imagine we want to train a language model on Dante Alighieri's poem Divine comedy and in particular we will training it on a line that you can see here in This one in English.

So love that can quickly seize the gentle heart. How does it work? We prepare an input for our language model Which is the line that we want to teach it with a token Prepended called the start of sentence and then we build the target which is the same line But with a token at the end called end of sentence.

We run the input through this transformer model It will produce an output sequence so as we saw before the in the output of the self attention is another sequence with the same length as the input sequence, but Embedding is modified in such a way that each token capture information about other tokens.

And this is what we do To actually train a model so if we feed the model with with the nine tokens the model will produce nine tokens as output and how does it work basically the model will learn a mapping between Input and output such that if we give to the model as input the token start of sentence only it will produce The first token as output which is the word love if we give to the model as input the first two tokens So start of sentence love the model will produce the two tokens as output So love that it will feed the model as input three tokens So start of sentence love that the model will produce love that can so when we train the model we train it like this We prepare the input like this the target like this.

We calculated the output We calculated the loss using the cross entropy loss and then we run back propagation and this is done in all one step When we do the inference we do it in multiple step So when we do the inference at time step one, we feed the model only the first token So the start of sentence and the model will produce the output love Then we take the output the last token of the output and we prepend it to the input which becomes the input as time step Two so it becomes start of sentence love.

So the model will produce love that We take the last token of the output and we prepare append it to the input for the time step three And this will become the new input which will produce love that can then we take the last Token of the output and we append it to the input for the time step four So it will become the new output will become love that can quickly then we take this word quickly We append it to the input for the next time step and which will produce the next token as output, etc Etc until the last token until we see the end of sentence token as output then we know that the model has stopped Has stopped producing new tokens and we can stop the inference Now at every step the inference we are only interested in the last token output by the model because we already have the previous one, but of course, we need to feed all the previous tokens to to the model which is Belonging to the prompt because the model needs to access the prompt to understand which token to produce next So for example, we cannot produce the word gentle only by giving the word the we need to give all this sentence to produce this output gentle here But at the same time we are only interested in the last word gentle And this is the reason we introduce the KVCache because the KVCache allow us to reduce the computations that we are doing by only producing one output at a time the one that we need but Without doing all the intermediate computations for all the other tokens that we never use So basically when we want the word heart We don't want to produce the output for the word love that can quickly seize the gentle because we already have them in the prompt We don't need to produce all these tokens We just want to produce the output for the token heart.

So we want to reduce the computation that we are doing Let's see how it works Now in the self-attention mechanism You know that we multiply the query which can be thought of as a list of tokens where each token is an embedding of size 4096 and the transposed of the query becomes the is multiplied the transpose of the keys Are multiplied by the queries to produce this matrix here And then we multiply it by the V matrix to produce the output of the self-attention.

You can see here Let's do this one token at a time So when we inference a language model, we start with our first token, which is the start of sentence. This is one token represented by an embedding of size 4096 we multiplied by the transposed of the keys which is again one token because it's a self-attention.

So The query the key and the value are the same matrix. So this is just the transposed of the query Basically, and so it's a column vector and it will produce a one by one matrix We multiply it by V and it will produce an output token. We take this output token We send it to the linear layer and then to the softmax to understand which token this corresponds to in our vocabulary We take this token from our vocabulary and we append it to the query for the next Inference step to the keys and the values and then we compute again the product of the query multiplied by the keys We multiply then the result by V and it will produce an output made up of two tokens because we have two tokens as input It will produce two tokens as output, but we are all interested in the last token So we take this output token too, we send it to the linear layer then to the softmax This will result in what token is corresponding to in our vocabulary.

We take this token from our vocabulary We append it for the next step to the query key and values We do again this process and then we take the last token as output You can see here. We may send it to the linear layer then the softmax We understand which token it corresponds to, we append it to our query key and values and then we compute again the self attention but we already start to notice something because First of all, in this matrix here, which is the result of the query multiplied by the transpose of the keys We have a lot of dot products at each step that were already computed at the previous step Let me show you.

At the time step 4, we are computing all these dot products As you can see at the time step 3, we already computed these dot products and at the time step 4 We are computing them again as you can see these dot products here The second thing is that usually when we Deal with the language model, we have a causal mask So we do not even care about computing the dot products that we see here in the dark violet because they will be anyway masked out by the causal mask that we apply Because we don't want the first token to watch the token number 2, the token number 3, the token number 4 We only want the token number 4 to watch the previous one So the token number 4 should be related to itself, the previous one, the token number 2 and the token number 1 But not the opposite And also we don't want to produce all these output tokens because we are only interested in the last one We are only interested in knowing what is the last token produced by the attention so that we can send it to the linear layer and then to the softmax to understand what is the Word corresponding in our vocabulary so that we can use it for the prompt to inference the next token again So now let's introduce the KVCache and how the KVCache solve this problem What we do with the KVCache, again, we start from our first step of the inferences So we start from our start of sentence token, which is multiplied So the query is only the start of sentence token.

We multiply it by the transpose of the keys, which will produce a 1 by 1 matrix here. Then we multiply it by V and it will produce our first token as output. We send it to the linear layer, then to the softmax, and then we know which token it corresponds to. Now, with the KV cache, instead of appending this new token that we have produced as output to the query, key and value, we only append it to the key and the value, and we replace the previous query entirely with this new token. So

So Before without the KVCache we were appending the every output token So the last token of the output to the query key and values, but in with the KVCache We don't append it to query key and value, but only to the key and values And we only use the last output token as query for the next step So if this is the output of the first step, so the output corresponding to the token start of sentence We take it we use it as query for the next step, but we append it to the key and the values So this is why it's called the KVCache because at each step We are keeping a cache of the previous K and V But not for the query because we are entirely replacing all the queries with the last token Anyway, this will produce a product So this matrix multiplied by this matrix will produce a matrix that is 1 by 2 We multiply it by V and we will see that this produces only one token as output Then this we take this token we send it to the linear layer to the softmax then we know which token it corresponds to then we use it as Query for the next iteration, but we append it to the only the K and the V matrix This will produce a 1 by 3 matrix Which is then multiplied by the V which will produce the this output token This is the one we are interested in basically, then we use it as query for the next iteration But we append it to the K and the V etc.

So as you can see at the fourth step of the inference We are producing only the last row that we were interested in When we didn't have the KVCache. So let me show you this is the Fourth time step with the KVCache. Let's look at the fourth time step without the KVCache As you can see we are only producing this row here This is the only one we are interested in to produce this last token So with the KVCache basically we reduce the number of computations that we are doing at every step Because the sum of the dot products we have already done in the previous steps and we only produce one token as output Which is exactly the one that we need for predicting the next token Ok, now let's talk about the rolling buffer cache So since we are using the sliding window attention with a size of W And in the examples I showed you before I was using a sliding window size with the size of 3 We don't need to keep all the possible K and V in the cache But we can limit the K and the V only to W tokens because anyway, we will not be computing Attention outside of this W window.

So we do not need imagine our window is 10 tokens We do not keep the previous 1000 tokens because anyway, our attention will only be calculated on the previous 10 tokens So this is the idea behind the rolling buffer cache. Let's see how it works Imagine we arrive at the token 8 of inference using the KVCache If we have a KVCache and we are using the sliding window size of 4 for example We will see that as query we will use the output of the previous step and as key and values We will use the entire cache which is made up of 8 tokens But because of the mask that we are using with the sliding window attention We are not interested in the computation of these dot products because anyway They will be masked out because the distance between this token and this token is outside of the sliding window attention so we are not interested in this calculating these dot products because They will be masked out and secondly We are not interested in keeping this one because anyway because these values will be masked By our mask for the sliding window attention, which basically will result in zeros here We do not care about producing these first four rows in the value matrix because anyway They will be multiplied by zeros.

So they will not contribute to the output token. So here you have to imagine that Let me draw Here you have to imagine that the mask will take care of making this one zero, this one zero, this one zero, this one zero And this one will be a dot product.

This one will be a dot product. This one will be a dot product This one will be a dot product. So whatever value there is here Whatever value there is here, here or here will not contribute to the output of this token because anyway They will be multiplied by zeros here So we do not need to keep this value also in the V matrix or in the K matrix because anyway They will not be used by the sliding window attention.

So that's why we can limit the size of our K and V Cache only to W tokens where W is the size of the sliding window attention that we are using Now let's see how this rolling buffer cache was implemented. So basically rolling buffer cache is a way of Limiting the size of a cache to a limited size in this case W So imagine our W is only four Imagine we have a sentence "the cat is on a chair" and we want to use it for our KV cache At the first inference using the KV cache, we will add the first The first token to the KV cache, then we will add the second one the third one and the fourth one But now the KV cache is full.

How do we proceed further? Basically, we keep track of where we added the last item Using a pointer that we keep track of and when we will arrive at the next token, which is the token A We basically replace the oldest value here starting from the beginning and we update the value of the right pointer but now how do we Go back because now the order of the tokens is not matching the sentence because as you can see now the Cache contains "a cat is on" but this is not the order in the original sentence in the original sentence the order should be "cat is on a" so what we do is we do the unrolling or unrotation and How do we do it?

Basically because we kept track of this right pointer We just need to take all the values after the right pointer and then we put the values from 0 to the right pointer itself So all the values after the right pointer and then all the values before the right pointer and this is how we unrotate And this operation is done in the code in the function called unrotate.
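As a rough illustration, here is a simplified sketch of such a rolling buffer with the unrotation step, written from the description above (it is not the actual Mistral code, which stores tensors rather than Python lists).

```python
class RollingKVCache:
    """Fixed-size buffer keeping only the last `window` entries, overwritten in a ring."""
    def __init__(self, window):
        self.window = window
        self.buffer = [None] * window
        self.count = 0                      # total number of items ever added

    def add(self, item):
        self.buffer[self.count % self.window] = item
        self.count += 1

    def unrotate(self):
        """Return the cached items in their original (chronological) order."""
        if self.count < self.window:        # cache not full: ignore unfilled slots
            return self.buffer[: self.count]
        pivot = self.count % self.window    # position of the oldest surviving item
        return self.buffer[pivot:] + self.buffer[:pivot]

cache = RollingKVCache(window=4)
for tok in ["the", "cat", "is", "on", "a"]:
    cache.add(tok)
print(cache.unrotate())  # ['cat', 'is', 'on', 'a']: the oldest token 'the' was overwritten
```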

You can see here that it basically has three conditions: if the cache is not full, we can just ignore the unfilled items, so we take all the values from 0 up to the right pointer; if the cache is full, we take the values from 0 up to the right pointer; and if the right pointer is already overwriting some values, then we need to unrotate, which is done in the third condition here: we take all the values after the pointer and then the values up to the pointer, and this is how we unrotate this rolling buffer cache. Okay, let's talk about another concept that is very important, which is chunking and pre-filling. Basically, when we generate text using a language model, we use a prompt and then we use this prompt to generate future tokens. When dealing with a KV cache we need to build up this KV cache, so we need to add the tokens of our prompt to it, so that we can then exploit it to generate new, future tokens. Now, the prompt is known in advance, right?

Because it's the input of our user: it's what you ask ChatGPT, for example. "Write me a poem", "tell me a joke": this is our prompt, so it's known in advance and we don't need to generate it. So what we can do is pre-fill the KV cache using the tokens of the prompt. But there are many ways to do it: like we were doing before, when I was teaching you about the KV cache, we can work with one token at a time.

So one way to To add the tokens to the KV cache is to add one token at a time But this can be very time consuming because imagine you have a very large prompt which happens With retrieval augmented generation, which we have very big prompts like 5,000 6,000 tokens or even bigger So this if we add one token at a time It will mean that we have to take 5,000 or 6,000 forward steps in our network Which is can be very time consuming and also doesn't exploit our gpu very much The other way is to take all these tokens and feed them all at once to the model But that may be limited by the size of our gpu because imagine we have 10,000 tokens as our prompt Then maybe our gpu cannot even hold 10,000 tokens.

Maybe it can only hold 4,000 tokens or 2,000 tokens Depending also on the w size of the attention sliding window attention that we have chosen The solution in this case is to use chunking. Basically, we divide our prompt into chunks of a fixed size And this size is equal to w which is the sliding window attention size So imagine we have a very big prompt and we choose a sliding window size of 4 for the calculation of the attention And imagine that the prompt is this one.

So can you tell me? Can you tell me who is the richest man in history? The way we work is this Basically, we take our first chunk of the prompt So because we chose a sliding window size of 4, we also will choose the chunk size to be 4 So we take our first token of the prompt So can you tell me and we compute the self attention in the attention Self attention in the first layer of the model.

How do we build the attention mask? Basically as queries we take all the incoming tokens in this chunk So as this is you can think of this column as the queries and this column as the keys And this is the result of the query multiplied by the transposed of the keys plus the mask So our query we take the first incoming chunk and as keys I will show you later We take the current content of the kvcache, but initially it is empty plus the incoming tokens of the current chunk And this is made for a very specific reason that I will show you in the next step So in the next step, basically we take the current chunk, which is the tokens who is the richest And we aggregate it with the content of the kvcache using the tokens of the previous chunk.

So let me go back At the first step of this prefilling we take the first chunk of the prompt So can you tell me we calculated the attention mask using as query The first four tokens and as keys and values the content of the kvcache Which is empty plus the tokens of the first chunk and then we update the content of the kvcache Using this the tokens of this chunk after we have computed the attention So at the next step the kvcache now contains the previous the tokens of the previous chunk So can you tell me but now the current chunk has become who is the richest?

so as query again, we take the tokens of the current chunk, but as keys and values we take the The content of the kvcache plus the tokens of the current chunk. Why? because As you can see when we were doing token generation when I was teaching you the kvcache We first add the last output token We add it to append it to the k and the v and we use it as the query for the next iteration This is not what we do here.

Here. We first Calculated the attention and then we update the kvcache And when we use the when we build the query the query we use only the tokens of the current chunk And as key and values we take the content of the kvcache So the content of the previous chunk plus the tokens of the current chunk.

Why? because imagine if we didn't do We didn't use the content of the previous chunk. What would happen is this we would have a Attention mask that is only comprised of the tokens of the current chunk. So it would be only limited to this matrix here Let me draw it.

So Only this matrix here But if we only use this matrix here the word who Would not be able to to would not be related to the word me Tell and you even if with the sliding window size, they should be able to watch each other So because we want to relate the current chunk to the previous chunk We basically take as a key and value the content of the kvcache plus the tokens of the current chunk So that we can build this attention between chunks Otherwise this attention would not be built and as query we always use the tokens of the current chunk Let's review how this mechanism is built in the code So basically the pre-filling is done by chunks There is the first chunk and then there are subsequent chunks And finally there is token generation after we have a pre-filled our kvcache with the prompt During the first pre-fill, which means that we are doing it for the first chunk of our prompt As attention mask, we only consider the size of the incoming tokens in the current chunk But for any subsequent chunks, so after the first chunk as to build the attention mask For the query we just use the size of the incoming chunk But for the k and v we use the size of the kvcache, which is this one So cached as you can see here plus the size of the current chunk, which is this s variable you can see here And for token generation, we do the same system that we did before when I was teaching with the kvcache So one token at a time, we take it, we append it to the key We append it to the value and we replace the query with the output token from the previous step So the last chunk in our case will be the tokens man in history And what we do is basically we take the current chunk, so man in history, which becomes the query While the key becomes basically the previous chunk plus the tokens of the current chunk So the who is the richest plus the tokens of the current chunk, so man in history And the reason we do it because otherwise the word in the current chunk will not be able to Be related to the word of the previous chunk, which is necessary.

Okay guys, let's talk about the sparse mixture of experts. Mixture of experts is an ensemble technique in which we have multiple expert models, each of which is trained on a subset of the data, such that each model specializes on that subset. Then, when we produce the output of this mixture of experts, we take the output of each of these experts and combine them, usually with a weighted sum or by averaging, to produce one single output. In the case of Mistral we do not talk about just a mixture of experts but about a sparse mixture of experts, because we have many expert models but we only use some of them. Let me show you.

In the case of Mistral, we have eight experts which are present as the feed-forward layer So after we calculate the self-attention, as you remember, we have this feed-forward network. In the case of Mistral 8x7b, we have eight feed-forward layers. We have to think of them in parallel And the gate is a function that basically will decide for each token which expert, so which feed-forward network, should be working with that token and it will choose two feed-forward networks for each token.

It will run the token through these feed-forward networks, take their output and weight it according to the logits this gate produces, to produce a weighted sum, which becomes the output of the feed-forward block for that particular token. Let me show you with an example. So this is the architecture of Mistral.

As you can see, we have the input of this encoder layer. We first run the self-attention using the sliding window attention and the KV cache, etc, etc Then we run the normalization and finally we have this gate function here, which is basically just a linear layer that will produce logits eight logits, which will be values.

Let's call them score values for our expert. The two best performing experts, so the two highest score, will indicate which experts that token should work with Then we run each token in their own two best performing experts Then we take the output of these two experts. We combine it with the weight What is the weight?

Basically the logits produced by the gate are, suppose, eight values here Yeah, I draw only four because I don't have space, but you imagine you have eight values Then we take the top two, so 1.5 and 3.4 in this case These are the two experts through which we will run the token We take the softmax of the two best performing values.

This will be the weight that we'll be using for the weighted sum. And why do we do it? Because by using a sparse mixture of experts we can have many expert models, but during inference only two out of eight will be activated. As you remember, the feed-forward network is basically two linear layers.

So the linear layer can be thought of as a matrix multiplication of a weight matrix with the input So if we didn't use a sparse mixture of experts, we would run the token through all the eight experts Which means that we need to compute eight matrix multiplications But by using sparse mixture of experts for each token, we are only doing two matrix multiplications Which makes the inference faster But at the same time allows us to increase the power of the model and the parameter of the model Because we are only using some parameters for A subset of the token.

So some tokens will use the expert number one. Some tokens will be using the token The expert number two and three. Some tokens will be using the expert number eight and three Or some other for example, the six and the four etc, etc So we are not using all the experts for each token, but only two of them This allows us to have each expert Specialized on a subset of tokens.

For example, imagine the model has been trained on multiple languages. What could happen is that some experts, so some feed-forward networks, specialize on Japanese tokens and some on English tokens; it could also happen that some specialize in verbs, some in nouns, some in adjectives, etc. This is why we use a mixture of experts: we want to increase the number of parameters of the model so it becomes more powerful at capturing information, but at the same time we don't sacrifice inference speed, because we only use a subset of the experts for each token. And this is the implementation as done in the code: in Mistral 7B the feed-forward module is just a single feed-forward neural network made of linear layers, while in Mistral 8x7B it is not one feed-forward network but eight.

As you can see, it is an array of eight feed-forward networks plus a gating function, which is just a linear layer that maps from the embedding size to eight, the number of experts. For each embedding, so for each token, it produces logits indicating which experts the token should run through, and the token is routed to the top two experts, the two with the highest logit scores. Okay, why do we apply the softmax after selecting the top-k experts? As I showed you, the gating function produces some logits.

We select the top two logits to decide which experts the token should run through, then we take the scores of these two best performing experts and apply the softmax to them to create the weights for the weighted sum. But why do we take the softmax of the two best performing logits instead of the softmax over all of them?

Well, the first problem is that if we take the softmax over all the logits, the weights of the two best performing experts may not sum up to one, which is a condition we want if we need to train multiple models and compare them, because I'm pretty sure the guys at Mistral did not train only one model.

Maybe they trained multiple models with different hyperparameters; maybe they tried a mixture of four experts, but also three experts or two experts, and then chose the best one. If you want to compare models, you want the sum of the weights in the weighted sum to always be one.

Otherwise the range of the output may change from model to model, and it is usually not a good idea for the output range to change from one model to the next. So, to keep the output range stable, they apply the softmax after having selected how many experts to work with, over the logits of the two best performing experts only.
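You can verify this point numerically with a tiny snippet (the gate logits below are made up, just to mirror the 1.5 and 3.4 from the example above):

```python
import torch
import torch.nn.functional as F

gate_logits = torch.tensor([1.5, 0.3, 3.4, -0.7, 0.9, 1.1, -1.2, 0.4])  # 8 made-up gate logits

top_vals, top_idx = gate_logits.topk(2)             # -> 3.4 and 1.5
w_after = F.softmax(top_vals, dim=0)                # approx [0.87, 0.13], always sums to 1
w_before = F.softmax(gate_logits, dim=0)[top_idx]   # softmax over all 8, then pick the top two
print(w_after.sum().item(), w_before.sum().item())  # 1.0 vs. something smaller than 1
```

With softmax-after-top-k the two weights always sum to one, no matter how many experts the model has; with softmax-over-everything the sum depends on how many other experts exist and how large their logits are.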

The next thing we are talking about is model sharding, which is also implemented in the code of the Mistral model. When we have a model that is too big to fit on a single GPU, we can divide it into groups of layers and place each group on a different GPU. In the case of Mistral we have 32 encoder layers, one after another; I didn't draw all 32 of them, just think of this as layers 1 to 8, then 9 to 16, 17 to 24 and 25 to 32, and we put each group of layers on a different GPU, so we have four GPUs. The way we run inference on a model like this is as follows: we take our input, convert it into embeddings and run it through the first eight layers on the first GPU. The first GPU produces an output, which is the output of the eighth layer. We transfer this output to the second GPU and use it as input for the ninth layer, then run it through the layers one after another until it reaches layer number 16, which produces an output.

We take this output and move it to the next GPU, so it becomes the input of layer number 17, and then we run iteratively through the layers until layer number 24, which produces an output. We move it to the next GPU and run it iteratively until layer number 32, then we apply the last linear layer and the softmax to produce the output of the model. However, you can notice that this method is not very efficient, because at any time only one GPU is working.
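Here is a minimal sketch of this naive sequential sharding. It is purely illustrative, not Mistral's actual code: I use torch's generic TransformerEncoderLayer as a stand-in for Mistral's decoder blocks and assume four CUDA devices are available.

```python
import torch
import torch.nn as nn

# Stand-in for Mistral's 32 transformer blocks (dimensions shrunk for illustration).
num_layers, dim = 32, 512
layers = nn.ModuleList([nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
                        for _ in range(num_layers)])

devices = [torch.device(f"cuda:{i}") for i in range(4)]   # assumes 4 GPUs
for i, layer in enumerate(layers):
    layer.to(devices[i // 8])          # layers 0-7 -> cuda:0, 8-15 -> cuda:1, ...

def forward(x: torch.Tensor) -> torch.Tensor:
    for i, layer in enumerate(layers):
        x = x.to(devices[i // 8])      # move activations when we cross a GPU boundary
        x = layer(x)
    return x                            # only one GPU is busy at any moment
```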

A better approach, which is not implemented in the code of Mistral but is referenced in the paper, is pipeline parallelism, so let's see how it works. I will talk about the algorithm introduced in the GPipe paper. Basically it works as follows; first let me introduce the problem. This technique is usually used when training a model, not when inferencing, but it can also be applied to inference. Imagine we want to train a sharded model.

That is, a model split into multiple groups of layers, with each group on a different GPU. Imagine we have four GPUs, each with its own group of layers, and we want to train this model. We run our input through the first GPU, so we run the forward step on the first GPU.

We take this output and feed it to the next GPU, run the forward step there, take the output and run it through the next GPU, GPU number three, then take that output and run it through the next GPU, GPU number four. Now we have the output of the model.

We compute the loss and then we can run backpropagation, which is basically just the opposite: we go from the last GPU to the first. We run backpropagation on the fourth GPU, which computes its gradients; we use them to compute the gradients on the third GPU, then use those to compute the gradients on the second GPU, and those to compute the gradients on the first. So the forward step goes from the input to the loss, and the backward step goes from the loss back to the input and to all the parameters, which are also known as the leaf nodes of the computational graph. However, also in this case, you can see that at each step we are only utilizing one single GPU, and all the other GPUs are basically not working.

They are idle. A better way is to use pipeline parallelism. Imagine the previous training step was done using a very big batch, say a batch of eight items. With pipeline parallelism we take this batch and split it into micro-batches.

So instead of eight items, we create four micro-batches of two items each. We run the first micro-batch on the first GPU; this produces the output for the first micro-batch, which we can feed to the next GPU. At this point we realize the first GPU is now free, so it can already start working on the second micro-batch while the second GPU is working on the first micro-batch; when it finishes, it can send its output to the next GPU. Then the second GPU becomes free, so once GPU one has finished we transfer its output to GPU two, and GPU one, being free again, can work on the third micro-batch, as you can see here. After the third GPU has finished, its output is sent to the fourth GPU, and since the third GPU is now free, if the previous GPUs have finished we can transfer the second micro-batch to the third GPU, the third micro-batch to the second GPU, and the first GPU, which is free, can start working on a new micro-batch, the fourth one. Basically we keep time-shifting the micro-batches, and this results in a better utilization of the GPUs, because now at this time step, for example, all four GPUs are working, and the same happens at this time step here in the backward pass. For each micro-batch we calculate the gradients but we do not update the parameters: we do what is called gradient accumulation, which means we compute the gradients for each micro-batch and keep summing them into the existing gradients, and only after all the micro-batches have finished their forward and backward passes do we update the parameters of the model. Gradient accumulation is a technique I introduced in my previous video on distributed training, so if you want to understand how it works, including the math behind it, I refer you to that video.
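Here is a minimal sketch of the micro-batch plus gradient-accumulation part of the idea. The names model, loss_fn and optimizer are placeholders of mine, and the sketch deliberately does not show the time-shifted GPipe schedule itself (a pipeline framework would orchestrate that across GPUs); it only shows why the parameters are updated once per full batch.

```python
import torch

def train_step(model, loss_fn, optimizer, batch_x, batch_y, num_micro_batches: int = 4):
    optimizer.zero_grad()
    for micro_x, micro_y in zip(batch_x.chunk(num_micro_batches),
                                batch_y.chunk(num_micro_batches)):
        # Scale each loss so the accumulated gradient equals the full-batch average gradient.
        loss = loss_fn(model(micro_x), micro_y) / num_micro_batches
        loss.backward()          # gradients are ACCUMULATED into .grad, not applied yet
    optimizer.step()             # single parameter update after all micro-batches
```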

This is the solution with pipeline parallelism: we divide our batch into micro-batches. It also works for inference, because when we inference we just don't have the backward step, so we delete the second half of the table; we can still take our big batch at the beginning, split it into micro-batches and time-shift them according to the availability of the GPUs. Pipeline parallelism still introduces some time steps in which not all GPUs are working, and these are called bubbles.

To avoid these big bubbles, what we can do is use a bigger initial batch size, so that we have more micro-batches. Okay guys, now let's go to the last part of this video. I know the Mistral code is much more complicated to understand compared to the Llama code, and I will show you why, but I will also help you understand the most complex topic in the code, which is the xformers library, a trick they use to improve inference performance; it is actually a very advanced technique and I want to give you a glimpse into how it works. Imagine you are running an AI company providing an LLM inference service: you expose an API and customers send their prompts to it, wanting to run inference through your large language model. Each prompt may have a different length, because each customer may be using the model for a different purpose. For simplicity, suppose each word is a token, and suppose you have three customers: the first says "write a poem", the second says "write a historical novel" and the third says "tell me a funny joke". Of course you could process these prompts one by one, but that would not be very efficient, because the other two customers would be waiting for the first one to finish, and when you have a lot of customers that's not good.

And secondly, you may not be fully utilizing the memory of your GPU. So the best thing you can do is batching: you take all these prompts and create one big batch. But the problem is that the prompts have different lengths: the first prompt is made up of three tokens, the second of four tokens and the third of five tokens. One solution is to add padding.

So basically we create a batch in which we append padding tokens to each input sequence until they all reach the same length. Then we can run this batch through our large language model, which could be for example Llama or Mistral. As we saw before, when we have an input sequence of N tokens, the attention mechanism produces an output sequence of N tokens, and we usually take the embedding of the last token and send it to the linear layer and then the softmax to understand what the next token from our vocabulary is. But in the first prompt we have added two padding tokens, so we cannot use the embedding at the last position, because it corresponds to a padding token. What we should do instead is take the embedding corresponding to the last non-padding token and send that to the linear layer and then the softmax to predict the next token. In the case of the second prompt we should use the fourth token, not the last one; only in the last prompt can we use the last token, because it is not a padding token.
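A toy sketch of this padding bookkeeping (the token ids and the hidden size are made up; the random tensor stands in for the model's output embeddings):

```python
import torch

pad_id = 0
prompts = [[5, 8, 2], [5, 8, 9, 11], [7, 3, 8, 14, 20]]          # token ids, one list per customer
max_len = max(len(p) for p in prompts)
lengths = torch.tensor([len(p) for p in prompts])                 # [3, 4, 5]

batch = torch.full((len(prompts), max_len), pad_id)               # right-padded batch, shape (3, 5)
for i, p in enumerate(prompts):
    batch[i, :len(p)] = torch.tensor(p)

hidden = torch.randn(len(prompts), max_len, 128)                  # stand-in for the model output
last_non_pad = hidden[torch.arange(len(prompts)), lengths - 1]    # (3, 128): tokens 3, 4 and 5
# last_non_pad would then go through the linear layer + softmax to predict each next token
```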

Now that we have built this padded batch, how do we create an attention mask to run it? We basically just create a causal attention mask, so that each token can only attend to previous tokens and not to future ones. This mask works fine for all three scenarios you see here, and I will show you later how. We cannot use a different mask for each prompt, because all the prompts in the batch now have the same length: we cannot use a 3x3 mask for the first prompt, a 4x4 for the second and a 5x5 for the third, because the input sequence length is 5, so all the masks must be 5x5 and causal. We can also mask out some extra values: for example, if the sliding window size is 4, we can mask out this value here as well, because we don't want this token to attend to tokens that are further away than the window allows.
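For reference, this is one way to build that 5x5 causal mask with a sliding window in plain PyTorch (I use the convention that a window of size 4 means each token attends to itself and the 3 previous tokens; this is illustrative, not the Mistral code):

```python
import torch

seq_len, window = 5, 4
i = torch.arange(seq_len).unsqueeze(1)        # query positions (rows)
j = torch.arange(seq_len).unsqueeze(0)        # key positions   (columns)
allowed = (j <= i) & (j > i - window)         # causal AND within the sliding window
mask = torch.where(allowed, 0.0, float("-inf"))
print(mask)
# The last row has -inf in the first column: the 5th token cannot attend to the 1st one,
# because it falls outside the sliding window.
```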

The problem is that we are calculating a lot of dot products, especially for the first and the second prompt, that will never be used. Let me show you why. When we apply this 5x5 mask to this input sequence, which, I want to remind you, is a batch, it produces the following attention scores, in which all these values are masked out with minus infinity because of the causality of the mask. We cannot mask this value here, because it is needed for the last prompt, for example, and we also cannot mask this other value, which is also needed for the last prompt. But for the first and second prompt we are computing a lot of dot products, for example these ones between padding tokens and other tokens, that we will not use, because, I want to remind you, for the first prompt we will use the output at the third token, for the second prompt the output at the fourth token, and only for the last prompt will we check the last output

We will be checking the last output Of the output of the self-attention, but for the first two prompts We will not be even checking the last token output from the self-attention because they correspond to the padding token So is there a way to avoid these padding tokens? Being introduced in our calculation and calculating all these dot products which will result in output tokens that we will not even use Well, there is a better solution and the solution is this The solution is to combine all the tokens of all the prompts into one big sequence Consecutively and we also keep track of what is the actual size of each prompt So we know that the prompt are coming from our API because we are running an AI company and we have this API So we know that the first customer has a token size prompt of size three tokens.

The second one has Four tokens and the third one has five tokens So we can keep track of these sizes in an array, for example And then we build this sequence which is a concatenation of all the prompts that we receive We take this mega sequence. We run it through our LLM model.

which could be Mistral or Llama. As I told you before, an input sequence of N tokens in a transformer results in N output tokens, so here we have 3 plus 4, which is 7, plus 5, which is 12 tokens as input, and it will produce 12 tokens as output.
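A toy sketch of this packing and of the bookkeeping described next (token ids and hidden size are made up; the random tensor stands in for the 12 output embeddings):

```python
import torch

prompts = [[5, 8, 2], [5, 8, 9, 11], [7, 3, 8, 14, 20]]
seqlens = [len(p) for p in prompts]                         # [3, 4, 5]
flat = torch.tensor([t for p in prompts for t in p])        # shape (12,) - no padding at all

hidden = torch.randn(flat.shape[0], 128)                    # stand-in for the 12 output embeddings
last_positions = torch.tensor(seqlens).cumsum(0) - 1        # tensor([2, 6, 11]) -> tokens 3, 7, 12
next_token_inputs = hidden[last_positions]                  # one embedding per prompt for the head
```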

We need to check the We need to check the embedding corresponding to the token number 3 for the first Prompt to the token number 7 for the second prompt and the last token for the third prompt So we take all these embeddings we run them through the linear layer Then we apply the softmax and then we understand what is the next Token from our vocabulary, but you may be wondering How do we even produce an attention mask that can work with multiple prompts that are combined into one sequence?

such that the tokens of one prompt do not attend to the tokens of another prompt, but only to tokens of the same prompt? Well, the xformers library allows us to do that using a mask called a block diagonal causal mask, which is also used in the source code of Mistral.

I want to show you how it works. Basically, this block diagonal causal mask from xformers produces a mask like this: it groups all the prompts, such that each token can only attend to the tokens in the same group. Here we have three prompts, so the token "poem", for example, can only attend to the tokens of its own prompt, and the token "novel", for example, cannot be related to the token "poem".

So it will put minus infinity here, but all the tokens of the same prompt can be attended to by the token "novel", while the tokens of the last prompt can only attend to the other tokens of that same prompt. This is a special mask built using the xformers library. Let me show you how it works in the code.

In the Mistral source code they use this library called xformers. The xformers library allows us to build very complex attention masks and also to compute the attention very efficiently using memory-efficient attention, which I will not show in this video; maybe I will make a future video about it. Basically, what they do in the Mistral source code is this: if there are multiple prompts, they create one big sequence, keep track of the number of tokens of each prompt, and then use the methods made available by xformers to build these complex attention masks. The masks keep track of the different sizes of the KV cache, because each prompt may have a KV cache that differs from another prompt's: imagine one prompt with 5,000 tokens and one prompt with only 10 tokens; of course you will have a KV cache with 5,000 tokens in one case and 10 in the other, so the attention mask we build must take care of this. And the second thing is that each group of tokens should only be able to relate to the tokens of the same group, not to other groups.

That is, not to tokens from another prompt, and this is done with the block diagonal causal mask. Basically we tell it: the first prompt is made up of seven tokens, the second prompt of five tokens and the third prompt of six tokens, and we are also using sliding window attention with a window size of three. This creates the complex mask you can see here: tokens 0 to 6 are the first prompt, 7 to 11 are the second prompt and 12 to 17 are the third prompt, and as you can see it also takes the sliding window size into consideration, so each token can only attend to itself and at most the two previous tokens, those contained in the sliding window of size three.
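A sketch of how a mask like that is built and used. The class and method names come from xformers (xformers.ops.fmha.attn_bias), as used by the Mistral reference code, but exact signatures can differ between xformers versions, and the shapes and values below are assumptions of mine; it also needs a CUDA GPU to run.

```python
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

seqlens = [7, 5, 6]            # three prompts packed into one sequence of 18 tokens
sliding_window = 3

# Block-diagonal (per-prompt) + causal + restricted to the sliding window
mask = BlockDiagonalCausalMask.from_seqlens(seqlens).make_local_attention(sliding_window)

# q, k, v are packed the same way: (batch=1, total_tokens, n_heads, head_dim)
q = torch.randn(1, sum(seqlens), 32, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, sum(seqlens), 32, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, sum(seqlens), 32, 128, device="cuda", dtype=torch.float16)
out = memory_efficient_attention(q, k, v, attn_bias=mask)  # tokens only attend within their prompt
```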

This block diagonal causal mask is used for the first chunk during pre-filling. The second mask they use is the block diagonal mask, which is used for the subsequent chunks of pre-filling. During the first pre-fill chunk the KV cache is still empty, but during the subsequent steps it is not empty anymore, so we also need to take into consideration the different sizes of the KV caches: for example, the first prompt may have a KV cache of size 10 because the prompt is very short, while the second prompt may be very big, say 5,000 tokens.

So it may have a KV cache of size 5,000; the mask that is produced takes the size of each KV cache into consideration. The last mask they use is the block diagonal causal with offset padded keys mask. Each prompt may have a different number of tokens actually stored in its KV cache, but the KV cache itself is a tensor of fixed size W, and only some of its slots may actually be filled. So maybe the KV cache size is, let's say, 10, but because the first prompt is very short, only three tokens are actually present in its cache. However, when we pass the KV cache to the attention computation, we pass the whole tensor, all 10 slots, so we need a way to tell the mask that it should only use the first three items of that cache and not the whole tensor. This is done with the block diagonal causal with offset padded keys mask: it has a very long and complicated name, but this is why they use it. It produces a mask that takes into consideration the actual number of items in each KV cache: even if all the caches have the same size, because they are fixed-size tensors, it tells you how many items from each cache should actually be used.
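A sketch of how that last mask is constructed at decoding time. The keyword names q_seqlen, kv_padding and kv_seqlen follow the xformers version used by the Mistral reference code; treat them as an assumption if your version differs, and the cache sizes below are invented for illustration.

```python
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalWithOffsetPaddedKeysMask

W = 10                                  # fixed size of each prompt's KV cache tensor (rolling buffer)
q_seqlen = [1, 1, 1]                    # one new token per prompt at this decoding step
kv_seqlen = [3, 10, 7]                  # how many slots of each cache are actually filled

mask = BlockDiagonalCausalWithOffsetPaddedKeysMask.from_seqlens(
    q_seqlen=q_seqlen,
    kv_padding=W,                       # each cache occupies W slots in the packed key/value tensor
    kv_seqlen=kv_seqlen,                # ...but only the first kv_seqlen[i] slots may be attended to
)
# mask is then passed as attn_bias to memory_efficient_attention, like the masks above
```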

It's a tensor that is of fixed side w But only some tokens may be actual being filled in this KV cache So only maybe the KV cache size is let's say 10 But because the first prompt is very short only three tokens are actually part in the KV cache But when we pass the KV cache to the calculation of the attention We pass all the tensor which is all the 10 items So we need a way to tell to the mask That it should only use the first three items from the KV cache and not all the KV cache Not all the tensor and this is done with block diagonal with offset padding keys mask So this method here, it's very long name very complicated, but this is why they use it And it will produce a mask like this So it takes into consideration the actual size of the KV cache Even if the KV all the KV cache have the same size because it's a fixed size tensor But it tells you how many items there are actually it should use from each cache Okay guys it has been a very demanding video I have to say I had to record it more than once I actually had to cut some parts because I even I got confused sometimes It's very complicated topics.

There are a lot of things to grasp, but I hope it will make your life easier when you want to understand the Mistral code. I am also putting my notes online: the two notebooks I have shown you, plus the Mistral source code annotated by me. Now, I never actually ran the full Mistral model, because my computer is not very powerful; what I did to study it was to create a model with randomly initialized weights and fewer layers, so it could fit on my GPU, and then run some random tensors through it to study all the tensor shapes and the flow of information. So I don't know if the code runs as-is, but I hope it does.

So because my computer is not very powerful So I never run the actual model on my computer What I did to study the model was to run some random tensors to a model and I created basically a model with randomly initialized Weights but with the less number of layers so it could fit in my GPU and then I just run some random tensors To study all the shapes of the tensor and all the information passing So, I don't know if the code works, but I hope it will work.

I mean, I didn't touch the logic. I just add some comments anyway, you can use the commented code by me to as As a learning tool to complement with the official code of Mistral so that you can understand More about the inner workings of this grid model. I actually really enjoyed studying it.

I really enjoyed Studying the code and I learned a lot of stuff, you know I think it's very very good when you are doing something that is very complicated Because it teaches you a lot because if something is simple, then you don't learn much by the end of the day Anyway, guys, thanks you for watching my video.

I hope you also enjoyed this journey with me Even if it was very complicated I hope that you liked this video and you will subscribe to my channel if you didn't please do it And the best way to support me guys is to share this video with all the people, you know So share it on social media share it on linkedin on twitter, etc because this is the best way to you can help me is to grow my channel and Please let me know if there is something that you don't understand.

I am always available to help And connect with me on linkedin. Bye. Bye