Hello guys, welcome back to my channel. Today we are going to code a vision language model from scratch. First of all, what do I mean by vision language model, and what do I mean by coding it from scratch? The vision language model we will be coding is called PaliGemma, a vision language model from Google: the weights came out around two months ago, while the paper came out around two weeks ago. Coding it from scratch means that we will be coding the vision encoder, which you can see here.
Okay the linear projection, which is just a linear Layer the language model itself So which is the transformer language model how to combine the embeddings of the image tokens with the text tokens And of course how to generate the output using the condition. So what is the language visual language model?
First of all, well visual language model is a language model that can extract information from an image So if we have an image like this, for example and a prompt like this, for example, where is the photographer resting? The visual language model can understand where this photographer is resting by looking at the image And generating a response in this case.
The response is "in a hammock under a tree on a tropical beach". The topics of today, basically: first of all, we will be talking about the vision transformer, which is the vision encoder that we'll be using to extract information from this image. This vision transformer has been trained in a particular way called contrastive learning, so we will be talking a lot about contrastive learning, because I want to review not only what contrastive learning is but also the history of how it evolved: the first well-known model is CLIP, which Google later turned into SigLIP, so we will be seeing both models. Then we will be coding the language model itself, the Gemma language model, and how to combine the embeddings of the vision model and the language model, but this part we will do directly in code. We will also be talking about the KV-Cache, because we want to use this language model for inference in an optimized way, and the best way of course is to use the KV-Cache, so we will be coding that from scratch as well. Not only will we be coding it.
I will explain step by step how it works The rotary positional encodings because we need them for the language model and the normalization layers because we have them in the vision model And also the language model. We will be seeing what is the batch normalization, the layer normalization and the rms normalization I will be explaining all the math behind them In this video i'm also using a slightly different approach at teaching let's say Which is by drawing so I will be drawing every single tensor operations that we'll be doing especially in the attention Mechanism because I want people to not only look at the code and hope they get something Like an idea of how it works But actually I want to show each single tensor how it's changing by drawing it from scratch I think this helps better visualize what happens in the transformer model, especially during the attention mechanism So we know what each view operation each reshape operation that we are doing to each tensor and also the matrix Multiplications that we are doing so we can visualize what happens to the tensors itself What are the prerequisites for watching this video?
Well, you have a basic knowledge about the transformer. You don't have to be a master about it It's better if you have watched my previous video on it Which will give you the background knowledge to understand this video and you have a basic knowledge of neural networks So at least you know, what is a loss function, you know, what is a linear layer?
And at least you know, what is backpropagation you don't need to know how it works or the mathematics behind it But at least you know that we train models using backpropagation Having said that guys, let's jump to work. So the first part I will be explaining is the visual transformer So this visual encoder we will be seeing what is the contrastive about it and we will be coding it and then we will move on to how to combine the Embeddings of the image tokens and the text tokens.
The only part that we will not be coding is the tokenizer Because I believe it's a separate topic that deserves its own video. So hopefully I will make another video about it So let's start All right guys before we go deep into each of these topics Let me give you a little Speech actually, so we will be exploring a lot of topics like a lot of topics We will be reviewing for example each of the single Operations that we do in the attention mechanism and we will be looking at it from the code point of view But also from the concept point of view and from the tensor operations point of view There may be some topics that you are already familiar with and that's perfectly fine There are some others that you are not familiar with and that's also perfectly fine because I will be explaining each topic multiple times So for example, we will be Implementing the attention mechanism at least twice So if you don't understand it the first time along with the code, then you will have another time to Understand it and with a different explanation And the same more or less goes goes on with all the other topics.
For example, we will be first introducing the Normalization in one part and then I will review again the normalization The positional encoding done in one way and then we will see another type of positional encoding So don't worry if you don't understand everything at the beginning because I will be reviewing anyway each topic multiple times The important thing is you don't give up So if there is some topic that I couldn't explain because of lack of time For example, I will not be explaining how convolutions work because there are plenty of videos on how convolutions work So if you can pause the video watch five minute video on how a convolution work and then come back to this video That's the best approach I recommend The second thing is always write down all the code that I am I will be showing you so write it Line by line character by character because that's the best way to learn.
So now let's get started Let's start with the first part. So the first part we will be talking about is this contrastive vision encoder Which is something that takes any as input an image and converts it into an embedding Actually a series of embedding. We will see one for each Block of pixels of this image.
So basically our image will be Split into blocks of pixels like this into a grid and each of this grid will be converted into an embedding you can see here This embedding is a vector of a fixed size and that will be concatenated with the Tokens embeddings because as you know, each token is converted into what is known as an embedding Which is a vector of a fixed size.
They will be concatenated and sent to the transformer which will basically attend to this Image tokens as a condition to generate the text. So this is called conditional generation But okay, we will explore all this stuff here Let's talk about this vision encoder now the vision encoder First we need to understand what is why it's called a contrastive vision encoder and to understand why it's contrastive We need to understand what is contrastive learning So let's go back to another slide, which is this one Let's go here so Imagine for now, we will consider the image encoder as a black box and later We will transform this black box into something more concrete now imagine that you have You go to the internet and when you go on wikipedia You see an image and when you see an image there is always a description of what is inside that image If you use a crawler you can crawl all of these images with the corresponding descriptions That in this will produce a data set of images along with the descriptions Now imagine that for some now for now imagine we have a text encoder that is most usually is a transformer model And then we have an image encoder which most of the cases it's a vision transformer And for now, we consider them as black boxes So it's something that takes as input an image and produces Here an image and produces an embedding representation of this image And if you feed a list of images, it produces a list of embeddings one corresponding to each image.
What is this embedding? It's a vector that captures most of the information of this image And we do the same with this text encoder. So the text encoder is a transformer model that produces a series of embeddings. We will We'll see later But imagine you have this text encoder that given a text produces a single embedding of a single text But if you feed it a list of text it will produce a series of embeddings each corresponding to one single text now imagine The data set that we were talking about before which is the data set of images along with the corresponding descriptions So imagine we feed this data set of images along with the corresponding description to the image encoder and respectively to the text encoder It will produce a list of image embeddings and a list of text embeddings Now, what do we want these embeddings to be?
Of course, we want the embedding Of the first image to be representative of that image So we want this embedding to capture most of the information of that image and of course, we want the embedding of the text number one to be A vector that captures most of the information about that text Moreover with contrastive learning we don't want only to capture information about the image or the text But we also want some properties and the property that we want from these embeddings is this We want the embedding of each image when its dot product with the Embedding of the corresponding text it should give a high value for this dot product And when you do the dot product of an image with a text that is not the corresponding one It should produce a low number for this dot product So basically with contrastive learning what we do we take a list of images We take a list of text which is the corresponding text one for each of these images So imagine that the image number one correspond to the text number one the image number two correspond to the text number two, etc etc, etc We encode them into a list of embeddings and then we want to train This model so this text encoder and this image encoder to produce embeddings in such a way That when the dot product of the image with its corresponding text is done It should produce a high value and when you do the dot product of an image with a not corresponding text For example i2 with text3 it should produce a low value now What we can do is basically we take this text embeddings, which is a list of embeddings We take this image embeddings, which is a list of vectors We do all the possible combinations of dot products So the image number one did with the text number one image number one with the text number two image number one with the text Number three, etc, etc Then we do the all the also for the text number one So the text number one with the image number one text number one with the image number two text number one with the image Number three, etc, etc And then we want to find a loss function that forces These dot products to be high so that each text with its corresponding image to be high While all the other possible combinations to be low in value And we do that basically by using what is known as a cross entropy loss.
So To understand why we use cross entropy loss. We need to explore how language models are trained and we will do that very briefly so To not get us confused. So when we train language model, we do the we do so using what is known as the next token prediction task Imagine we want to train a language model on the following sentence.
So I love pepperoni pizza Pizza How do we train such a language model? Well, we give a prompt to this language model for now Let's consider it as a black box. So I love I love pepperoni We feed it to the language model The language model will produce a series of embeddings Which are then converted into logits.
So what are the logits? The logits are a vector that tells us, for every token in the vocabulary, what score the language model has assigned to that token being the next one. For example, imagine the first number here corresponds to the token "Hello", the second number corresponds to the token "pizza", the third to the token "car", the fourth to the token "dog", etc. Which one do we want to be the next token? Of course, we know that the next token should be "pizza", so we want the score of the token "pizza" to be high and all the other tokens to be low in value. We use the cross entropy loss basically to make sure that the next token is indeed "pizza".
So how do we do that? Basically we Language model will output a list of numbers and we force the language model To produce the following output. So pizza should be one and all the others should be zero To compare these two things This one should be a distribution So basically the cross entropy loss what it does it takes a vector it converts it into a distribution With the softmax function and then we compare it with a label and we force the output to be equal to the label This will change the language model To generate a distribution the next time after the training in such a way that the pizza is given a high number and all the others Are given a low number and this is exactly the same that we do here for contrastive learning So we can use the cross entropy loss To force for example in this column here only this number to have a high value and all the others to have a low value And for this row here Only this number to have a high value and all the other number in this Row to have a low value and for example for this row We want the second item to have a high value and all the others to have a low value, etc, etc And we do that with the cross entropy loss Now here is the code that the pseudo code that they show in the Clip paper on how to implement the clip training with contrastive loss So basically we have a list of images and a list of text We encode them and they will become a list of vectors called image vectors and text vectors here image embeddings and text embeddings We normalize them later.
We will see later why we normalize things; roughly, it is to help with the internal covariate shift, but for now ignore it, we will talk about normalization later. Then we calculate all the possible dot products between these embeddings, the text embeddings and the image embeddings, so we basically generate this grid here. Then we generate the labels. What are the labels? Well, for the first row we want the first item to be the maximum, for the second row the second item, for the third row the third item, and that's why the labels are generated with the arange function, which produces the numbers from zero up to n minus one: for row number zero we want item number zero to be the maximum, for row number one item number one, etc., until row number n minus one, where we want item number n minus one to be the maximum. Then we calculate the cross entropy loss between the output of the model, that is, the numbers assigned by the model to each of these dot products, and the labels, which say which of those numbers we want to be the maximum. We do this both by rows and by columns, as you can see here, then we sum these losses and compute the average loss over all the rows and all the columns. And this is how we do contrastive learning.
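To make that walkthrough concrete, here is a small PyTorch sketch of this CLIP-style loss. The encoders are replaced by random embeddings and the learned temperature is left out, so treat it as an illustration of the loss, not the actual CLIP code.

import torch
import torch.nn.functional as F

# Stand-ins for the two encoders: n paired (image, text) examples, d-dimensional embeddings.
n, d = 8, 64
image_embeddings = F.normalize(torch.randn(n, d), dim=-1)
text_embeddings = F.normalize(torch.randn(n, d), dim=-1)

# All possible pairwise dot products: logits[i, j] = image_i . text_j
logits = image_embeddings @ text_embeddings.t()   # shape [n, n]

# Labels: the matching text for image i sits at index i (this is what arange gives us).
labels = torch.arange(n)

# Cross entropy over the rows (each image against all texts) and over the
# columns (each text against all images), then averaged, as in the CLIP pseudocode.
loss_rows = F.cross_entropy(logits, labels)
loss_cols = F.cross_entropy(logits.t(), labels)
loss = (loss_rows + loss_cols) / 2
print(loss.item())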
Now, let's explore: what is the problem with CLIP? Well, the problem with CLIP is very simple: we are using the cross entropy loss, and the cross entropy loss needs to compare two distributions. In a language model we compare the output logits, which are transformed into a distribution, with the label, i.e. which item of this distribution we want to be the maximum one, and we do the same here: we take this column and we convert it into a distribution through a function called the softmax. The softmax is a function that takes as input a vector and converts it into a distribution. What does that mean?
It means that when you have a vector like this, for example, it will be a list of numbers To be a distribution each of these numbers needs to be non-negative. So it needs to be Greater than or equal to zero and plus all of these numbers needs to sum up to one That's what a distribution is Of course The model will predict some numbers and it cannot force all the sum of these numbers to be one and it cannot force the numbers to be non-negative So we apply to the output of the model this function called the softmax Which transforms them into a distribution and then we can compare it with the labels So our label in the case for example for the first For the second row will be this So we want the first item to be zero the second item to be one and this one to be zero this one to be zero This one to be zero this one to be zero, but we need to apply the softmax to the output of the model now the softmax Function has a problem which is And we will see now this is the expression of the softmax basically to we take the output of the model and we exponentiate each item in the output vector, which could be a row or a column And after exponentiating we also divide them with the sum of all the other items So the exponential of all the other items So which means that we need to calculate first of all for each row the exponential of the item And then we need to divide by the sum of all the exponentials of all the other items including itself The the problem is that we are using this exponential.
The exponential is basically a function that grows very fast So if the argument of the exponential Grows the exponential will become huge And this is a problem for computers because in computers we store numbers using a fixed representation Which could be 16 bit or 32 bit which means that we cannot represent up to infinity But we can represent each number up to 2 to the power of n minus 1 basically if you don't have negative numbers So if the exponential is too big then our numbers will grow too much and it may not be represented by 32 bit And that's a problem.
So we need to make this softmax function numerically stable So whenever you heard the term numerical stability in terms of computer science It means that we want to make sure that the number can be represented within 32 bits or 16 bits or whatever range we are using How to make this softmax numerically stable?
Well, the trick is this. In the softmax, each item is exponentiated, and then we divide by this denominator, known as the normalization constant, which is the sum of the exponentials of all the items in the vector. Now, as you know, this is a fraction, and in a fraction you can multiply the numerator and the denominator by the same number without changing its value, so we multiply both by a constant c. Each number can be written as the exponential of its logarithm, because the exponential and the log are inverse functions, so we can write c as the exponential of log c. Using the property that the product of two exponentials is equal to the exponential of the sum of their arguments, we can rewrite the numerator like this, and then we can bring this factor inside the summation in the denominator, thanks to the distributive property of the product with respect to the sum, and apply the same rule again. What we notice is that if we subtract something, this log c, inside the exponential, we can make its argument smaller, which can make it numerically stable. And what do we choose as log c? The negative of the maximum number in the array that we are normalizing with the softmax. This way the argument of each exponential decreases, and it is much less likely that the exponential will blow up, which makes it numerically stable. This means that to calculate the cross entropy loss for each of these columns and rows, the model first needs to output the list of text embeddings and the list of image embeddings, then we compute all the possible dot products; then, for each column, we first need to find the maximum value so we can subtract it before calculating the softmax, then apply the exponential to each item, sum all of these exponentials to obtain the normalization constant, and finally divide each number by this normalization constant. So, as you can see, applying the cross entropy loss involves a lot of computation, and it also makes parallelization difficult: imagine you want to distribute each row to a different device; that device needs the whole row in its memory, because it needs to compute the normalization constant, and if you want to parallelize by columns, you need the whole column in memory, because you first need the maximum item, then the normalization constant, and then you divide by it. So it involves a lot of computation and it is hard to parallelize, because at any moment each device needs at least one full row or one full column, which does not allow us to go to a very big batch size, and this is a problem.
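Here is the same trick as a tiny, self-contained check (pure PyTorch, not part of the model we are building): subtracting the maximum does not change the result, it only keeps the exponentials from overflowing.

import torch

def naive_softmax(x: torch.Tensor) -> torch.Tensor:
    # exp() overflows to inf once the arguments get large, producing nan after the division
    return torch.exp(x) / torch.exp(x).sum(dim=-1, keepdim=True)

def stable_softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtract the maximum first: same output, but every exponent is <= 0
    x = x - x.max(dim=-1, keepdim=True).values
    return torch.exp(x) / torch.exp(x).sum(dim=-1, keepdim=True)

x = torch.tensor([[10.0, 1000.0, -5.0]])
print(naive_softmax(x))    # contains nan: exp(1000) overflows
print(stable_softmax(x))   # well defined, matches torch.softmax(x, dim=-1)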
So if you look at the cglib paper, they say that note that Due to the asymmetry of the softmax loss the normalization is also independently performs two times So first of all to make the softmax numerically stable, we need to go through each single vector calculate the maximum Then we need to calculate the softmax but then we also need to calculate the softmax by rows and then by columns why because this Matrix here is not symmetric.
So as you can see This is image number one with all the text and this is Text number one with all the images and this item here is not equal to this item here Because this is image number one with the text number two, and this is image number two with the text number one Because it's not symmetric means that you need to calculate the softmax for each single rows And then you need to calculate it for each single column and then you can calculate the loss So the problem with the clip is that it's very computationally expensive to calculate this loss this contrastive loss that's why in the cglib paper they propose to replace the Cross entropy loss with the sigmoid loss So with the cglib what we do is as follows Again, we have an image encoder that converts a list of images into a list of embeddings one for image image Then we have list of text which convert each text into a list of embedding one for each text Then what we do We calculate this all the possible dot products So the image number one with the text number one image number two with text number two and also image number one with text Number two text number three text four text five blah blah.
So all the possible dot products between all these embeddings then instead of treating the loss as a distribution over a row or a Column or a row So we don't say in this row in this column I want this item to be maximum or in this row. I want this item to be maximum We use what is known as binary We use it as a binary classification task using the sigmoid loss In which each of these dot products is treated independently from each other So this is considered a single binary classification task in which we say okay this item here should be one This item here should be zero.
This item here should be zero. This item here should be zero independently of what are the other items This one here should be zero. This one should be here zero, etc, etc, and we can do that with the sigmoid function So as you can see, this is the function the signature expression of the sigmoid function It takes as input this value called z which will be the dot product of our vectors And the output of the sigmoid is this stuff here, which is a number between zero and one So what we can do is we take each of these dot products.
We run it through a sigmoid And then we force the label to be one for corresponding Text and images and zero for not corresponding ones. So each of these dot products now becomes a independent binary classification task basically this allow us to Grow the batch size to millions of items and also to parallelize because we can put this block here into one device And it can calculate it independently from this other device because they do not need to calculate any normalization Constant for each item or the maximum item in each row or column because each of them is independent from the others Now you may be wondering why are we even using a contrastive vision encoder I mean Why cannot we just use an ordinary vision encoder that just takes an image and instructs some kind of embeddings that capture the information?
of this image? Why do we want it to be contrastive? Because we want these embeddings to not only capture information about the image; we want them to be a good representation that can be contrasted with, or used alongside, text embeddings. And this is exactly what we do in a vision language model: we extract some image embeddings, which are vectors representing (as we will see later) patches of the image. You need to think of this image as being divided into a grid: first, second, third, four, five, six... so in this case we produce, for example, nine embeddings, nine vectors, each of them representing information about one patch of the image. We want these embeddings not only to represent the information of these patches, but also to be usable together with text, which is what we do in a visual language model: we have some prompt and we kind of contrast it with the image embeddings to produce an output. It is not really contrastive learning in this case, because we are using the image as a condition (we will see later how these things are merged), but we want a vision encoder that has already been trained to be used with text, because it provides a better representation of the image for being used along with text. That's why we use a contrastive vision encoder.
That's why we use the contrasting vision encoder also, we use them because they are cheaper to train so You can basically to train a contrasting vision encoder You just need to crawl billions of images from the internet Each of them already has a kind of a description because you can for example in wikipedia You always have the description of each image, but also the internet when you have an image you always have the html alt text It's called Which is the alternative text that is displayed when the image is not shown So you always have access to some kind of description Now, of course this vision encoder may be noisy because they we crawl stuff from the internet Which means that this stuff may not always be correct So sometimes you see a picture but the description displayed is not correct or maybe the crawler didn't get the correct information But because we train it on billions and billions and billions of images eventually it learns a good representation of this image So this vision encoder that we will be using is basically a vision transformer.
So now let's talk about the vision transformer Let's talk about it here So the vision transformer is a transformer basically that was introduced in this paper and image is worth 16 by 16 words In which basically they train a transformer as follows. So first of all, what do we How does a transformer work?
we will see later in detail what is the Attention mechanism, but for now, I just need you to remember that the transformer model is a sequence to sequence model which means that you feed it a sequence of embeddings and it outputs a sequence of contextualized embeddings What we do to encode an image with the vision transformer we take an image and we Split it into patches and in this case, for example, we can split into 16 patches So this is the first group of pixels.
This is the second group of pixels This is the group of pixels on the bottom right of the image. This one is on the top right top right, etc, etc we extract Information about this patch using a convolution So when you run a convolution you can extract information about a group of pixels from the image And then for example, this one will produce this output This one the convolution of this patch will produce this output.
The convolution of this patch will produce this output, etc, etc And then we flatten them. So we lose the positional information We just take we don't care if this four is the top right or the bottom left We just concatenate them one with each other We do we lose the two dimensionality in this case basically so we transform into a sequence of patches instead of being a grid of patches Then we add this position information so we say that okay, this is the patch number one So, how do we do that?
This patch basically the embedding of this patch that will be the result of this convolution will be a vector We add to this vector another vector that tells the model Hey, this is the patch number one and this is the patch number two, and this is the patch number three, etc, etc So we do that by adding so this plus operation you can see here and unlike the Vanilla transformer or the transformer model that we see for language models These positional encodings are not calculated using sinusoidal functions, but they are learned So they are vectors that get added always so the positional encoding number one always gets added to the top left Patch the positional number two always gets added to the second patch from the top left, etc, etc The positional encoding number 16 gets added always to the bottom right patch So the model Has kind of access to this to the 2d representation of the image So the model will learn basically that the patch number 16 is always on the top right and this is always on the top left We feed it to the transformer So this is a series of embeddings because the sum of two embeddings is a series of embedding We feed it to the transformer model for now Let's consider it as a black box and later when we code it, we will explore each layer of this transformer The transformer what it does it does the contextualization of these embeddings So at input we have this each series of embeddings each of them representing one single patch The output of the transformer through the attention mechanism will be a series of embeddings again But each of these embeddings is not only capturing information about itself, but also about other patches In language models, we do what is known as We use in the attention mechanism.
We use what is known as the causal mask: the first embedding should capture information only about itself, the second one about itself and the previous one, the third about itself and the two previous ones, the fourth about itself and the three previous ones, etc. That is what we do with language models. But with vision transformers (not the language part) we don't care about the model being autoregressive, so we don't want these patches to only encode information about the previous patches, because in an image there is no autoregressive ordering. It's not like patch number 16 of an image depends only on the previous patches while patch number one depends on nothing else. Imagine you have an image in which the sun, the light source, is here: then this part here will be illuminated, so the illumination here depends on something that comes later in the image. In an image we don't have this autoregressive relationship, while in text we do, because we write text from left to right (or right to left), and each word we write depends on what we have written previously. But this doesn't happen with images.
So basically this contextualized embeddings They capture information about themselves, but also all the other embeddings and We use this contextualized embedding to capture information about each patch But also how it is present in the image. That's why we want them to contextualize So we want each patch to include information about its position, which is given by the positional encoding But also about what is surrounding this patch in the image By contextualizing them.
So when we code it, this will be more clear for now. I just want you to get a Idea of what we are going to code. So we are going to code a model that will take an image will apply a convolution To extract a series of embeddings. You can see here.
We will add a positional encoding to these, which is learned, and we will apply the attention mechanism, which will be a series of layers of the transformer model that contextualize these embeddings. Then we will use these contextualized embeddings as input to the language model, for decoding the output of the language model. So let's finally start coding. Now, in this video I will be using a slightly different approach: I will not be writing each line, I will be copying each line and explaining it step by step, because I want this video to be more about explanation than just coding; I want to use the code to explain what happens under the hood. So let's create our first file, which is modeling_siglip.py, and let's start by importing the stuff we need. Then we create our first class, which is the SiglipVisionConfig. What is this? Basically, we will be using this vision encoder, and this vision encoder needs some configuration. Why do we need a configuration class? Because PaliGemma comes in different sizes. Let me put this here.
Okay Polygamma comes in different sizes Which means that each of this size of polygamma each of these models polygamma models has a different configuration for its vision encoder So let's see each of them The hidden size basically it's the size of the embedding vector of this vision transformer that we are going to encode the intermediate size is the Linear layer that we use the size of the linear layer that we use in the feed-forward network The number of hidden layers is the number of layers of this vision transformer The number of attention heads is the number of attention heads in the multi-head attention The number of channels is how many channels is each image has which is RGB The image size is because polygamma comes in I remember three sizes.
So 224, 448 and 896 something like this The default information that we put here is the for polygamma 224 Which supports of course image of size 224. So if you provide any image, it's first get resized into 224 by 224 The size of each patch. So what is the number?
It will be divided each image will be divided into patches. Each patch will be 16 by 16 and the this way is a Parameter for the layer normalization. We will see later The attention dropout is another parameter that we will not be using in the attention calculation Basically, it's a dropout that we use in the attention, but we will not be using it And the number of image tokens indicates how many output embeddings this attention mechanism will this transformer vision transformer will output which is the how many Image embeddings we will have for each image Now before we saw that each an image encoder is something that converts an image into one single embedding So that represents all the information about that image but in the case of the vision transformer we can use all the output of the vision transformer to have because as we saw before Vision transformer is a transformer model.
So which takes as input A list of embeddings and it outputs a contextualized embedding So each of these contextualized embedding will be the tokens of our image so it will not be one single embedding that represents the whole image, but Lists of embeddings that represent a patch of each image, but also information about other patches through the attention mechanism But we will see this later.
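For reference, the configuration class I have been describing looks roughly like this; the defaults here are the ones mentioned for the 224 variant, so treat it as a sketch rather than the exact file.

class SiglipVisionConfig:
    def __init__(
        self,
        hidden_size=768,           # size of each embedding vector
        intermediate_size=3072,    # size of the linear layer in the feed-forward network
        num_hidden_layers=12,      # number of transformer layers
        num_attention_heads=12,    # heads in multi-head attention
        num_channels=3,            # RGB
        image_size=224,            # every input image is resized to this
        patch_size=16,             # each patch is 16x16 pixels
        layer_norm_eps=1e-6,       # epsilon used by layer normalization
        attention_dropout=0.0,     # dropout in attention (not used here)
        num_image_tokens: int = None,  # how many image embeddings the encoder outputs
        **kwargs,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.image_size = image_size
        self.patch_size = patch_size
        self.layer_norm_eps = layer_norm_eps
        self.attention_dropout = attention_dropout
        self.num_image_tokens = num_image_tokens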
So now this class is very very basic. It's just a configuration of our cglib Now let's start by coding the structure of this vision transformer. So let me copy this stuff here How to follow this video now I I am copying the code because I have already written before and I want to explain it instead of Coding it because I also allows me to copy the comments and also allows me to avoid any mistakes while coding it But I recommend that you code it from scratch.
So you take this video and you just type whatever I am pasting here This is the best way to learn because it's like when you study a mathematical proof You should not just watch the proof on the piece of paper Because even if it you think it makes sense to you It doesn't actually because when you write it by hand, so when you code each of these lines by hand Your mind will think why am I typing this?
Why am I writing this? Why am I multiplying this number by this number? Why am I? Calling this function so you question yourself when typing That's why I recommend that you type this code while I am pasting it I do it by pasting otherwise this video will be 20 hours so The first thing that we do is we create this vision Model, this vision model is made up of a transformer and it has a configuration So basically what we are doing is we take the pixel values of this our image, which will be loaded with NumPy So when you load an image with NumPy it gets converted into an array that is channeled by height by width But we can have a batch of images.
That's why we have a batch size here. So the batch dimension And our vision transformer will convert this into a batch size NumPatches Which is how many NumImage tokens we have here and each Vector will be of a fixed dimension called embeddim here So basically our vision model will take an image as you can see a batch of images and it will give us a batch of List of embeddings one list of embeddings for each image where each embedding is a vector of size embeddim Okay.
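In code, this wrapper is roughly the following (a sketch; SiglipVisionTransformer is what we build next):

import torch
from torch import nn

class SiglipVisionModel(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.vision_model = SiglipVisionTransformer(config)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # [batch_size, channels, height, width] -> [batch_size, num_patches, embed_dim]
        return self.vision_model(pixel_values=pixel_values)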
Now let's code the vision transformer, which is very simple also So let's do it also step by step actually so this vision transformer is basically a Torch layer Where we pass the configuration we save this embeddim, which is the hidden size We saw before which is the size of this embedding vector We first need to extract the embeddings from this We need to extract the patches from this image, which will be done with this layer.
We will call SigLip vision embeddings Then we will run it through a list of layers of the transformer Which is this SigLip encoder because it reminds the encoder of the transformer Which is a series of layers of transformer and then we will have a layer normalization and we will see later how layer normalization works The forward method is very simple So the forward method is basically we take these Pixel values, which is the image which is a patch of images and we convert them into embeddings, which is Which basically means that we are extracting the patches from these images.
So let's visualize it here So what we are doing with this Image embeddings we are taking these images. We will run a convolution here to extract patches Then we will flatten these patches and add the positional encodings And this stuff here will be done by this SigLip and vision embedding then we take these embeddings which are Patches plus the positional encoding and we run it through this encoder, which is a list of layers of the transformer So this stuff here is our encoder.
What is the encoder? Well, the encoder is a list of layers of the transformer So you can think of it as being a list of these layers here. Actually these layers here one after another which includes a multi-head attention, a normalization, a feed-forward network and the normalization In the case of the visual transformer the normalization is done before the feed-forward and before the multi-head attention, but that's the only difference So this part here, so a series of layers is called the here We call it the encoder because it resembles the encoder side of the transformer And then we have a layer normalization.
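So the vision transformer I just described is, roughly, this (a sketch using the same names; SiglipVisionEmbeddings and SiglipEncoder are coded afterwards):

import torch
from torch import nn

class SiglipVisionTransformer(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        embed_dim = config.hidden_size
        self.embeddings = SiglipVisionEmbeddings(config)   # patches + positional encodings
        self.encoder = SiglipEncoder(config)               # the stack of transformer layers
        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: [batch_size, channels, height, width]
        hidden_states = self.embeddings(pixel_values)      # [batch_size, num_patches, embed_dim]
        hidden_states = self.encoder(inputs_embeds=hidden_states)
        return self.post_layernorm(hidden_states)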
So now let's go to code this vision embeddings So we want to extract information about these patches Let's do it. Where are the vision embeddings? Here. Okay All right, so The vision embeddings is basically, okay Taking again the configuration because each of these models needs to have access to the configuration because they need to extract different Information from this configuration.
So we have the embedding size, which is the size of the embedding vector, which is the hidden size The image size is how big is the image? And the patch size is how big is the patch that we want to get from this image. So basically we are talking about this In this case the patch size I remember is a 16 Which means that we are going to take this patch here is going to 16 by 16 pixels How do we extract these patches?
We do that through a convolution that is a 2d convolution, which it takes as input The number of channels of the image so three channels are gb and it produces all channels equal to the embedding size So the hidden size The kernel size so as you remember the convolution works like this, so let's use the ipad actually to draw so The convolution works like this.
So we have an image Which is made up of let's say pixels. So suppose this is the grid of pixels And we have a lot of them Basically the convolution works like this imagine the kernel size is three by three So we take a three by three group of pixels.
We apply this convolution kernel So if you are not familiar with how convolutions work, I will not be reviewing that here But basically it means that we have a matrix here You multiply each number of this matrix by the value of the pixel on which it is applied to it will produce features one feature And then you slide this kernel to the next group of pixel then you slide it again Slide it again, etc, etc, and it will produce many features in the output features However at as input we have three channels which you can think of it as three Parallel images one that is only red one that is only green and one that is only blue We run this kernel on all of these channels and it will produce Features how many kernels do we have?
Depending on how many output channels we want. So for each output channel, we have a one kernel that is We have three kernels actually that is used for one for each of this number channels The stride tells us how we should slide this Kernel from one group of pixel to the next and we are using a stride that is equal to the patch size of the Kernels, which is equal to the kernel size.
So which means that we take the first oops We take the first group of let's say three by three kernels Then we skip three kernels to we slide it to the next group of three by three. So there is no overlap So we take this kernel here Then we slide it to this group of pixel here Then we slide it to this group of pixel here so that there is no overlap.
So basically what we are taking is list of features each extracted by a independent patch of this image that we run the kernel on And the padding if valid means that there is no padding added So basically this patch embedding is extracting information from our image patch by patch Where there is no overlap between these patches.
How many patches do we have? Well, it's the size of the image which is 224 in the base version of PaliGamma divided by the patch size So image size is the number of pixels divided by how big is each patch and then to the power of two because we have Along two dimensions this image.
So we run the patch. The patch is It's a square. So it's a 16 by 16 or 3 by 3 or whatever the number patch size is How many positions we have? So how many? Positional encodings we need well It's equal to the number of patches that we have because we need to encode information about where this patch came from So how many positional encodings we need equal to the number of patches that we have And what is each of this positional encoding?
It's a vector of the same size as the patch embedding, so its size is embed_dim, as you can see here, and it's a learned embedding: a positional encoding that is learned. How many do we have? We have num_positions of them, each with this size here, and we will see later that each of them is added to the information extracted by the convolution, so that each convolution output encodes information about where it came from in the image. We also register these position IDs in the module, which is just a list of numbers that we will use later: a range of numbers between zero and num_positions minus one. Now let's implement the forward method. This is the reason I like to copy and paste the code: I can copy all the comments without typing them one by one, otherwise it would take me forever.
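Putting together everything described so far, the __init__ of this embeddings module looks roughly like this (a sketch with the names used above):

import torch
from torch import nn

class SiglipVisionEmbeddings(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        # A convolution with stride == kernel size extracts non-overlapping patches.
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            padding="valid",   # no padding added
        )

        self.num_patches = (self.image_size // self.patch_size) ** 2   # e.g. (224 // 16) ** 2 = 196
        self.num_positions = self.num_patches
        # One learned positional vector per patch, same size as the patch embedding.
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        # The indices 0 .. num_positions - 1, stored as a non-trainable buffer.
        self.register_buffer(
            "position_ids",
            torch.arange(self.num_positions).expand((1, -1)),
            persistent=False,
        )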
Otherwise, it will take me forever So what we do now is okay. We had our image which is a pixel values here The pixel values came from noon pi so we will see later how we load the image but basically you have to think that you load the image with noon pi and noon pi loads a Batch of images, which is a channel height and width.
It's a tensor with three channels and with the height of the image and the width of the image We will see that this Height and width is equal to the same because we resize each image to the input size of the image expected by the model So we will resize in the case.
We are using the smallest polygama. We will resize each image to 224 by 224 We extract this patch embeddings to this convolution so you can see here So this will basically take our image which is a batch of images and convert it Into a list of embeddings of this size So each image will be a list of embeddings of size embed dimensions How many patches we have well the number of patches For the height and the number of patches for the weight In this case, it will always be the same so you can think of it as a number of patches a total number of patches Each of patches with the dimension embedding dimension And as we saw before we flatten these ones, so we extract them here.
Let me delete it So we extract these patches So we run the convolution and then we flatten them here So basically the convolution will give us 1 2 3 4 5 6 up to 16 or whatever the number of patches is and then we convert it into a tensor where the The patches are flattened So the first patch is here and the last patch is the last element of this tensor and this is what we do here Here because the output of the convolution is a 2x2 grid, but we don't want a 2x2 grid We only want a one-dimensional long list of patches and this is done by this flatten method here Then we transpose because we want the number of patches to come before the embedding dimension Because as input to the transfer we need to give a sequence of embeddings So that's why we want this num_patches dimension to come before so that it becomes a batch of sequence of embeddings and each embedding is a vector of size embedding dimension Each of these embeddings we add the positional encodings which positional encodings?
Well the position Extracted from this embedding layer But which embedding do we want to extract? All the embeddings. So from 0 to Suppose we have 16 patches from 0 to 15 What is the where is this information 0 to 15 is in this self dot position and this which is a range So as you remember a range is just a generates a list of numbers between 0 and the argument minus 1 So we add we extract this the all the positional encodings from this position embedding Layer, which is this embedding layer here.
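So, continuing the SiglipVisionEmbeddings sketch from before, the forward pass is roughly:

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: [batch_size, channels, height, width]
        patch_embeds = self.patch_embedding(pixel_values)   # [B, embed_dim, H/patch, W/patch]
        embeddings = patch_embeds.flatten(2)                # [B, embed_dim, num_patches]
        embeddings = embeddings.transpose(1, 2)             # [B, num_patches, embed_dim]
        # Add the learned positional encoding for each patch position.
        embeddings = embeddings + self.position_embedding(self.position_ids)
        return embeddings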
We add it to the embeddings So what we are doing basically is we flatten this embedding We did that before then we add a positional encoding vector extracted from the positional encoding layer And these positional encodings are learned. So learned why because this embedding layer here is a list of embeddings That when the model is trained these embeddings will change according to the need of the model and basically we encode them So it's not like we are telling the model.
This is position number one. This is position number two We add another embedding that is added to this patch each of these patches And then the model will learn to modify this positional embedding vector in such a way that they should encode the position Information because each of this position embedding is always added to the same patch So the first patch always receives the position number zero the second patch always the position number one We hope that the model actually tries to change this position embedding in such a way that they encode the positional information and actually it does because the model actually learns then the to relate Patch with each other by using their positional information And the only way for the model to do that is to change this position embedding in such a way that they encode the position information If you remember from the vanilla transformer, we use the sinusoidal functions So if you want to look at the original transformer if you remember here We have this position information Where is it here?
So we create this position encoding using sinusoidal functions So instead of learning them we actually pre-compute them and then we force the model to learn the pattern Encoded by these sinusoidal functions in this case. We are not forcing the model to learn any pattern We want the model to create the pattern that is most useful for the model itself so we hope that the model will try to create this embedding layer in such a way that it creates some embeddings that are helpful for the model to to understand the position information and this is the meaning of position embedding Now we skipped before the normalization layer.
So let's go actually to Understand what is normalization and how it works so that we always don't leave anything behind that is not explained All right. Let's talk about normalization. So imagine we have a list of linear layers Now a linear layer is defined by two parameters One is called the input features and one is called the output features Imagine we have input feature is equal to four and output feature is equal to four Actually, there is another parameter called bias So it indicates if the linear layer also has a bias term and suppose that it's true To the input of the linear layer usually we have a batch of items and each item is made up of features Suppose that for now as input there is only one item and it's made up of four features And as you can see the input features are four What will happen with four output features is this the linear layer you can think of it As a number of neurons where the number of neurons equal to the number of output feature of this linear layer what each neuron does is basically it has a weight vector As you can see here made up of four weights How many weights does it have?
Well equal to the number of input features that this layer accepts So which is a four What each neuron will do it will do the dot product of the incoming vector So the input vector x multiply dot product with the weight vector of this neuron plus the bias term Which is one number for each neuron And this basically dot product plus this bias will produce one output feature Because we have four neurons.
We will have four output features So each neuron will do the same job, but each neuron will have its own weight vector and its own bias number So this one here will have its own weight vector different from the other ones and its own bias term here Then suppose that we have another Vector that takes as input four features and produces two output features So you can think of it as a linear layer with the two neurons where the first neuron has a weight vector made up of four numbers because The incoming vector has four features and then one bias term here It will produce an output vector of two items The first item will be this number here and the second item The second dimension will be the dot product of the weight vector of this second neuron with the input vector plus the bias term of the second neuron Now, what is the problem with With the linear layers, but actually with all layers in general The problem is this it's called the covariate shift.
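Just to map that picture to code before we get to the covariate shift problem, the two layers described above are simply the following (illustration only):

import torch
from torch import nn

layer1 = nn.Linear(in_features=4, out_features=4, bias=True)
layer2 = nn.Linear(in_features=4, out_features=2, bias=True)

x = torch.randn(1, 4)    # one item with four input features
hidden = layer1(x)       # [1, 4]: one dot product + bias per output neuron
out = layer2(hidden)     # [1, 2]
print(out.shape)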
The problem is that When you have an input vector That changes from one batch to another in magnitude Then the output of the layer will also change in magnitude a lot depending on what is the incoming vector So for example, imagine this the first input vector is all the numbers are more or less around one and two And the output is also more or less around suppose around two Then if the next vector that is coming to this layer is Much different in magnitude from the first one then the output will also be much different in magnitude And this is a problem for the model So the problem is that if the input of a layer changes, then the output of this layer will also change a lot So if the input changes drastically the output will also change a lot drastically then because the loss of the Of a model during training depends on the output then the loss will also change a lot because the loss Then determines the gradient during backpropagation It means that if the loss changes a lot then also the gradient will change a lot and if the gradient changes a lot Then because the gradient determines how we update the weights of the model during training then also the update of these weights will also change a lot so basically what happens is that the if the input the distribution of the Dimensions of this vector that is coming to the input of a layer Changes drastically from one batch to the next Then the output of the model will also change and then the loss will change then the gradient will change then the update of the weights Will change so what we will see that the loss will oscillate a lot And also the weights will try to keep up with this changing input distribution Which basically will result in a model that trains slowly.
So here I have made a simple How to say Summary of what is happening So a big change in the input of a layer will result in a big change in the output of a layer which will result In a big change in the loss of the model which will change result in a big change in the gradient Of the during black propagation which will result in a big change in the weights of the network And what is the result of this is that the network will learn very slowly because the network will spend most of its Time but okay most of the effort trying to keep up with this distribution change in the input Instead of actually learning the features How to map the input to the output So the the first solution to this problem was batch normalization, which was introduced in this paper And with batch normalization what we do basically is that we have usually not a single item as input We have a batch of items suppose that we are training a classification image classification model So we have as input a list of images For example the image of a cat the image of a dog of a zebra of a tree of a stone etc, etc So you can think these are the dimensions of the vector that represent the cat These are the dimensions of the vector that represent the dog.
These are the dimensions of the vector that represent the zebra etc, etc So what we do with batch normalization is that we calculate a statistic For each dimension of each item Which statistic do we calculate the mean and the the variance and then we Normalize each item by subtracting the mean and divide it by the standard deviation this will basically make each Dimension of each item be distributed According to a Gaussian with mean zero and the variance of one so basically what will happen is that each if we normalize each number if Because the image of a cat is much different from the image of the zebra Because the color distribution is different.
The rgb distribution is different. So the pixel intensity is much different from each other What will happen is that the model will not see this change in magnitude but it will see And also will not see a change in distribution because all of these items will be distributed according to a mean of zero and the variance of one So what will happen is that the model will oscillate less in the output.
So it will oscillate less in the loss So it will oscillate less In the gradient, so it will make the Weights of the model oscillate less So the model the training will be more stable. It will be it will converge faster basically this way. So To summarize Why do we need normalization is because the input of the model which depends on imagine you are training Classification or the image classification model then the input depends on the image and the image can be much different from each other If the image changes a lot, we don't want the model to feel this change in magnitude of the input We want the distribution of the inputs to be remain constant.
Let's say, so that the model doesn't oscillate, so that this doesn't force the model to just keep up with this change in distribution. How do we do that? We try to keep the distributions constant, so we always try to have the input features distributed according to a fixed distribution with mean 0 and variance 1, and we do that with this formula here, which comes from probability and statistics: basically, if you take a distribution, subtract its mean and divide by the standard deviation, the result will have mean 0 and variance 1 (and it will be a Gaussian with mean 0 and variance 1 if the original distribution was Gaussian). And this will basically result in a more stable training. Now, batch normalization actually works fine.
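To make this concrete, here is a tiny, self-contained sketch (toy numbers, not the code we will write for the model) of the standardization x_hat = (x - mean) / sqrt(variance + eps); the only thing that differs between batch normalization, described here, and the layer normalization we will discuss in a moment is the dimension over which the mean and the variance are computed:

```python
import torch

# Toy batch: 4 items (cat, dog, zebra, tree), each represented by 8 features.
x = torch.randn(4, 8) * 5 + 3          # arbitrary scale/shift to mimic inputs with different magnitudes
eps = 1e-5

# Batch-normalization style: one mean/variance per feature, computed ACROSS the batch (dim=0).
bn_mean = x.mean(dim=0, keepdim=True)               # shape (1, 8)
bn_var = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + eps)

# Layer-normalization style: one mean/variance per item, computed ACROSS its own features (dim=-1).
ln_mean = x.mean(dim=-1, keepdim=True)              # shape (4, 1)
ln_var = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + eps)

print(x_bn.mean(dim=0), x_bn.var(dim=0, unbiased=False))    # ~0 and ~1 for every feature
print(x_ln.mean(dim=-1), x_ln.var(dim=-1, unbiased=False))  # ~0 and ~1 for every item
```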
However, it has a problem, and the problem is that with batch normalization each of these statistics, so the mu and the sigma, is calculated along the batch dimension. So we calculate the mu and the sigma for the dimension number one of each of these vectors along the batch dimension.
So basically, to calculate this mean we are summing up the first dimension of each of these vectors and dividing by the number of items that we have. So we are mixing the features of different items: we are mixing the dimension number one of the cat with the dimension number one of the dog. So basically, to have good results we need to use a big batch, because if we use for example a cat and a dog it will result in one mean, but imagine in the next batch we have the cat and the zebra: it will result in a completely different mean. And then suppose in the next batch we have a cat and a tree: maybe it results in yet another mean. So we will still have this problem of covariate shift, because the mean is changing a lot between each iteration. The only solution to this, actually, is to use a very big batch size, so we are forced to use a big batch size in order to alleviate this problem. To avoid this mixing of the dimensions along the batch dimension, we introduce layer normalization. With layer normalization, what we do is, instead of calculating the statistics along the batch dimension, we calculate them along the item dimension. So the mu and the sigma that will be used to standardize the cat will only depend on the dimensions of the cat, not on whatever the cat comes with. We are still doing each item minus its mean divided by the standard deviation, but instead of this mean and this standard deviation coming from the first dimension of each item in the batch, they come from the average over all the dimensions of each item, independently from the others. So it doesn't matter which other item the cat comes with: it will always result in more or less the same mu and the same sigma. And this makes the training even more stable, because we are not forced to use a big batch size. And this is why we use normalization. Okay, we have seen what normalization is; now we should implement this thing called the encoder, so this is the SiglipEncoder. Now, the encoder is made up of multiple layers of the transformer model, and the architecture, if you look at the vision transformer paper, is more or less like this. I changed it a little bit because I wanted to use the exact names that we will be using. So what we have so far is this thing called the SiglipVisionEmbeddings, which is basically taking the image, taking some patches of this image using a convolution, and each output of this convolution is used as an embedding.
It's a vector And this embedding vector is added to another Vector called the positional encoding which is learned and then we feed this stuff to this thing called the encoder So we convert it into embeddings at the positional encoding then we feed it to the encoder And at the input of the encoder you need to think that we have These layers repeated n times here.
It's written l times One after another such that the output of one becomes the input of the next layer the thing that you need to understand about the transformer is I repeat it is that the transformer is a sequence-to-sequence model that converts a sequence of embeddings into contextualized embeddings What does it mean?
It means that at the input you have a list of Here embeddings each representing a patch of the image as an independent patch So this embedding here only captures information about the first group of pixels This embedding here captures all information about the second group of pixels, etc, etc, etc But then some through some magic called Attention mechanism this contextualized these embeddings become contextualized at the output of the transformer and we will see in detail this attention mechanism Such that this embedding here at the output of the transformer the first embedding is represents information about the first patch plus other it includes information not only about the first part but also about other patches And so is the second the third the fourth and the last one So they become contextualized in the sense that they capture information about the context in which they appear Which is different from language models in which each token captures information about the previous tokens in the case of the vision transformer Each patch includes information about all the other patches Now each of these layers is made up of so we have the this is the input of the encoder let's say And we will have the first layer of this encoder The first thing that we do is we apply a layer normalization and we saw how it works and why we use it The output of this layer normalization is a cop First the input of this linear normalization is saved for a skip connection that we do later Then the output of this layer normalization is sent to the self-attention mechanism It's this one here and this self-attention mechanism takes the output of the layer normalization as a query key and values It calculates the attention just like the usual formula So softmax of the query multiplied by the transpose of the key divided by the square root of the model multiplied by v etc etc The output of this self-attention is then summed up with this skip connection here Then the output of this summation is sent to this layer normalization along with the skip connection that is used later Then the output of the normalization is sent to this multi-layer perceptron, which is a list of linear layers We will see later and then we do another summation here with the skip connection plus the output of the multi-layer perceptron And then we do another layer like this and another another another and the output of the last layer is the output of our vision transformer.
So as you can see the vision transformer takes as an input an image converted into patches. Patches are then fed to this Encoder which is a list of layers and the output is a contextualized patches or embeddings of these patches So let's code this encoder, which is basically this structure here And we will code each part of this structure and while coding each part we will go inside on how it works So the normalization we already know how it works, but we still have to explore what is this stuff here called the self-attention What is this stuff here called multi-layer perceptron?
I believe it's convenient for us to go first through the multi-layer perceptron and then go to the self-attention, because the self-attention is a little longer to do. So let me do the simple part first. Okay, let's code this encoder. Now I will copy the first part, this one here, so let's copy it here. So again, the constructor takes the configuration, we save some stuff, which is the hidden size, and then we have a block called the self-attention block; here it's called SiglipAttention.
Now, a note about the naming I'm using: I am using the same names as the HuggingFace implementation, for one simple reason, which is that I want to be able to load the pre-trained weights from HuggingFace. The pre-trained weights for PaliGemma are available on the HuggingFace Hub, so we want to be able to load them. But each of these pre-trained models has a dictionary of weights, where the dictionary tells you where to load each of these weights, and if the names do not match you need to create some conversion script. I didn't want to do that, and it would also just complicate the code uselessly, so I just use the same names so that we can load the pre-trained weights from HuggingFace directly. Also because my code is based on the HuggingFace implementation: to create my code I used the HuggingFace implementation, but simplified a lot. For example, I remade my own KVCache.
I did a lot of modifications to simplify it, but it's based on the HuggingFace implementation anyway. So we have this thing called the self-attention, then we have a layer normalization — we saw how it works — and we have this first layer normalization here, then we have this multi-layer perceptron, which is this stuff here, and then we have another layer normalization, which is this stuff here.
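As a reference, here is a minimal sketch of what this constructor looks like; the config field names (hidden_size, layer_norm_eps) follow the HuggingFace-style naming used here, and the two nn.Identity placeholders stand in for the SiglipAttention and SiglipMLP modules we will implement later, so treat the exact details as an approximation:

```python
import torch.nn as nn

class SiglipEncoderLayer(nn.Module):
    # `config` is assumed to expose hidden_size and layer_norm_eps.
    # The nn.Identity placeholders stand in for modules implemented later.
    def __init__(self, config):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.self_attn = nn.Identity()   # -> SiglipAttention(config), coded later
        self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
        self.mlp = nn.Identity()         # -> SiglipMLP(config), coded later
        self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
```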
And then we have another layer normalization, which is this stuff here So we have two layer normalization. So now let's implement the forward method And the forward method I will copy it line by line so we can understand Okay this forward method. Now. The first thing we do is we save a residual connection, which is We basically save the input that we feed to this Encoder because we need to reuse it later.
So we are saving this skip connection because we will need to use it here later Then we run it through the layer normalization the input And it's done here. So the layer normalization does not change the shape of the input It's just normalizing each of these dimensions such that they they all come up It's like they came out from a Gaussian of mean zero and variance of one Then we apply this magic thing that we will explore later called the self-attention and the self-attention system Also does not change the shape of the input Tensor, but as we saw before the attention mechanism is something that takes as input Embeddings and gives you contextualized embeddings.
So it does not change the shape of these embeddings But we will implement it later. So for now just think of it as a black box that you feed in Embeddings and it gives you contextualized embeddings Then we have a residual connection and we can see that here. So this residual connection Skip connection was called Which is this first plus here So we are taking what we saved before with the output of the self-attention So what we saved before is this residual stuff here plus the output of the self-attention, which is this hidden states here This the result of the summation is saved again because there is another skip connection after I don't know why my alt tab is not working.
So we save again another skip connection — this stuff here — because later we need to use it here for the skip connection. Then we do another layer normalization, which also does not change the shape of the input tensor, and then we have this thing called the multilayer perceptron.
Now the multilayer perceptron is something that It's not easy to explain what is used for but basically The multilayer perceptron we will see later is a series of Linear layers that takes each input embedding and Transforms it independently from each other from the others So while in the self-attention there is kind of a mixing of the patches incoming so that you get contextualized In the multilayer perceptron, there is no mixing between these let's call them tokens or patches Each of them is transformed independently And the multilayer perceptron allow us to increase basically first of all it adds parameters to the model.
So the model has more Degrees of freedom to learn whatever it's trying to learn and the second Objective of the multilayer perceptron is that it allow to prepare Let's say prepare the the sequence of patches for the next layer. So if the next layer expect these patches to be somehow Different the multilayer perceptron allow to transform them Also, it adds a non-linearity.
So the multilayer perceptron also includes a non-linearity which adds Which basically allow as you know non-linearities allow you to model more complex transformations So if you just create a list of linear layers without any non-linearities that you cannot model complex functions so that for example in the classification you cannot Map non-linearly separable data, but with by adding Non-linear transformations you add complexity to the model.
So the model is able to map complex transformations. So the multilayer perceptron just adds parameters and this non-linearity, which is helpful to allow the model to learn whatever complexity it needs to map the input to the output. After the multilayer perceptron we have another skip connection, and then we return the output of this skip connection here; also, the skip connection does not change the shape of the tensors of the embeddings. Now, let's code first this multilayer perceptron, since it's the easiest part.
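Before that, here is a compact, self-contained sketch of the encoder-layer forward pass we just walked through; nn.MultiheadAttention is used only as a stand-in for the SiglipAttention module we will code later, and all the sizes are made-up examples:

```python
import torch
import torch.nn as nn

batch, num_patches, embed_dim, num_heads = 2, 16, 64, 8

layer_norm1 = nn.LayerNorm(embed_dim)
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)  # stand-in for SiglipAttention
layer_norm2 = nn.LayerNorm(embed_dim)
mlp = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(), nn.Linear(4 * embed_dim, embed_dim))

hidden_states = torch.randn(batch, num_patches, embed_dim)

residual = hidden_states                                   # save for the first skip connection
hidden_states = layer_norm1(hidden_states)                 # pre-norm
hidden_states, _ = self_attn(hidden_states, hidden_states, hidden_states)  # contextualize the patches
hidden_states = residual + hidden_states                   # first skip connection

residual = hidden_states                                   # save for the second skip connection
hidden_states = layer_norm2(hidden_states)
hidden_states = mlp(hidden_states)                         # per-token transformation
hidden_states = residual + hidden_states                   # second skip connection

print(hidden_states.shape)  # torch.Size([2, 16, 64]) -- the shape never changes
```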
It's the easiest stuff to do So let's do it uh Let's go here. I I will also always copy first the Constructor and then the forward method so we can explore a little bit the structure and then we explore the logic So this multilayer perceptron just like in the vanilla transformer is made up of two layers plus a non-linear transformation So the first layer takes each of the embeddings which are we we can also call them tokens or patches Because most of the time we are dealing with language models and expands them So each of these vectors which is of size hidden size is expanded into this thing called intermediate size Usually it's chosen as three times the hidden size or four times the hidden size I remember in the vanilla transformer it was four times the hidden size Then we apply a non-linearity to this expanded tensor and then we compress it back to the hidden size dimension So let's do the forward method now Which is this one here So the first thing we do is we convert each of these embedded dimensions into intermediate sizes So again, we have a batch of images Each image is made up of num_patches number of patches each of this patch is represented by a vector of size embedding dimension With the first fully connected layer, we are expanding each of these patches into the intermediate size and then we apply A non-linear transformation in this case.
Now, you may be wondering why we are using the GELU function rather than ReLU, SwiGLU or whatever other non-linearity there is. The reason is always practical: basically, there is no real rule of thumb for choosing the non-linearity for a specific case, there are just some heuristics. Historically, the transformer, when it was introduced, used the ReLU function as the non-linearity between these two fully connected layers, but then people explored other non-linearities and saw that they work better. Now, there is actually also some logic behind the choice of a non-linearity, because the non-linearity also defines the flow of the gradient. For example, if you look at the graph of the GELU function — let me draw it actually — the graph of the GELU function is something like this.
So Why I cannot draw it, okay So basically anything that is negative is zero. Let me use another color Anything that is negative is becomes zero basically and everything else is forwarded without any scaling So this means that if the input of the gelu function is negative the output will be zero and actually for any Negative input there will be no gradient because the gradient will be multiplied by zero.
So it will not flow That's why for example, we introduced the leaky relu and other like In the relu family, there are other Functions that allow also a little bit of gradient flow from the negative side So the non-linearity basically tells you How the gradient will flow during back propagation.
So having a non-linearity that allows That allows the gradient to flow back even when it's negative It means that the signal the model is not forced to always have the activation to be positive to have some Feedback from the loss function to optimize its weights And why we are using the gelu because people have tried it and probably it works better compared to the relu function for the same class of applications so in the vision transformer you see the gelu function, but In the lama, for example, they use the zwiglu function in other scenarios They use other functions and it's mostly based on heuristics on how they work in practice also, because a model is usually made up of billions and billions and billions of of parameters and it's not easy to find the regular regularity to understand why Specific non-linearity is working better than the other one Now, okay, then we apply the second linear layer Which is basically recompressing back this intermediate state into the embedding size and then we return it and this is our multilayer perceptron our next part is going to be we are going to code this attention mechanism for the vision transformer and we will see that it's Different than from those of language models because we don't have any causal mask or attention mask All right guys, so we have seen the multilayer perceptron now Let's go to the multi-head attention and for that I want to use the slides because I believe it's a little faster to explain on the slides and then we proceed with the code So what is the multi-head attention?
The multi-head attention is a way of contextualizing stuff Which means that you start with a sequence of for example patches and you can think we have for example Four patches each of this patch is represented by a single vector of 1024 dimensions So you need to think of this as a vector of 1024 dimensions.
So you need to think there are 1024 numbers in this row vector Then we have the patch number two the patch number three and the patch number four Each of this patch was extracted from a group of pixels from the initial image and it's only representing information about the patch It was extracted from so the part of the image it came from With the multi-head attention system.
We uh, what we mechanism what we are doing is we are contextualizing these patches Which means that the output of the multi-head attention is a tensor of the same size As the input so this is a tensor of size 4 by 1024 the output will be a tensor of size 4 by 1024, but where each of these Embeddings now does not capture information only about itself, but also about the other patches in the in the sequence This is for vision transformer for the language models we want something slightly different So for language models, we do have an input sequence, which is a sequence of tokens each token representing one single I don't want to use the term word because it's wrong but In my videos, I always make the simplification that each token is a word and each word is a token But this is not the case actually in tokenizer.
So usually a token can be just any sequence of characters Does not does not necessarily be um, it does not need to be necessarily a word But for us let's treat them as word. It's just simplifies the explanation so We have a list of tokens. Each token is represented as an embedding.
Let's say of 1024 dimensions So it's a vector of 1024 dimensions. So 1024 numbers for this one 1024 numbers for this one, etc, etc The multi-head attention in the case of language models What we want is we want to contextualize each token with the all the tokens that come before it So the output of the multi-head attention in the case of language models And this is this would be known as the self-attention mechanism with causal mask Is a sequence with the same shape as the input sequence So this vector this matrix here is a 4 by 1024.
So the output will be 4 by 1024 And each of these tokens is not capturing information only about itself But also about all the past tokens now the word I does not have any past token So it will only capture information about itself But the word love will capture information also about the token I because it comes before it and the word Pepperoni will capture information about I and love because they come before it etc, etc until the last token which capture information about all the sentence Why do we want to do this in language models?
Let me give you a little understanding of why we do it in this way with language models and why the transformer is revolutionary for language models This is going a little off topic with respect to the vision transformer But I think if you understand this then you will understand the big part of the transformer and why it even exists So let's copy this stuff here Let's open a new page Now what we do with the language models is you need to think that a language model is Something that we need to we retrain on what is known as the next token prediction task Which means that given a prompt the language model try to understand what is the next token that completes this prompt How do we generate text with the language model?
We start with some tokens, which are the prompt we generate the next token We put it back into the prompt and we ask again the language model What is the next token the language model gives us the next token? Then we put it back into the prompt and then we ask again.
What is the next token etc, etc So we need to train a language model to train a language model We need to train a model to predict the next token given the past tokens And the transformer allow us to do that in parallel when training Which means that we start with an input that is a series of embeddings Which are uncontextualized so we start with this one and each of these actually is one single token.
So this is only I this is only love This is a pepperoni And this is a pizza The output of the transformer of the self-attention mechanism will be a series of embeddings that are Uncontextualized in such a way that each token captures information of only about itself, but also about all the past tokens How do we train and the transformer can do it in parallel?
So the self-attention mechanism will take this as input and generate this output in parallel So it's not will generate one token at a time, but it will generate all of them in the in parallel using this multi-head attention How do we train a language model basically? As we saw before the language model is something that given a prompt needs to predict the output.
So what we want is that We can we take the input which is a This sentence here. We feed it to the transformer the transformer will transform it into a sequence of embeddings Contextualized embedding and then we need some labels to train this language model So the labels what will be well, we will we want whenever the language models Is given the word I to predict the word love So big, oh, I think i'm using not the pen here the word love whenever the Language model sees the word I love it should predict the word pepperoni Whenever it sees the word the sequence I love pepperoni it should predict pizza Whenever it sees the sequence I love pepperoni pizza It should predict the token end of sentence, which is a special token telling hey, I'm done with the generation Because the transformer can generate all of these contextualized embeddings in parallel we can also calculate the loss for each of these predictions in parallel and Calculate the with backpropagation updates the weights of the model to tell in parallel How the model should predict each of this token given the The previous tokens.
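To make the "labels are just the inputs shifted by one position" idea concrete, here is a tiny sketch with made-up token ids and random logits standing in for the model output; the point is only that the loss for every position is computed in a single pass:

```python
import torch
import torch.nn.functional as F

# Hypothetical token ids for "I love pepperoni pizza" plus an end-of-sequence token.
tokens = torch.tensor([[5, 12, 87, 43, 2]])   # [batch=1, seq_len=5], made-up ids
inputs = tokens[:, :-1]                        # "I love pepperoni pizza"
labels = tokens[:, 1:]                         # "love pepperoni pizza <eos>" (inputs shifted by one)

vocab_size = 100
logits = torch.randn(1, inputs.shape[1], vocab_size)  # stand-in for the model's output

# One cross-entropy over every position: the loss for all next-token predictions at once.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
print(loss)
```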
So when we are given a sentence and we train language model the language model can Can be trained With only one forward pass on how to predict the next token inside of this sentence given the previous tokens as context In only one single pass of the transformer. That's why the transformer is so powerful because this contextualization happens in parallel So we can calculate the output in parallel for each position And because we know already know what is the label because the label is just the next token given the previous tokens we can calculate the loss in parallel for each positions and the model will learn in parallel how to Generate exactly this sentence in in one pass only so the model will not learn to generate one token at a time given the previous but All the sentence in one pass and that's why it's so powerful Now let's go back to our vision transformer Okay, so we have seen what is the difference between the vision transformer and the language model So in the vision transformer, we want to contextualize tokens or patches In such a way that they capture information about all the other patches But in the language model, we want each token to only capture information about itself and the previous tokens How does this self-attention mechanism work?
We start with of course an input sequence. Our goal is to create an output sequence that is contextualized And there are many intermediate steps. So now we will see what are these intermediate steps one at a time so Let's start by creating the class of this this attention mechanism and we will create it.
Let's create it here Okay, so in the input we have the configuration of the model we save some stuff that we will need later So the hidden size the number of attention heads because we are dealing with multi-head attention Head dimension we will see later what is it and why it's used The scale is basically the if you remember the formula for the attention is The queries multiplied by the transposed of the keys divided by the square root of the model And this is one over the square root of the model So the stuff that we need to divide the query multiplied by the keys with Then we have this dropout which is zero.
I never saw it used in In polygamma, but I believe there are other cglib models that use it. So they they put it here But it you can think of it like non-existent for now and then we have these three linear layers called w, k, w, q and w, v which are Parameter matrices that are also present in the vanilla transformer We will see later what they are used for And then we have this output projection which in the paper of the transformer is called the wo matrix and we will see later What is it is used for?
Let's start by implementing the forward. So the forward method is this one What is the input of the forward method? Well, the input of the forward method of this attention mechanism is basically what Is the output of the layer normalization in this encoder layer class So the output of the layer normalization is fed to this self-attention mechanism So it is something of this shape.
So it's batch_size by num_patches by embedding dimension. What does that mean? It means that we have a batch of images, each of these images is made up of some patches — how many is defined by this number num_patches — and each of these patches is represented by a vector of size embed_dim. You can think of it as a vector of 1024 dimensions; I don't remember the exact number of dimensions right now. You can also think of this num_patches as a sequence length: before, we saw that a language model works on a sequence of tokens; here you can think of it as a sequence of patches, where the sequence length is this num_patches. The first thing that we do in the self-attention mechanism is take the input and run it through three transformations, one called w_q, one called w_k and one called w_v, and after we run it through these transformations the output will become query, key and values. So let's do it, and it's this stuff here: we take the input sequence, which is this hidden_states, and we run it through w_q — here it's called q_proj.
It's called the qproj Wk here is called the kproj w here is called vproj The shape of the tensor does not change. Basically. These are parameter matrices So they just add parameters to our self-attention that transform the input sequence so that they become query key and value So it's the query key and value is just a transformation of the input sequence.
However In this case each token still is independent from the other So there has been no contextualization happening with the linear layers. So linear layers always treat each token Independently from the others just like the multi-layer perceptron each token in the multi-layer perceptron is expanded and then reduced Here, it's not even not expanded nor reduced.
It's just transformed because the size is from embedding dimension to embedding dimension So it's just a transformation of the single token Why we want to do it? Because the self-attention mechanism needs to see the same sequence in three different ways as query key and value So we do three different transformations Later, we will see why they are called query key and values The second thing we do is basically we split this each of these tokens into smaller tokens How many smaller tokens based on how many heads we have and now we see why so let me do something strange Which is i'm not copying the entire line.
I'm copying a part of it so We take this query state Which is a tensor of batch size numpatches embedding dimension and we are splitting the embeddim dimension into smaller parts Called head dimension. How many of this head dimension we have? We have numheads Okay, let me copy it all otherwise, I think it's going to be confusing.
Sorry. We also have this transposition later; we will see how it works and we will visualize the tensor operations. We do it for the query, the key and the value, so let's do it and then we see what it is about. Okay, so let's go to the slides. At the input of this vision transformer we have a sequence of patches: you can think of it as a sequence of vectors, each vector made up of, let's say, 1024 dimensions, or you can think of it as a sequence of tokens, in case we are working with a language model, where each token is represented by a 1024-dimensional vector. The first thing that we do is convert this input sequence, which we will call x, into query, key and value, and we do it through three transformations: one is called
One is called Wq one is called wk and wbn Which is basically a matrix multiplication Now if you look at the shape of the input sequence here, it's 4 by 1024 So here you can see the input sequence is 4 by 1024 Where 4 is representing the sequence dimension So how many tokens or how many patches you have and the hidden size represents how many what is the size of this embedding vector?
We multiply it each of these with wq wk and wv Now if you look at the dimensions here wq wk wv they are The size is embedding dimension to embedding dimension. However here I have represented it as embedding dimension to 8 multiplied by 128 so The overall size is the same.
So it's 1024 by 1024. However, I'm splitting this second 1024 into eight groups, and later we will see why. So you can think of it as a matrix multiplication between this tensor here, 4 by 1024, and this other tensor, which is also 1024 by 1024, but in which the second dimension is split into sub-groups — how many? Eight groups, because eight is the number of heads we are going to work with, each having 128 dimensions. If you do this matrix multiplication, it will result in this output here.
So basically it's a (4 by 1024) multiplied by a (1024 by 1024): this 1024 dimension here cancels out, as you can see, and then we have the second dimension that remains, because in a matrix multiplication the inner dimensions cancel out and the outer dimensions remain. Now, if you are confused by this, you can think of it like this: it's still a 1024 by 1024, nothing has changed, I'm just grouping the dimensions — that's why it's possible — but this grouping is helpful, and now we will see why. Let's visualize this tensor operation at the matrix level. When we compute the query, so x multiplied by Wq, we have x, which is 4 by 1024, so it's a sequence of tokens where each token is 1024 dimensions, and we are multiplying it by a very big matrix, which is 1024 by 8 by 128.
How to visualize this matrix? Well, this is a wq. So it's a parameter matrix It's also wq and wv. So they all have the same dimensions You can visualize this like this. You can think of it as a matrix made up of 1024 rows Each row is made up of smaller vectors How many smaller vectors?
8 of them and each of these smaller vectors is made up of 128 dimensions The overall size of this matrix is still 1024 by 1024 But each of these let's say these vectors are split into 8 groups So that the output is also a matrix in which each of the Tokens is a split into multiple subgroups.
So it's a matrix that is 4 rows So as you can see, this is 4 is the number of rows Each row contains 8 groups of smaller embeddings and each of these smaller embeddings is made up of 128 dimensions So why are we even doing this? With multi-head attention, basically what we want to do if we want The multi-head attention is a way to relate tokens with each other We don't want to relate tokens to each other by watching the full embedding of each token We want to do it with 8 different heads Such that each head works with a smaller part of the embedding of each token So the head number 1 will only watch the first 128 dimensions of each token in the entire sequence The head number 2 will watch the next group of 128 dimensions.
So the dimension from 129 to 256 of each token So this head will learn to relate all these tokens by only watching this part of the embedding of this each token This head will learn to relate tokens by only watching this part of the embedding of each token And this last head will watch to we learn to relate tokens by only watching the last part Last 128 dimensions of the embedding of each token.
Why? In Many languages a word may have different meaning depending on the context in which it appears If we don't have multi-head attention because the multi-head attention we will see it later is based on what is known as What is a dot product? If we compute the dot product over all the all the Token then there is only way of calculating the dot product between two tokens Which is the full embedding of the first token with all the full embedding of the second So there is only one way of relating two tokens with each other By splitting each token into smaller groups Each dedicated to one head.
So this is head 1, head 2 and head 8 and all the intermediate heads are here We learn to relate tokens to each other differently because each head is watching different parts of the embedding of each token And this is useful for language modeling, for example, because in language modeling Especially for example in Chinese Each word may have different meaning depending on the context in which appears So it may be a noun in some context.
It may be a verb in some other context or an adverb in some other context, etc So we hope that this head here, for example learns to relate this token as a verb This head here will learn to relate this token as a noun and this head here Maybe will learn to relate this token as an adverb or some other property that this token has And this multi-head attention also has another advantage Because the multi-head attention is based on dot products between tokens This head here will do the dot product of this first 128 dimensions of this token with the first 128 dimensions of this token And this head because it watches this part of the token embedding and this other head watches this part of the Embedding they can work independently from each other And so because they can work independently from each other this computation can be parallelized That's why in the attention is all you need paper when they talk about the multi-head attention.
They make this Drawing with multiple drawings behind you can see here with the head dimension appearing here, which means that each of this head Is computing this scale dot product attention in parallel With the other heads because each of them is working with a different part of the embedding of each token So they can work independently from each other And this is what we are doing here.
So we group this This the embedding of each token into multiple subgroups Each dedicated to one head because we want this multi-head attention to happen in parallel Because each head is working with a different part of the embedding of each token And so it it becomes Much faster because we can compute all this stuff in parallel anyway What we have done in the code is as follows So we have taken our input sequence now here for the drawing.
I have chosen a 4 by 1024 but in the code it should be Depending on how many patches we have so numPatches by embedDimension We have multiplied each of them by the Q K and V And then we split them here as you can see in the In multiple heads, so we add this head dimension here in my slide I just pretend I am multiplying directly with a Parameter matrix that is already split into multiple heads Why am I doing differently here than compared to the code because we will be it will be useful for this Visualizing it this way is will be useful for when we will be Talking about the language model and especially we will be talking about grouped query attention Because with grouped query attention, we will see that the number of heads for the query Is much bigger than the number of heads for the keys and the values So here in the vision transformer the number of heads of the query key and values is the same So we don't use the grouped query attention and that's why We use the same number of heads for the query key and values Then we do this transposition and now we see what is this transposition So when you do this multiplication here, so you multiply the input by the Q projection.
It will return the same input shape When you do this view, it will just split this last dimension. So this embedDimension into smaller parts So it will become num It will become like this Uh patches by heads, so we are splitting This dimension into these two smaller dimensions. So numHeads by headDimension So basically, what is this headDimension?
headDimension is the embedding full embedding divided by the number of heads So this one imagine this is 1024 Then imagine this is 8 Then this will be 128 because it's 1024 divided by 8 and Because we are not reducing the number of parameters or we are not throwing away anything We are just grouping differently each of these embeddings With this transpose here, we are changing the position of the two Two dimensions which dimension the position the dimension number one and the dimension number two, which is the numPatches with the numHeads So basically we are doing numHeads and numPatches So this will be the output of all this expression.
So it will be a tensor of this Of this shape batchSize numHeads numPatches headDim. Why are we doing this transposition? Let's see so we have When we multiply by this wqwk and wv which is already includes the grouping. We are grouping each of these Vectors into sub groups each dedicated to one head Now what we have here is a sequence of tokens Each token is made up of eight group of embeddings.
Each group of embedding is made up of 128 dimensions what we want, however is because we want to compute the Multi head attention in parallel, which means that each head should be able to visualize The entire sequence but a smaller part of the embedding of each token We need to transpose these two dimensions.
So we exchange the sequence dimension with the head dimension and a way to visualize this is this that Let's do it. So we have this sequence of tokens each token is Divided into eight groups. Each group is made up of 128 dimensions. We want to convert it Into multiple sequences made up of only the part of the embedding dedicated to each token So when you do the transposition of these two dimensions here They become like this.
So 8, 4, 128 How can you visualize this matrix? You can visualize it as follows. It's a big matrix that contains eight smaller matrices each smaller matrices contains four tokens and each token contains 128 dimensions, which is exactly the dimensions That are dedicated to each of this head. So you can think of it as a sequence eight sequences where each sequence is made up of tokens and each tokens contain only the part of the embedding dedicated to each of the head that it's each of the eight heads It's composed of so this sequence here will only contain the first 128 dimensions of each token This sequence here will contain the next 128 dimensions of each token And the last sequence here will be a sequence of four tokens and each token will be made up of the last 128 dimensions of the initial tokens Why are we doing this?
Because now we can compute The multi-head attention using this stuff here Independently from this one independently from this one independently from this one because each head has a sequence of four tokens and each token is made up of 128 dimensions And we end up in what we saw here So we can compute this scale.product attention using the query key and values where the query key values are not the entire Embedding of the token but are only the part of the token dedicated to that specific head So this head here suppose the head number one will be using the first 128 dimensions This second head will be using the second 128 dimension.
The last head will be using the last 128 dimensions, etc. That's why we did this transposition: because now we can treat each head independently — each head is working with the four tokens, which is the sequence dimension, and each token is made up of the part of the embedding dedicated to that head. And this is why we do this transpose here. The next thing that we do in multi-head attention: well, we have these query, key and values.
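Before we compute the attention scores with them, here is a compact, self-contained sketch of how these query, key and value tensors were obtained (projection, view, transpose), using the illustrative sizes from the slides — 4 patches, 1024 dimensions, 8 heads of 128 dimensions each:

```python
import torch
import torch.nn as nn

batch_size, num_patches, embed_dim, num_heads = 1, 4, 1024, 8
head_dim = embed_dim // num_heads  # 128

hidden_states = torch.randn(batch_size, num_patches, embed_dim)
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)

# 1) Three transformations of the same input: still [1, 4, 1024], no mixing between tokens.
query_states = q_proj(hidden_states)
key_states = k_proj(hidden_states)
value_states = v_proj(hidden_states)

# 2) Split each 1024-dim token into 8 groups of 128: [1, 4, 1024] -> [1, 4, 8, 128].
query_states = query_states.view(batch_size, num_patches, num_heads, head_dim)
key_states = key_states.view(batch_size, num_patches, num_heads, head_dim)
value_states = value_states.view(batch_size, num_patches, num_heads, head_dim)

# 3) Swap the patch and head dimensions: [1, 4, 8, 128] -> [1, 8, 4, 128],
#    so each head sees the whole sequence but only its own 128-dim slice of every token.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)

print(query_states.shape)  # torch.Size([1, 8, 4, 128])
```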
What should we do? We should do query multiplied by the transpose of the key, divided by the square root of the head dimension d_k. And that's it. Yeah, so let's do it: let's calculate the attention weights, which is this one. We take the query multiplied by the transpose of the keys, where we are transposing the second and the third dimension. What are the second and the third dimension?
They are the numPatches and the head dimension, because the query is batchSize, numHeads, numPatches, headDim; to multiply it with the keys we need to exchange the last two dimensions, otherwise the multiplication doesn't work. We need this stuff here, then we need headDim and numPatches, such that — if you remember, in a matrix multiplication the inner dimensions cancel out and the outer dimensions remain — the headDim will cancel out and we will be left with numPatches by numPatches. So the output of this part of the multi-head attention is basically a matrix that is numPatches by numPatches for each head. Let me delete this one. I know it's not easy to visualize it like this.
So let's visualize it on the slides. What we are doing is multiplying the query with the transpose of the keys and then dividing by the square root of the head dimension — but we already computed that scale here: since it's stored as one over the square root, we just multiply by it, we don't need to divide. So let's visualize in the slides how this multiplication works. Okay, we already saw why we do the multi-head attention: because we want to parallelize the computation, etc.
So now what we are doing is we are for each head each head as we saw before is Made up of one sequence of embeddings where each embedding is not the full embedding of the token But it's a part of the embedding of each token. So it's a smaller embedding.
Let's say So each head basically will do the following matrix multiplication when you do query multiplied by the transpose of the keys Each head is made up of a sequence of tokens And each token is not the full embedding of the token, but it's the first 128 dimensions of each token When we do the transpose of the keys each of these row vectors becomes a column vector as you can see And when we do this matrix multiplication for each head we will be getting this Matrix as output which is Sequence by sequence because as you can see when you multiply this matrix here by this matrix here You get four by four matrix as output because the inner dimensions cancel out What does this matrix represent?
Each of these numbers represents the dot product of one token with another token So you can think of the rows as being the queries and the columns as being the keys This one here is the dot product of the first token of the queries suppose that the each of these tokens represent A sentence like I love pepperoni pizza Then this is the word I this is word the word love this is the word pepperoni and this is the word pizza Then this number here represents the dot product of the word I with itself So the first query with the first key This one here represents the dot product of the first query with the second key This one represents the dot product of the first query with the third key And we do all the possible dot products as you can see here now you Are and what does this matrix represent?
This represents somehow the relationship between two tokens So the bigger the dot product the more intense is the relationship between two tokens Actually, it's then defined later. We will see that we apply the softmax But you can think of the dot product as being how the self-attention mechanism is relating to tokens How intense is the relationship of these two tokens?
Why do we have this square root of the head dimension as the denominator? Because we want to keep the magnitude of these dot products under control: the more dimensions you sum over, the larger the dot products tend to grow, so dividing by the square root of d_k keeps their scale roughly constant. This also helps when you train multiple variants of a model, for example with different numbers of heads: you don't want the magnitude of these numbers to change between one configuration and the next. Now, what is this matrix doing? This matrix tells us how two tokens are related to each other. In language modeling we also apply what is known as the attention mask: we don't want the word "I" to be related to future tokens, so usually we don't want to compute this dot product, this dot product, and this dot product, because we don't want the token "I" to be related to any other token, since there are no previous tokens. We also don't want the word "love" to be related to the words "pepperoni" and "pizza", because they come after it; but we do want the word "pepperoni" to be related to the word "love", so for this one
So this this There should be a number here. So we don't want to mask out this one This is called a attention mask And how do we apply that basically? If we don't want some interaction between token to happen we can Calculate the matrix as usual So query multiplied by the transpose of the keys and then we replace all the numbers all the relationships that we don't want With minus infinity.
So here we can replace this number here with minus infinity Here we can replace this number with minus infinity and then we can replace this number with Minus infinity So that after we need to apply the softmax the softmax will convert each of these numbers into a probability score because we want the relationship of one token with other tokens to be Between zero and one and also we want each row to sum to one Later, we will see why because actually the when we do the contextualization we are doing a weighted sum, but okay Let's forget it about now Anyway, the point is we apply the softmax row by row.
So if we don't want the relationship of two tokens to be Considered by the attention mechanism. We replace that particular dot product with minus infinity before we apply the softmax Because the softmax we saw before is an exponential It's e to the power of x when e is to the power of minus infinity It will become zero.
So the output of the softmax will become zero for all the interactions that we didn't want, and that's why we replace them with minus infinity. Now, let me put back whatever we had before. Okay, so this is where we apply the mask: as you can see, we apply the mask (with minus infinity) before the softmax, so that after the softmax all the interactions that we don't want end up as zero. This matrix here is known as the attention weights: it tells us how intense the relationship between two tokens is. And this matrix is calculated independently for each single head — here I show you only one matrix, 4 by 4, but we have eight of them, and each of them is calculated in parallel. So you need to think that you have eight of these matrices if you have eight attention heads. And in the code you can see that the output also has a batch dimension, because maybe we have multiple images; each of these images is managed by multiple heads, and each of these heads will learn to relate tokens differently. So each of these heads will give us a numPatches by numPatches matrix, or sequence by sequence matrix, where each number represents how this head is relating two patches with each other. So now we have seen how to calculate these attention weights, which basically form a matrix that tells you how two tokens are related with each other — it's kind of a score of how the attention mechanism thinks two tokens are related to each other. We continue our journey. The first thing we do:
Okay, we verify the dimension of this matrix, and then we apply the softmax. The softmax, as we saw before, is a way to convert these attention scores into numbers that are between 0 and 1 and that also sum up to 1, and we do it with the softmax function, which is applied by rows.
This is a What is the meaning of this dimension parameter which tells you how you want to apply it? So we are applying it to the last Dimension you can think of this as the row dimension. This is the column So if you apply it on entire all the columns, it means you are applying it by rows then we have the dropout but as I said before we don't use the dropout because I didn't see it in the parameters of the polygamma ever being used.
So we have it, but we don't use it And as you remember the dropout basically takes random With the probability p it will set some activations to zero So some numbers of this input matrix to zero, but we don't use it And it only happens during training and it's a way to reduce overfitting But as it's not used The next thing that we do in the multi-head attention is we are multiplying this attention weights matrix with the v sequence the value sequence So we multiply this matmul means matrix multiplication We are multiplying this attention weights with the value states, which is the value sequence which is a transformation of the input sequence through this wv matrix and also by grouped by Heads, let's visualize this operation so Let's go here so the output of the attention mechanism of the query multiplied by the keys is this matrix here where each number represents the How two tokens are related to each other by applying the softmax this number become between zero and one in each row And also in such a way that they sum up to one So here you can see it's 1.0 because there is only one number here.
It's 0.4 and 0.6 So they sum up to one and here is 0.2, 0.4, 0.4. So they sum up to one etc, etc Now when I say that these numbers represent the intensity of how the attention mechanism relates to token is because now when we multiply This matrix here, which is in the code is written as attention weights We multiply it by the v matrix.
So the v sequence for the value sequence We are computing a weighted sum. Why? When we do this matrix multiplication We are multiplying for example a 4 by 4 matrix by a 4 by 128 matrix Where each of this v matrix is one for each attention head just like each of this matrix here Attention weights is one for each attention head.
So each of these attention heads will be doing this Product in parallel. So each attention heads does query multiplied by the transpose of the keys in parallel the softmax in parallel and this multiplication with the v matrix in parallel I mean not these operations in parallel. It's the attention heads that work in parallel.
The operations are sequential, of course now What is the output of this Product it's a 4 by 4 multiplied by 4 128. So the output is a 4 by 128 because the inner dimensions cancel out and the outer dimensions remain Let's analyze this output matrix here So it will be a matrix with four tokens each token represented by not the full dimensions But because we are working with multi-head attention each head will have a smaller part of the embedding of each token So it will have 128 dimensions in case we have eight heads and the embedding dimension is 1024 this first number here will be the Will be the dot product of the first row of this matrix with the first column of this matrix And as we can see from this row here All the values are zero except the first one which means that only this token here will contribute to the output here, which means that this and The second number in this matrix here So this stuff here will be the dot product of the first row of this matrix with the second column of this matrix But most of the values here are zero except the first one Which means that only this token here will contribute to this second number here So all the dimensions in this row will be contributed only by the first token multiplied each The dimension of the first token multiplied by the number one Because all the other tokens will be multiplied by zero zero and zero Let's look at the second row of this matrix here this one here the first number So the first dimension of the second row of the output Matrix will be the dot product of the second row of this matrix with the first column now The first two numbers are non-zero and the second two numbers are zero Which means that only the dimensions of the first two tokens will contribute to this output embedding For each of these dimensions So for all the dimensions here will only be contributed by the first two tokens because all the other tokens Whatever there is now the number is here They will be multiplied by zeros.
So they will not contribute to this output embedding That's why we can say that this is a contextualized embedding In which the contribution to this contextualization only comes from the first two tokens How are they these two tokens contributing? Well each of these numbers in the second Token will be multiplied by 0.4 and each of the number in the first token will be multiplied by 0.6 This you can see it as the first token contributing 60 percent of the information to this contextualization and the second token contributing 0.4 to this 40 percent to this contextualized embedding And you can do the same for the third output So this output here the first number will be the dot product of this third row Multiplied by this first column and as you can see here, we have a zero because of the causal mask Which means that only the first three tokens will contribute to the third embedding here How much each token will contribute?
Well, it depends on how these numbers are distributed: the first token will contribute 20 percent, the second token will contribute 40 percent and the third token will also contribute 40 percent. That's why, when we talk about the attention weights matrix, we say the attention mechanism is telling us how intense the relationship between two tokens is: the more related a token is, the more it will contribute to the output embedding. So if the words, let's say, "pizza" and "I" are very related to each other, then the embedding of the word "I" will contribute the most to the output embedding of this fourth contextualized position. It means, for example, that the fourth contextualized position could be 40 percent based on the information contained in the token "I", 20 percent on the information contained in the word "love" and 30 percent on the word "pepperoni", etc. So this is why it's known as a weighted sum: because you are summing the contribution of each token (if it's not masked out), weighted by the attention score calculated in the attention weights matrix here. And we do this for each head in parallel. Each head is watching a part of the embedding of each token, learning to relate them differently and doing this weighted sum differently. Each head will output a list of contextualized embeddings, but each of these contextualized embeddings will not be a full token, it will be a part of the full token. And now we will see how we can merge the results of this multi-head attention, and for that we need to look at the original paper.
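Before that, here is a compact sketch of the per-head computation we just walked through — scores, optional mask, softmax, weighted sum — with the same illustrative shapes; the causal-mask branch is there only to mirror the language-model example, since the vision encoder does not use it:

```python
import torch

batch_size, num_heads, seq_len, head_dim = 1, 8, 4, 128
query = torch.randn(batch_size, num_heads, seq_len, head_dim)
key = torch.randn(batch_size, num_heads, seq_len, head_dim)
value = torch.randn(batch_size, num_heads, seq_len, head_dim)
scale = head_dim ** -0.5

# Q @ K^T, scaled: [1, 8, 4, 128] @ [1, 8, 128, 4] -> [1, 8, 4, 4] attention weights per head.
attn_weights = torch.matmul(query, key.transpose(2, 3)) * scale

use_causal_mask = False   # True for a language model, False for the vision encoder
if use_causal_mask:
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    attn_weights = attn_weights.masked_fill(mask, float("-inf"))  # forbidden interactions -> -inf

# Softmax row by row: each row becomes non-negative and sums to 1.
attn_weights = torch.softmax(attn_weights, dim=-1)

# Weighted sum of the value vectors: [1, 8, 4, 4] @ [1, 8, 4, 128] -> [1, 8, 4, 128].
attn_output = torch.matmul(attn_weights, value)
print(attn_weights.shape, attn_output.shape)
```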
So if you look at the original paper: we calculated this multi-head attention in parallel, and how can we merge its results? We go here and we basically concat these heads: we take the output of the first head, we concat it with the second, with the third, with the fourth, the fifth, etc., all the heads, until we get the full dimension of the original token back. Each head is made up of 128 dimensions in our example, so this will be the first 128 dimensions, then the next 128, then the third 128, etc., until the last 128 dimensions, and we get the 1024 dimensions back. And we do this stuff.
Let's go back here here, so each head Will return a contextualized embedding for each position, but it's a contextualized Embedding That does not include all the original token contextualized but a part of it because each head is working In parallel with a part of the embedding of each token, then we concatenate them.
So What we do is we basically we want to arrive to this stuff here. So we have a contextualized embedding Here one for each of the heads Okay, first we need to do I believe a transposition so we need to transpose back because before We transpose right? So we We put the head dimension first and then the sequence dimension So now we need again the sequence dimension and then the head dimension after so that each We go from this configuration Which is for each head.
We have a contextualized list of tokens We want to get a list of tokens in which each Head is contributing its 128 dimensions, which are contextualized Embeddings, smaller embeddings, let's say So let's do this transposition also in code I believe it's here. So I think there is another checking of the output dimension We transpose back So we do this transposition back.
So we did the first transposition here to exchange the Number of heads with the sequence dimension. Now we transpose back So we go back to the num_patches and num_heads So it's a sequence each sequence is made up of smaller Eight groups or num_heads group and each Head is made up of head dimension dimensions We do this contiguous because we want to reshape.
Okay, you don't strictly have to know why we do this contiguous, but basically contiguous means that we want the tensor to store its information in memory contiguously, so that the next operation we are going to do, the reshape, does not require any computation. When you do a reshape or a view of a tensor, there is no change in the memory layout of the tensor; PyTorch will just change what is known as the stride of the tensor. We are going a little off topic, but the stride tells you how to go from one dimension to the next without changing how the tensor is laid out in memory. So when you do a view or a reshape, PyTorch just changes these stride numbers.
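Here is a small, purely illustrative snippet showing the stride/contiguous point: a transpose only changes the metadata, so the result is no longer contiguous and needs contiguous() before a view.

```python
import torch

x = torch.randn(2, 3, 4)
print(x.stride())            # (12, 4, 1): how many elements to skip per dimension
y = x.transpose(1, 2)        # swaps dims by changing metadata only, memory is untouched
print(y.stride())            # (12, 1, 4) -> no longer a contiguous layout
print(y.is_contiguous())     # False
# y.view(2, 12) would raise an error here, because view needs a contiguous layout,
# so we first materialize the transposed tensor in memory:
z = y.contiguous().view(2, 12)
print(z.shape)               # torch.Size([2, 12])
```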
Okay I will do another video on how this works But anyway, but this contiguous allow us to have this tensor all in the memory as a contiguous memory allocation So that this reshape operation can be done without Without computational overhead Now let's get back on track. So We did a reshape operation in the slides So after we have to do a reshape, we did the transpose operation and now we need to do a reshape operation So the transpose operation basically allow us to get again at the first dimension the sequence dimension Then the grouping of the group of dimensions of each token And each group contains 128 dimensions.
Now we need to concatenate them. How can we concatenate them? Well, we just want to merge these heads back together into one single token, and we do that with this reshape operation. With reshape we are basically going from num_heads × head_dim to embed_dim, which in this case is 1024. How does the reshape work? PyTorch will take each of these groups and just merge them.
So it will just concatenate them with each other. So instead of being a matrix that contains sub-arrays where each sub-array contains multiple sub-arrays and each of these sub-sub-array contains 128 dimensions, it will just become a matrix that contains one array that is made up 1024 dimensions, which is the concatenation of all these heads So this is how we merge the information of all this multi-head attention that was done in parallel into one single Token that is a contextualized version of the initial token So we as you can see we got back the initial shape So we started with before at the beginning of the multi-head attention.
We started with 4 by 1024 Input sequence and we end up with 4 by 1024 Sequence There is one last part that we need to do that is Multiplication with this WO. So if you look at this concatenation that we have done The concatenation basically takes the this tensor this first token here Is just the concatenation of the first 128 dimensions, which are the output of the first head then the second 128 dimension Then the third 128 dimension and then the last 128 dimension.
In total there are 1024 dimensions But there has been no mixing between the result of these heads. So it's just a concatenation of multiple of independent calculations Each calculation done by one head independently from the others But we want the token to not be a concatenation of independent calculations We also want to kind of mix the result of these heads with each other And the mixing happens when you do this multiplication by WO.
The WO matrix is a matrix that is embedding size by embedding size Which basically As you can see does not change the shape of the input. So we have The input of this WO will be a 4 by 1024. We multiply by 1024 by 1024. So it results the same input shape But it will be because Let's look at this number here.
This number here is the dot product of the first row So the first token with the first column of this matrix And the first column of this matrix is 1024 parameters. So all of these heads, so the 128 dimensions of the first head 128 dimensions of the second head, etc, etc Will all participate in the same dot product giving up one single number here So there have been a mixing of the results of this head.
If we don't multiply with the WO there is not There is no mixing between the result of each head which happened independently in parallel And that's why we multiply it by WO So we don't want each token to be a contextualized version of multiple subtokens each calculated independently from each other by the multi-head attention We want of course it to happen because we want to parallelize But then we want to mix the result of this multi-head attention and we do that by multiplying by WO and now let's do it so For now, we just merge.
So this reshape is basically doing the concat that we saw before in the attention paper Now we do the multiplication with the WO which is this stuff here. So out projection It won't change the shape of the tensor that is input to it And then we return it along with the attention weights.
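As a recap of the last few steps, here is a minimal sketch, with assumed sizes (eight heads of 128 dimensions), of the transpose, reshape and output projection just described. It is illustrative, not the exact module from the video.

```python
import torch
import torch.nn as nn

batch, num_heads, seq_len, head_dim = 1, 8, 4, 128
embed_dim = num_heads * head_dim  # 1024

# Output of the attention-weights @ values step: one small embedding per head per token.
attn_output = torch.randn(batch, num_heads, seq_len, head_dim)

# 1) Put the sequence dimension back before the head dimension.
attn_output = attn_output.transpose(1, 2)                      # (batch, seq_len, num_heads, head_dim)
# 2) Concatenate the heads: this is the "Concat" box in the original paper.
attn_output = attn_output.contiguous().reshape(batch, seq_len, embed_dim)
# 3) Mix the independent head results with the W_O projection (shape is unchanged).
out_proj = nn.Linear(embed_dim, embed_dim)
output = out_proj(attn_output)
print(output.shape)  # torch.Size([1, 4, 1024])
```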
Actually, we will not be using the attention weights. And now, finally, we have implemented the multi-head attention. I just realized we forgot something, guys: we forgot to implement the encoder. We created the layer of the encoder, but we didn't create the encoder itself. So what we created so far in this vision transformer is this stuff here.
So let me open the slides We created one single layer like this one But we didn't create the sequence of these layers because an encoder is a sequence of these layers. So let's do it It's it's very simple. So this is a single layer But we need to create a sequence of them because we apply one after another such that the output of one is Used as input for the next one.
It's a very simple class, so let's create it. Let's create the constructor: we save the configuration, then we create a sequence of layers, where each layer is this encoder layer to which we pass the configuration. How many we create depends on how many transformer layers the model should have. And the forward is very simple.
I can just copy it all. It basically says: we take the input, we give it to the first layer, and the output of that layer becomes the input of the next one, so we do a for loop and return the output of the last layer. It's very simple, and as you can see, between layers there is no change in the shape of the tensor that is fed in. I believe we have now coded all of SigLIP.
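A sketch of the encoder wrapper just described, assuming a vision config with a num_hidden_layers field and the SiglipEncoderLayer class coded earlier in the video (the names are assumptions based on that context):

```python
import torch.nn as nn

class SiglipEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # One encoder layer per transformer layer in the configuration.
        self.layers = nn.ModuleList(
            [SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)]
        )

    def forward(self, inputs_embeds):
        # The output of each layer becomes the input of the next one;
        # the tensor shape (batch, num_patches, embed_dim) never changes.
        hidden_states = inputs_embeds
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states
```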
So that is our vision transformer. You may think that I lied to you at the beginning, when we were talking about contrastive learning. Actually, let's look at it, otherwise we'll be left with a doubt. When we were talking about contrastive learning, we were talking about generating one single embedding for each image, but here we are generating a sequence of contextualized embeddings. So how can we get one single embedding for the whole image?
Well, the transformer is a sequence-to-sequence model: you give it a list of patches as input and it gives you a sequence of contextualized patches as output. When working with something like CLIP, if you want only one single embedding per image, you can take the first output contextualized embedding from the transformer as the representative of the whole image, because this forces the model to put all the information into that first contextualized embedding. That's one way to do it. Another way is to take the average of all the output embeddings of the transformer to generate one single embedding (see the small sketch right after this paragraph). Anyway, this was just a closing note before we move to the next part, which is our language model. So let's go back to the architecture, which is here. We have coded this part here, the vision encoder: we feed it an image, it extracts some patches, each of these patches becomes an embedding, to this embedding we add a learned positional encoding, we send it to this magic box called the transformer layer, which contextualizes them, and the output of this contextualization becomes our image embeddings. Now, before we can feed them to the language model, these embeddings may not be of the same size as the embeddings used by the language model, so we will need to introduce this linear projection. In the next part of the video we are going to code the language model, including this linear projection, and we will learn how to merge the image tokens and the text tokens. Okay, let's start. The next part that we are going to code is how to load the image from disk and convert it into a tensor, and also how to tokenize the text. We will see that the preparation of the text has to be done in a particular way; let's see why.
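As promised, a tiny illustration (not from the video's code) of the two pooling options mentioned a moment ago for collapsing the patch embeddings into a single image embedding:

```python
import torch

patch_embeddings = torch.randn(1, 256, 1024)        # (batch, num_patches, embed_dim)

# Option 1 (CLIP-style): use the first output position as the summary of the whole image.
image_embedding_first = patch_embeddings[:, 0, :]   # (1, 1024)

# Option 2: average all output positions.
image_embedding_mean = patch_embeddings.mean(dim=1) # (1, 1024)
```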
So let's open the slides Oops, I think I closed it. So let me open it again All right So as you can see, we need to find a way to combine the image tokens with the text tokens So first we need to tokenize the text But we need to create some placeholders for where we will put the image Tokens before the text token.
So I will use the terms image tokens and image embeddings interchangeably, because you can think of the image embeddings as tokens that represent the image, and the text embeddings as the ones that represent the text, that is, the prompt from the user. The first thing we need to do is learn how to load the image into a tensor, because, as you can see from our SigLIP code, the input to SigLIP is a tensor with channel, height and width dimensions, which is then transformed into patches and contextualized, etc. Then we need to tokenize the text.
We need to create this list here But we we will create first a list of tokens Each corresponding to the text tokens and then we will add some placeholders for where we will put the image tokens and then it will be the transformer that will Take these placeholders and replace it with the image.
So I know it's a lot of things to remember, but don't worry: let's code it and we will see it step by step. Let's go. We create a new file called, let me check here, processing_paligemma.py. We do some imports, and we create these two constants; later we will see why we need them, for now just create them. Okay, let's start from the beginning.
So let's create this class called PaliGemmaProcessor, this stuff here. It has a constructor, which is this stuff here: it takes as input the tokenizer, how many image tokens we need to generate for the image, and what image size this particular PaliGemma variant works with. We save it.
We save these two values, and then we need to add some special tokens to our tokenizer. Now I'll show you why we need to do this and how it works. The tokenizer that PaliGemma uses is the tokenizer of the Gemma model, but the Gemma tokenizer was not created with the special tokens for the image.
So what they did is they created some additional tokens, because PaliGemma can be used for multiple purposes. What we saw in my slide is PaliGemma extracting information from an image: we have an image, we have a prompt, and PaliGemma, which is basically the Gemma model here, decodes the response by interpreting the prompt and using the image as additional information. But PaliGemma can do much more than this: it can also do image segmentation, so it can segment a part of the image, and it can do object detection, so it can detect all the instances of, for example, a tree. If we do object detection for trees, it will give us a box like this one, telling us that this is a tree, and if we ask it to detect other objects it will give us more boxes, one per instance, etc. PaliGemma does this by using special tokens: for segmentation they are called segmentation tokens, and for object detection they are called location tokens.
They are called local location tokens And but we will not be using them. So our goal here is just to inference polygamma So we will not be working with the object detection or object segmentation But if you want more information on how these tokens work, there is a very nice article Not only this one from google.
Here Google says that PaliGemma uses the Gemma tokenizer but extends it with further tokens that are used to express, in the output of the model, the positions of the bounding boxes it has detected or the locations of the segmentation masks. Another article that I recommend is the Hugging Face blog article about PaliGemma; let me find it.
I believe it is this one here, in which they describe how these tokens work. As you can see, PaliGemma can detect the cat and gives us this output, which contains loc tokens, for example loc0094 and loc0256, and these numbers tell us the coordinates of the corners of this bounding box. We will not be using this here, because we are only interested in using PaliGemma as a conditional model, generating an output conditioned on the image we feed it. But because the tokenizer used by PaliGemma adds these special tokens, we also add them here, and how many to add is described in this article, as you can see here.
So basically we have 1024 location tokens for object detection and then 128 tokens for object segmentation. Okay, we save the tokenizer. Then what do we need to do? We also need to create this constant called image token. What is this constant? Basically, when we process our text with the Gemma tokenizer, the tokenizer will of course only generate the tokens for the text, but later we also need to insert the image tokens among them. So what we do is insert some placeholder tokens that will later be replaced by the embeddings extracted by the vision encoder, and the placeholder token we will be using is this image token here. We also add it to the tokenizer.
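A sketch of how these extra tokens might be registered using the standard Hugging Face tokenizer API; the checkpoint name and the exact token string formats are assumptions taken from the articles mentioned above, not something this transcript spells out.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for the 224 variant.
tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")

IMAGE_TOKEN = "<image>"
# Register the image placeholder as a special token.
tokenizer.add_special_tokens({"additional_special_tokens": [IMAGE_TOKEN]})

# 1024 location tokens for detection and 128 tokens for segmentation.
extra_tokens = [f"<loc{i:04d}>" for i in range(1024)]
extra_tokens += [f"<seg{i:03d}>" for i in range(128)]
tokenizer.add_tokens(extra_tokens)

# The id we will later use to find the placeholder positions.
image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)
```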
Now, how do we use this PaliGemmaProcessor? It is a special class that, given a text, which is the prompt of the user, and an image, will load the image and preprocess it, so resize it, rescale it, whatever the vision model needs, and will create the text tokens with the placeholders for the image tokens. So let's do it.
So let's do it We create this Method here the call why we create the call method. Well, basically this allows the the instance of the processor To to be called like a function So when you create the processor you will we will create it like this like polygamma processor and then we can use it like this Passing the arguments here.
So this is why we implement the call method And the call method takes as input a list of text and the list of images but we will actually only accept one text and one images because I don't want to deal with the Padding otherwise, it will complicate our code.
Our goal is not to make it universally perfect; our goal is to learn how it works by doing. This code will actually be usable, so we will run inference later, but it will only work with one image and one prompt at a time. That's fine, because later I will try to make the code for fine-tuning this model, and we will see that we will change this code a little bit to accommodate padding. Anyway, we need to process these images, and we will use a method called process_images: we take each image and resize it to the image size accepted by this PaliGemma version.
As you can see, there are multiple PaliGemma weights, but the 224 version resizes the images to 224 by 224 and generates 256 tokens for each image. Then we rescale this image, and later we will see why, and then we normalize it using the mean and the standard deviation, nominally of ImageNet. It's not really the ImageNet mean and standard deviation, but later we will see how it works. Anyway, suppose that this method will load the image, rescale it, normalize it, etc., and convert it into a tensor that can then be processed by the vision model. We do that here: we create a tensor here.
So because this will Return a list of tensor. We need to create a one single tensor with the batch size So we stack them stack basically means that if we have a list of tensor, it will create one single big tensor Where it adds one another dimension called the batch size one So instead of becoming a list of tensor it will become one big tensor This is a NumPy tensor it is converted into a torch tensor And then we Create the input to the model.
So later we will expand this method. So now I just create them What is this method going to do? Well, this method is going to Let's check here. It's going to create the tokens of the text and create the placeholder for the image tokens and then We tokenize it using the placeholder tokens for the image And then we return it.
So now let's expand This stuff I know that I have copied a lot of code. Now, I will explain it one by one So let's start at input. We have a list of text and the list of images. Let's process these images So let's create this process image function What is it going to do?
Let's copy it. It's very simple actually Okay, the process image takes as input a list of images what is the size that we want of these images What is the kind of resampling that we want to do when resizing this image? You can do linear, you can be cubic, etc Rescale factor if we want to rescale this image and the normalization mean and the standard And this has the same meaning as the normalization that we do in the neural networks.
We want the image, no matter what it represents, to have more or less the same distribution, centered on zero with a variance of one. The way we do it is: we take the image values, the tensor, subtract the mean of all the images in our dataset and divide by their standard deviation; usually we use the mean and standard deviation of the ImageNet dataset. I don't know why in Hugging Face they use 0.5, because the ImageNet values are not exactly 0.5, only very close to it, but apparently it works anyway.
So one for r one for g and one for p So what is this Function going to do first it resizes the image by using this resampling method Then it will convert the image into a numpy array Then it will rescale it so that the pixel values instead of being between 0 and 255 will be between 0 and 1 Then it will normalize using the mean and the standard deviation of image net And then it will move the channel dimension to be the first dimension.
So Instead of being a height width channel, it will become channel height width Let's implement this very simple method. So there is first the resize The resize is just going to resize the image using the methods already implemented by The pill library. So this This one called the python imaging library So it will take the image and it will resize it using this resampling method Then we have this rescale The rescale is just going to rescale the image So it will convert each pixel value instead of being between 0 and 255.
It will rescale it into Between 0 and 1. Why? Because as you can see here, we pass a scale factor of 1 over 255. So that's why we are multiplying it by this scale The next thing that we are doing is normalizing normalizing means that we want the each of these values to be distributed like it's coming from a Gaussian of mean 0 and variance of 1 and we do it by Subtracting the mean and dividing by the standard deviation as you can see here I believe we have already implemented everything for the process images Now, let's go further.
So we have these images we are processing them. So they are still a list of images We convert them into they are converted into a list of numpy arrays and we do that here As you can see first we convert them into numpy arrays then we rescale, normalize, transpose So we have a list of numpy arrays This list of numpy arrays is converted into a single tensor instead of being a list of tensor is becoming one big tensor And then we convert it into a torch tensor.
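A sketch of the preprocessing pipeline just described (resize, convert to NumPy, rescale, normalize, channel-first), assuming PIL images and mean/std passed in as tuples; the file name in the usage comment is purely hypothetical.

```python
import numpy as np
from PIL import Image

def process_images(images, size, resample, rescale_factor, image_mean, image_std):
    processed = []
    for image in images:
        image = image.resize((size, size), resample=resample)     # PIL resize
        arr = np.array(image).astype(np.float32)                  # (H, W, C)
        arr = arr * rescale_factor                                # [0, 255] -> [0, 1]
        arr = (arr - np.array(image_mean)) / np.array(image_std)  # roughly zero mean, unit variance
        arr = arr.transpose(2, 0, 1)                              # (H, W, C) -> (C, H, W)
        processed.append(arr)
    return processed

# Usage sketch (hypothetical file name):
# pixel_values = process_images([Image.open("example.jpg")], size=224,
#                               resample=Image.Resampling.BICUBIC,
#                               rescale_factor=1 / 255.0,
#                               image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5))
```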
This torch tensor Is the pixel values that will be fed to the model to the image encoder Now we need to take our text And we need to tokenize it but we need to tokenize it by already accommodating for the position in which we will put the image embeddings And we do that by processing this each of this text through this function called add image tokens to prompt which as the name implies We'll add this image token placeholders to the prompt And the way it's done is here It's very simple actually also We can Save it here.
It's a long comment because I found a little quirk in this one, but I'll explain later. Basically we add some image token placeholders. How many? It depends on how many image tokens this model needs: in the case of PaliGemma 224, I think it's 256; later we can check.
I think it's in the config.json. Let's go here 256 Image tokens Then we add the beginning of sentence token and then we add the prompt of the user. It's called the prefix prompt How did I come up with this function I didn't come up with it I copied from Hugging face implementation, but how did hugging face come up with this actually?
It's from the PaliGemma paper. If we go to the paper, here they show how to prepare the input for the model. We have a list of image tokens, then we have the prompt of the user that tells the language model what to do with the image. As you saw in the example in the introduction, the prefix here is this one: we want the language model to tell us where the photographer is resting by looking at this image, and the model will generate this output. This is called the prefix, and the input is built by first taking the image tokens, as many as this particular size of PaliGemma needs, then the beginning of sentence token, then the tokens of the prefix, which describe the task we want the language model to perform, and then a separator, which is the "\n" token, the newline character.
So it's the new line new line character So we have this beginning of sentence token. So then we have the token the the task The the prompt by the user based on what task we want the language model to do and then we have the separator token Which is a slash n now in the paper.
They say that they tokenize the Token separately so the slash n needs to be tokenized separately from the rest of the Input because we don't want the slash n to be merged with this with the With the prompt by the tokenizer, so as you know the tokenizer will convert a sequence of Characters into tokens and if in the dictionary of the The language model there is one character suppose that we ask the language model to tell me where is the photographer And suppose that the in the and then we have this new line suppose that in the vocabulary of the Language model there is a token that is like this.
So rougher And escape and it will become one single one single token So suppose that this one becomes the token number three and then there is another token that is a space protog It becomes the token number five and then the token the d is another token. So it's the token number six, etc So we don't want the escape and to be merged with whatever comes before it So they in the paper, they recommend to tokenize it separately.
So that's why I I wrote this Comment here to to note that it should be tokenized separately, but I don't know why in hanging phase they do it Without tokenizing it separately It could be a bug or it could be some other indication that I am missing So I just write it now later.
I will investigate and probably ping the hanging phase team But for now, we just need to think how we prepare the input So the input is prepared like this a number of input image tokens What is each of this image token? It's this placeholder token that we created here this image token how many of them depending on the size of the model and we have this beginning of sentence token and then we have the Prefix the prompt of the user and then we have the slash n.
We take all of this and we tokenize it Using our tokenizer And we return this stuff here. So we return this input Which is the input IDs and the attention mask that will be generated by the tokenizer In this case, we are not using any padding. So the attention mask will be just a list of ones So what is the input IDs?
As you remember tokenizer converts the text into A list of numbers where each number represents the position in the vocabulary of each token So these are not embeddings. These are just input IDs So it's a list of numbers where each number represents the token position in the vocabulary So imagine our vocabulary is made up of words So the word hello the sentence hello world may be tokenized as follows so world It may be tokenized as a list of two tokens, for example, three tokens For example, the first one corresponding to the word hello Then the one corresponding to the space and then one corresponding to the word world Suppose it's the token number nine.
These are called input IDs: not embeddings, just one number per token. Then the embedding layer converts them into embeddings, one vector per token, with, suppose, 1024 dimensions: 1024 dimensions for the first token, another 1024 for the second token, etc. So this is how we prepare the input.
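A tiny illustration of the difference between input IDs and embeddings (the IDs and sizes below are made up for the example):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 16, 1024
input_ids = torch.tensor([[7, 4, 9]])            # hypothetical ids for "hello", " ", "world"
embedding_layer = nn.Embedding(vocab_size, embed_dim)
embeddings = embedding_layer(input_ids)          # (1, 3, 1024): one vector per token
print(embeddings.shape)
```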
So for now, we have resized the image converted into a tensor Then we have taken our prompt. We have added some placeholder tokens for the image then we have Added the prompt of the user and then the slash and character as indicated by polygamma And now our processor will return this stuff.
Now we need to understand what to do with this stuff, so we need to code our language model. All right guys, let's continue our journey by creating another file here called modeling_gemma.py, which will be our language model: the model that decodes the answer using the prompt given by the user and the image that we provide as input. So we create this file.
We import a bit of the usual stuff, so torch, some math, typing, and then we import the SigLIP model that we created before, the vision model, and the configuration it needs. Let's take a top-down approach, meaning we first create the overall structure of the model and then each single component. So let's do it, this one.
All right, our main class will be called PaliGemmaForConditionalGeneration. Why is it called conditional generation? Because we are conditioning the generation of text on the image provided as input. It's also reflected in how we create the attention mask, as we will see later: we attend to all the tokens of the user's prompt and all the tokens of the image without any causality, so they are used like a condition. But we will see that later.
So The constructor accepts a configuration file, which we are going to create now It will create an instance of the vision model. So the encoder of the image it will create this multi-modal projector Which is a linear layer. Let's actually visualize it all these components So we go here and then we open this stuff.
So basically the multi-modal projector is this linear layer you can see here, the linear projection; the vision model is this contrastive vision encoder; and then we have Gemma for causal language modeling, which is our transformer decoder. So this class, PaliGemmaForConditionalGeneration, is the class that connects all these components together. I don't know why my iPad pen is not working... oh, now it's working.
It looks like so Yeah now it's working. Okay, let's continue All right, so we have created this it will create an instance of the language model It will save some stuff like what is the language model? What is the vision tower, which is the image encoder the multi-modal projector which is the linear layer that will convert the size of the embedding output by the Vision encoder into the size of the embedding of each text token so that they can be concatenated with together We also save the padding token We need to create another method called tie weights and we will see later what is this about Or actually we can check now what this is about so tie weights basically means this so let's go back to our Here and let's open the attention mechanism.
And actually let's open the transformer model so weight tying is a technique for kind of reusing the parameters of one layer into another And specifically in the case of language model most language models are in decoder only language model Which means that they are only made up of this part of the transformer without the cross attention So there is no this block here So it's they are made up of a self-attention with the normalization then a feed forward with the normalization a lot of layers like this so one after another then we have a final linear layer that projects the embedding output by these layers into Logits, and then we have the softmax to understand which of these tokens has the maximum Probability score given by the language model now in this the job of this linear layer is basically to convert the embedding of the Contextualized embedding output by the last layer of this series of layers Into the vocabulary size, which is exactly the opposite that this job Layer is doing so the embedding layer the embedding layer is converting the token ids So the position of each token in the vocabulary into an embedding while this Linear layer here is doing exactly the opposite converting an embedding into its position in the vocabulary so Many language models not all of them use a technique called Weight tying which basically shares the parameters of this layer and this layer because they are doing basically one the inverse job of the other Which is also a technique actually to reduce the total parameters of the model because if you are sharing these parameters you will You will reduce the number of parameters And in many language models this depending on the vocabulary size These parameters can be actually quite expensive on the overall total number of parameters of the model So it could be like 10% of the parameters in this layer here So if you are sharing them, you are actually reducing the number of parameters Let's say by 10% because depending on the how many Tokens you have in your vocabulary So we created this method here tie weight and later we will implement it also in the language model So in the gamma decoder language model That will tie the weights of these two layers Okay, now that we have seen also this one.
Let's go further, which is the implementation of the forward method. So So we implemented the forward method as follows so it accepts the input ids What are the input ids? The input ids will be the input ids extracted from this Polygama processor which will be the Some image tokens.
So a lot of tokens like this: image, image, image, image, how many depending on the size of PaliGemma we are using. Then it contains a beginning of sentence token, then the prompt of the user, for example "tell me where is this photographer", and then the token corresponding to the newline character. Then we have the pixel values, which is the image loaded by the PaliGemmaProcessor, resized, rescaled and normalized with the ImageNet-style mean and standard deviation, converted into a tensor and provided as is. The goal of PaliGemmaForConditionalGeneration will be to take this image and feed it to the image encoder to extract the image tokens. Then we have this attention mask.
The attention mask is provided directly by the tokenizer: whenever you tokenize text, the tokenizer gives you two outputs, the input IDs and the attention mask. Because we will not be using any padding, the attention mask will be a series of ones. Later we will see how the attention mask would need to be modified, but we will actually not modify it, because we are not using any padding. Then we have the KV cache, which we will talk about later when we actually use it; for now just treat it as something unknown. Okay, so let's see. First we make sure that we are not using any padding, because I didn't implement the code to manage padding. Then we extract the input embeddings of the text tokens and of the image placeholder tokens. We added a special image token, this token here, which gets converted into an input ID, a number corresponding to its position in the vocabulary. So we convert all the input tokens, the image placeholder tokens, the beginning of sentence token, the tokens of the prompt, plus the newline character, into embeddings. Of course the embeddings produced at the image placeholder positions are junk: we will not use them, because they do not correspond to actual image features, and later we will replace them with the correct ones. Now that we have these input embeddings, the first thing we do is extract the features of the image, and we do it like this: we feed the pixel values of the image, which is a tensor, directly to the vision tower.
The vision tower is our SigLIP vision model, so we are using its forward method: we feed the pixel values and it extracts, for each image, n patches, each of them a contextualized embedding. The second thing we do is resize these image embeddings to the same size as the language model embeddings, and for that we have this other line: we take the image embeddings extracted by the vision encoder and resize them using a linear layer called the multi-modal projector. Later we will see that this is just a linear layer converting the embedding dimension of the vision encoder into the hidden size, the embedding size used by the language model for each of its tokens. Now we need to merge the tokens extracted from the vision model with the text tokens, whose embeddings already contain placeholders for where we should put the image tokens. For that we will create another method, let me first paste it, called merge input ids with image features, to which we pass the image features extracted from the vision encoder, the input embeddings from the language model (which contain the placeholders), the input IDs, the attention mask, and the KV cache; later we will see why we need the KV cache. Suppose that these features have been merged, so we get these merged input embeddings.
What are they? Let's visualize it. So what we are doing is basically creating this stuff here: first we take the image features extracted by the vision encoder, which are here, then we resize them using the multi-modal projector, which resizes each embedding vector to the correct size so that it can be concatenated with the embeddings of the text tokens. The text tokens, when we tokenize them, already contain some placeholder tokens, the image tokens we saw before in the processing_paligemma.py file. Our goal is to replace each of them with the features extracted from the vision encoder after they have been resized by the multi-modal projector, and for that we use this method here: it takes the resized image features and the input embeddings from the language model, which contain the text tokens and the placeholders, and it replaces the placeholders. So suppose that everything has now been replaced.
For now we treat it as a black box. What we are going to do is feed this whole sequence, image features plus text tokens, to the language model, which will use the prompt of the user and the image to generate some text. So let's implement this part here, which is just calling a method, and later we will implement the language model itself. So for now we have the structure of what we are doing: we tokenize the text, which already contains placeholders, we replace these placeholders with the features extracted from the vision encoder, we feed everything to the language model, the language model generates some output, and we return this output.
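A purely schematic outline of this forward flow, with no KV-cache handling; the method and attribute names are placeholders approximating the structure just described, not the video's exact code.

```python
def forward(self, input_ids, pixel_values, attention_mask, kv_cache=None):
    # 1) Text ids (including the <image> placeholders) -> embeddings.
    inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
    # 2) Image -> contextualized patch embeddings from the vision encoder.
    image_features = self.vision_tower(pixel_values)
    # 3) Resize each patch embedding to the language model's hidden size.
    image_features = self.multi_modal_projector(image_features)
    # 4) Replace the placeholder embeddings with the real image features.
    inputs_embeds, attention_mask, position_ids = self._merge_input_ids_with_image_features(
        image_features, inputs_embeds, input_ids, attention_mask, kv_cache
    )
    # 5) Let the decoder-only language model attend over image + text and produce logits.
    return self.language_model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        kv_cache=kv_cache,
    )
```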
We feed everything to the language model. The language model will Generate some output and we return this output Now our goal is of course to implement all of these blocks that we have created that we have taken for granted for now The first thing that we can do is to implement this polygamma config which will give us some understanding of what are What is the kind of configuration that this polygamma needs?
For that we create it we need to create this polygamma config Okay, the polygamma config basically takes as input so the polygamma is So what is gamma? What is polygamma? And what is cglib? I think you should already have an understanding of it now. So polygamma is all of this stuff here all this stuff here So it's a combination of a vision encoder and a text decoder language model.
So a gamma model So it's composed of two parts It's composed of a cglib vision encoder along with a linear layer that will change the embedding size And it's made up of a language model called gamma language model So the polygamma needs of course the configuration for this block here So the language model and the configuration for the vision encoder so that it can create an instance of This cglib class and of this gamma language model passing their own configuration to them And this is what you see here So you have the vision config which is the configuration of the vision encoder the text config which is the configuration of the text decoder which is gamma The ignore index is not used.
We will not be using it for labels So if you are training, but we will only doing inference The image token index is the token corresponding to the placeholder image token. So the This token here. So let's this this stuff here The vocabulary size. So what is the vocabulary size of the model?
the projection dimension is how What is the final dimension that the image features should be resized to before feeding to the language model? So what is basically the output size of this linear layer? Then we have the hidden size which is the embedding size of the language model So the language model has some tokens.
These tokens become embeddings, and these embeddings have a number of dimensions: 2048 in the base version of Gemma. This other attribute is something that Hugging Face needs; we will not be using it. We save the padding token ID in case it's passed, we save the vision config and the text config, and then we create the configuration of the text language model, the Gemma model, to which we pass the text configuration, and of the vision encoder, to which we pass the vision configuration.
We pass the vision configuration We have how many number of tokens For image tokens each image will generate which is basically the size of the image divided by the patch size So it's actually how many patches you get for each image Um, which is also corresponds to how many image tokens you get here Because of course if you divide the image by four you get four patches If you divide it in smaller parts, you get more patches and each a polygamma size So polygamma two to four, I think it has 256 tokens.
Another one has more etc, etc Um, the projection dimension is how we want to resize this image tokens, etc So now let's create also the configuration for the gamma model which is just the configuration of any language model because it has A vocabulary size how much tokens we have in our vocabulary the hidden sizes.
which is the size of the embedding vector of each token; the intermediate size of the feed-forward layer, as we saw before in SigLIP; the number of hidden layers, so how many transformer layers this Gemma language model has; and how many attention heads we have. Here there is a difference: this is called grouped query attention, where you have a different number of heads for the queries than for the keys and values. The number of heads here refers to the number of heads for the queries, and the number of heads for the keys and values is this other parameter. We will see later how it works.
We will see later how it works The head dimension is how many Dimensions each head will work with as we saw before we divide this big embedding into smaller groups one dedicated to each head This is how many dimensions each head will watch Now this configuration. It's a hard-coded But actually it will come from the configuration file of the polygamma model that we will load So if you go to hugging face, you can see Hugging face, polygamma You go to two to four you can see here We will load all this configuration from this config.json file Which as you can see contains this text config this visual config which contains exactly the information that we need here This max positional encodings indicates how much the maximum number of positions our model has been trained upon Which is necessary for the rotary positional encodings RMS norm is we will see later.
What is the rms normalization, but just like the layer normalization We have this parameter called rms norm fps. Okay, I will explain it later Actually, the rope data is another parameter of the rotary positional encoding, which is the base frequency And also we will see later. What is it?
the attention bias Indicates if in the attention matrices We are we want the bias because as you remember we have the wqwk and wv matrix These are linear layers and we can have also the bias term, but we I believe we never use the bias for this And it looks like we yeah, we don't use any bias for it.
So if they don't overwrite it then it remains false Dropout just like before we are not going to use it and the padding token id and we save all this stuff. So nothing so Sophisticated here now the first thing that we are going to do since we have already implemented polygama for conditional generation I believe that the first thing that we can do is this method here merge input ids with image features But for that we will need to understand.
What is the kb cache? All right. So let's start coding this method. So Let me go also here in the code that I have already written. So I will code it piece by piece So that we don't get lost in the explanation So we create this method which has this signature If you don't see it all it's this one here And let's extract.
Okay. The first thing we do is we extract some information from the inputs Which are what is the embedding dimension from the image features because we need to Which are already resized Because we pass them after sending them through this multimodal projector So they have already been resized to the same size of the text tokens Then we have these input ids which tells us how many tokens we have the input ids If you remember correctly is the not the embedding of each token It's the number indicating the position of each token in the vocabulary While the input embeddings are the embedding of each token after they have been extracted from the embedding layer of the language model And that's why we have this It's a vector now The first thing that we do is we scale these image features so We scale these image features which also helps.
It's like the same kind of scaling that we use in In the attention mechanism, so we do query multiply by transpose of the key divided by the Square root of the model here. We do the simple the same kind of scaling Because probably they have tried multiple variations of the model and we want the magnitude of the numbers to remain the same That's why we divide it by the the size of the hidden side.
So if they if you want to double the for example the embedding Size of the image features you want the magnitude of numbers more or less to remain the same. That's why you you scale them Now the first thing that we need to do is to create the final tensor that will hold the combined Features of the image tokens and the text tokens and this is and it's this tensor here It's made up of zeros and it has the size of batch size Sequence length.
What is the sequence length? It's the number of input IDs we have, and these input IDs come from the processing class: the placeholders for the image tokens, the beginning of sentence token, the tokens of the prompt, and the token corresponding to the newline character. So we create this sequence of empty embeddings with the embedding dimension as the last size, which is the same size as the embedding vectors of the language model, because the image tokens and the text tokens will have the same size, embed_dim here. We also want it to have the same dtype as the input embeds, for example float32, and we put it on the same device. Then we create some masks that will be useful for understanding which position is a placeholder token, which is a text token and which is a padding token, even though we will not be using any padding; I just took the original implementation, which was already handling padding, but we will actually never have padding tokens. How do we know which one is a text token?
Well, a text token is something that is not an image placeholder token and it's not a padding token What is an image token? Well something that is equal to the image placeholder token and the padding tokens are the tokens that correspond to the padding token id this mask will be useful for us to understand where to put the embeddings of the image tokens in this Final embedding tensor where to put the text token in this final embedding tensor and where to put the padding tokens in this final embedding tensor We expand them so Here we see them and later we will see why we need to expand them.
So basically we are adding one more dimension: we already have the batch and sequence dimensions, because they are given by the input IDs, and we add the embedding dimension and expand each mask along it. Later we will see why we need it.
So basically this means that The text mask here. So let me draw a sample of how it may look like Oops, what did I do? here the text mask here Will be something like this. So if suppose that the The input ids are the tokens corresponding to the image.
So suppose that it's the 567 so we have So we have many tokens corresponding to the placeholder for the image then we have the beginning of sentence token suppose usually it's the token number one Then we have the prompt of the user So suppose that it's a token number 56 78 and 99 and 21 and 11 then we have the Slash and token.
Suppose it's token number two. The text mask will then be something like this: zero, zero, zero, zero, zero for the image placeholders, then one, one, one, one, one, one for the text, and also one for the "\n", because it is still part of the text. The image mask will be one, one, one, one, one for the placeholders and then a series of zeros, because all the others are text tokens. And the padding mask will be all zeros.
I won't write all of them, but you get the idea: all zeros, because we don't have any padding tokens. Then we expand them: this expand basically repeats these zeros and ones along the embedding dimension that we added with the unsqueeze, and we will need that for another method, torch.where. For now, just keep in mind that we are expanding each mask by repeating its series of zeros and ones along a new dimension.
We are just expanding this token by repeating this series of zero and one along a new dimension So the first thing that we do is we copy the text Embeddings into this final embeddings and we do this by using this method. So we say this final embeddings This wear method basically says that if this condition is true It will take the input from the second argument.
Otherwise, it will copy the third argument So if wherever this condition is true, it will copy this stuff here wherever this condition is false. It will copy this stuff here so We are saying that whenever Um The the text mask is one We copy the embedding from the input embeds which correspond to the text inputs plus the placeholder for the image But we will only be copying the text Text tokens because for the image image tokens, we will have zero in this mask Otherwise just keep the final embedding as it is Then we add the image tokens As you can see here Which is using another method called the must scatter and we cannot use the torch dot where because the sequence length of Image scaled is not equal to the sequence length of the final embedding But basically this does the same job as the where So what we are saying is that copy from the scaled image features where this stuff is true So we are copying the image features where where the image mask is true where the image mask is true Where we have the placeholder tokens for the image so we are copying in the final embedding the image tokens Where before we had the placeholders?
Then we handle the padding: we just zero everything out, because we don't care about what is in the padding positions. So wherever the padding mask is true, copy a tensor made of zeros; otherwise keep the final embedding as it is. Now comes the interesting part. So far we have created the final embeddings, which is this stuff here.
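Before moving on, here is a self-contained sketch of the masking and merging just described, with tiny made-up sizes and token IDs so it can be run and inspected; it mirrors the idea, not the exact method from the video.

```python
import torch

batch, seq_len, embed_dim, num_image_tokens = 1, 8, 8, 3
image_token_index, pad_token_id = 3, 0                       # hypothetical ids
input_ids = torch.tensor([[3, 3, 3, 1, 56, 78, 99, 2]])      # <image> x3, <bos>, prompt, "\n"

inputs_embeds = torch.randn(batch, seq_len, embed_dim)       # text embeds + junk at <image> slots
scaled_image_features = torch.randn(batch, num_image_tokens, embed_dim)
final_embedding = torch.zeros(batch, seq_len, embed_dim)

# Build the three masks and repeat them along a new embedding dimension.
text_mask = ((input_ids != image_token_index) & (input_ids != pad_token_id)).unsqueeze(-1).expand(-1, -1, embed_dim)
image_mask = (input_ids == image_token_index).unsqueeze(-1).expand(-1, -1, embed_dim)
pad_mask = (input_ids == pad_token_id).unsqueeze(-1).expand(-1, -1, embed_dim)

# Where the text mask is 1, copy the text embedding; elsewhere keep the zeros.
final_embedding = torch.where(text_mask, inputs_embeds, final_embedding)
# The image features have a different sequence length, so torch.where cannot broadcast
# them; masked_scatter fills the True positions in order with the image features instead.
final_embedding = final_embedding.masked_scatter(image_mask, scaled_image_features)
# Zero out whatever sits at padding positions (none in this example).
final_embedding = torch.where(pad_mask, torch.zeros_like(final_embedding), final_embedding)
print(final_embedding.shape)  # torch.Size([1, 8, 8])
```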
Let me show you again on the iPad: it's this stuff here. Here we have the first image token embedding, the second, the third, and so on up to 256 image token embeddings in the base version of PaliGemma, if I remember correctly, and then we have the embeddings of the tokens corresponding to the prompt, plus the padding, which we will never have because I excluded it from my implementation. Now we come to the interesting part, which is the creation of the attention mask. The attention mask has to be created in a particular way based on how we work with the KV cache, and for that I need to introduce the KV cache. That's why this part is interesting. So let's go.
So that's why this part is interesting. So let's go So let's talk about this thing called KV cache But before we talk about the KV cache, we need to understand what is the problem that the KV cache is solving So when we train a language model So as I we saw before the transformer can be thought of as a model as it's a sequence to sequence model Which means that you feed it a sequence of n tokens and you get as output n tokens These n tokens as output are not normal tokens anymore They are contextualized tokens means that each of them is not capturing information only about itself But also about other tokens which depend on the mask that you use if you use the causal mask It means that only each token will only capture information about itself and all the previous tokens If you are not using any causal mask, then each token will encapsulate information about all the other tokens in the sequence Which is what we do with vision encoders like the image encoder we saw before the Sigleap one Because the transformer is a sequence to sequence model, so let's open our ipad Now because the transformer is a sequence to sequence model It's very useful during training So suppose that we want to train we train a language model on the following sentence.
So it's always the same which is I love pepperoni Pizza Pardon my calligraphy I write very fast recently we feed it to this black box that we will call the transformer model Each of these stuff here each of these uh tokens is actually an embedding So we will get an as output a list of embeddings, but they will be contextualized Contextualized one for the first token one for the second token.
So this is the second embedding, this is the third embedding, and this is the fourth embedding. I am again making the simplification that each word is a token and each token is a word. How do we train a language model? We force it to predict the next token given the contextualized embedding. Since we are using the causal mask, this first contextualized embedding contains information only about the token "I"; this one contains information about "I" and "love"; this one contains information about "I love pepperoni"; and this one contains information about all the tokens, "I love pepperoni pizza".

What labels do we use when training a language model? We want the model, given the prompt, to predict the next token. So given only "I", the language model should predict the word "love", so the label here is "love". Given the prompt "I love", it should predict the token "pepperoni". Given the prompt "I love pepperoni", it should predict "pizza". And given the whole sentence, it should output the end-of-sentence token, which means: hey, I'm done with the generation. This is how we train a language model.

How do we actually run inference with a language model? In the same way. We start with what is known as a prompt. Suppose the user gives us only one token as a prompt, the word "I", and suppose that our language model has been trained on the sentence from before, "I love pepperoni pizza". How can we generate the entire sentence?

Well, we feed this single token to our black box, which is our transformer (I will draw it reversed now because I don't have space above). The transformer is a sequence-to-sequence model, so it takes as input one embedding, corresponding to our prompt token "I", and it generates one contextualized embedding. What do we do with it? We project this single embedding into logits using the linear layer at the output of the transformer, which is this layer here.

So let's go back here. This is the output embedding. We convert it through the linear layer into logits. The logits tell us the score assigned by the language model to each token in the vocabulary, so how likely each particular token is to be the next one. How do we convert it into a probability score, something that sums up to one? We use the softmax. So let's apply the softmax: it remains a single vector of scores, but now they all sum up to one. Which token do we select? Usually the one with the highest probability; this is called the greedy strategy. There is another strategy called top-p, which means that we sample from the tokens with the top scores whose cumulative probability reaches, say, 90 percent. We will look at top-p later, when we implement the inference; for now, just assume we always pick the token with the highest probability score, so we use the greedy strategy. With the greedy strategy, if the model has been trained well, it will tell us that the next token is very likely the token "love". So this is how we know what the next token is.
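To make the greedy strategy concrete, here is a minimal sketch in PyTorch (the shapes and vocabulary size are made up; top-p sampling is implemented later in the video):

```python
import torch

# Hypothetical logits for one sequence: (batch_size, seq_len, vocab_size).
logits = torch.randn(1, 1, 32_000)

# Softmax over the vocabulary turns the scores of the last position into probabilities.
probs = torch.softmax(logits[:, -1, :], dim=-1)

# Greedy strategy: simply pick the token with the highest probability.
next_token = torch.argmax(probs, dim=-1, keepdim=True)   # shape (1, 1)
```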
How do we then generate the token after that? We take this token "love" and we feed it back into the input of the language model. So now we feed a new input: let's remove this stuff. Now we are feeding two tokens to the language model, and the language model is our transformer model.

It's a sequence-to-sequence model, which means that if it takes two tokens as input, it will output two tokens. So it takes as input two embeddings; I am drawing the text here, but you should think of these as the embeddings of these two tokens.

It will output two embeddings: one corresponding to the token "I", and one corresponding to "I love". Very ugly writing, so let me write it better: one corresponds to the first position, the token "I", and one corresponds to the second position, which, because it is a contextualized embedding, includes information about both "I" and "love".

Before, what we did was project the output embedding into logits. Here we have two embeddings; which one should we project into logits? Of course, the second one. Why? Because this embedding includes information about both tokens, so it's like we are using the entire prompt.

So we send it to our linear layer, and it becomes logits. Then we apply the softmax, which converts these logits into probability scores. How do we determine the next token, using "I love" as the prompt? We sample from the softmax output: we take the token with the highest score as the next token, and if the language model has been trained well, it will be the token "pepperoni".

Now, how do we generate the token after that? We take the word "pepperoni", feed it back into the language model, and ask the model again: hey, generate the next token. Let's delete this stuff here. "I love pepperoni": we feed these three tokens to the language model, they are converted into three embeddings and then fed to the transformer, and the transformer outputs three embeddings, one per position. The first one is a contextualized embedding that only includes information about the token "I"; the second includes information about "I" and "love"; the third includes information about "I love pepperoni". Which one should we project?

Of course the third one, because it's the one that encapsulates information about the whole prompt. We keep going this way and we generate text. Now, what is the problem here? The problem is that at every step of inference we are generating a lot of embeddings that we are not using; imagine the prompt is very large. We are creating them because the transformer is a sequence-to-sequence model, so it generates them, but then we only project one single embedding into logits and through the softmax to determine the next token. And as you know, the transformer uses the attention mechanism, which generates a matrix that is sequence by sequence, the attention scores matrix we saw before, so with a thousand tokens it generates a matrix that is one thousand by one thousand.
That's one million numbers: a huge matrix, and then you only need the part of it that produces this one embedding here. So is there a way to avoid generating the embeddings that we are not going to project into logits, and only generate the one we actually need to predict the next token? Yes, and it's possible through what is known as the KV cache. The trick is here, so let's open this other slide. When we calculate the attention matrix, so the queries multiplied by the transpose of the keys, divided by the square root of d_model (or d_head in the case of multi-head attention), what we are getting is the following. Suppose we want to generate the word "pizza" using the prompt "I love pepperoni". If we do it naively, we pass all of these embeddings, so "I", "love" and "pepperoni", to the transformer.

The transformer converts them into queries, keys and values using the projections Wq, Wk and Wv (let me check if my camera is still working... yeah, it is), and then we use the queries, keys and values to calculate this matrix here.

So the queries multiplied by the transpose of the keys, which is this matrix here. Then we multiply this matrix by the V sequence, and it gives us the output of the attention, the contextualized embeddings you can see here. We also saw before that multiplying by V is a weighted sum that uses these attention scores as the weights. Now, when the input of the model is "I love pepperoni", the output is three contextualized embeddings: the one corresponding only to the word "I", the one corresponding to "I love", and the one corresponding to "I love pepperoni". We know that we only need the last one, because it is the only one we need to project into logits to generate the next token.

So is there a way to not compute the two embeddings that we will not be using? Yes, and the trick is this: the last contextualized embedding is the result of multiplying not the whole attention matrix by the V sequence, but only the last row of the attention matrix by the V sequence. This first number here comes from the dot product of the last row of this matrix with the first column of that matrix; the second number in this output vector comes from the dot product of the last row with the second column; the third number comes from the dot product of the last row with the third column, and so on for all 128 dimensions. So to generate only this embedding, we need the last row of the attention matrix, but the entire V sequence. And because, as we saw before, the rows of the attention matrix correspond to the queries and the columns to the keys, to obtain only this last row we need only the last token as the query, but all the previous tokens (including the last one itself) as keys, and we also need all the tokens as values.
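Here is a small numerical check of this fact (a sketch with made-up sizes, not code from the model): attention computed with only the last token as query, but all tokens as keys and values, reproduces exactly the last row of the full attention output.

```python
import torch

seq_len, d_head = 3, 128
q = torch.randn(seq_len, d_head)
k = torch.randn(seq_len, d_head)
v = torch.randn(seq_len, d_head)

# Full attention: a (seq_len, seq_len) score matrix, then a weighted sum over V.
full_output = torch.softmax(q @ k.T / d_head**0.5, dim=-1) @ v

# Same computation using only the last query row, but all keys and values.
last_output = torch.softmax(q[-1:] @ k.T / d_head**0.5, dim=-1) @ v

print(torch.allclose(full_output[-1:], last_output, atol=1e-6))  # True
```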
That's why we do the following when we generate text with a language model. Imagine we have a prompt; let me draw it in a way that is not confusing, I think we can continue here. So imagine we start the process of text generation again, but this time we do it with the KV cache. We start with one token; let me draw it top to bottom this time, otherwise it gets confusing.

Okay, we use only the token "I" as input to the language model. The model converts it into an embedding, then we feed it to the transformer; suppose it's made up of only one layer (actually it's a series of layers). This single token is converted into a query, key and value sequence, but in this case each sequence is just one token: the Q sequence is one token, the K sequence is one token, and the V sequence is one token. We do self-attention, which computes that matrix, the queries multiplied by the transpose of the keys, and it will be a 1-by-1 matrix because we only have one token. Then we multiply it by V, and it results in only one contextualized embedding as output. We project it into logits, which is another vector; we apply the softmax, which gives another vector; and then we sample the next token. The difference with the KV cache is that whenever we pass a token into this self-attention, we cache the key sequence and the value sequence into a buffer called the KV cache. So imagine there is a box here called the KV cache that is initially empty.

After we pass the token "I", it will contain the key embedding corresponding to the token "I". The KV cache is made up of a key cache and a value cache: this is the key cache, and then we have the value cache, which is also initially empty, but after we send in the first token we save the V sequence there too. It only contains one token, the token "I".

We compute the self-attention using the query, keys and values, and it results in only one output embedding. We project it into logits, apply the softmax, and sample the next token. Very probably it will be the token "love". What do we do now? Before, we took the word "love", put it back into the prompt, and asked the language model again for the next token. But with the KV cache we do something different.
With the KV cache we always take only the previously generated token, in this case the token "love", and use it as the input. Let me delete a little bit here. We use this single token as input to the language model: we convert "love" into its embedding, which is an uncontextualized embedding, and we feed it to the first layer of the transformer as query, key and value. For now the query, key and value each contain only one token, the embedding corresponding to "love". However, when doing self-attention, we don't use only this single token for the keys and values: we append it to the buffer called the KV cache.

So now the key cache contains "I" and "love", the value cache also contains "I" and "love", and we use these buffers as the key and value sequences in the self-attention. In other words, we take the token "love", convert it into a query, key and value (each a single token), append the key and the value to their respective buffers, and then use the contents of the buffers to compute the self-attention. We end up with one query but two keys and two values, which results in exactly the computation of the last row of that attention matrix, the row we are interested in to predict the next token, without generating all the other contextualized embeddings. Here we are only seeing two tokens, but later, with the third token, you will see it really is the last row of that matrix.

The output of this self-attention, with one query, two keys and two values, is one single embedding; I can guarantee it mathematically and you can verify it yourself. If you have one query, the self-attention mechanism generates a matrix whose number of rows is defined by how many queries you have, so we have only one row, and two keys, key number one and key number two, so it is a 1-by-2 matrix, and it results in only one output embedding when you multiply it by V. We also saw before, when we worked out the dimensions of the output embeddings, that it is only the last row that generates the last embedding, and that is exactly what we are doing here. So, with the self-attention computed like this, using the single token as the query but the contents of the KV cache buffers as keys and values, we obtain only one output embedding, which is exactly the contextualized embedding we need to generate the next token. We project it into logits, apply the softmax, and the next token will be "pepperoni".

Naively, what we did before was take the word "pepperoni", feed it back into the prompt, and then feed the whole prompt to the language model; with the KV cache it's different. We use only the last generated token, "pepperoni".

Let me write it: "pepperoni". We convert it into a single embedding, so the query, key and value here are one single token. But before computing the self-attention, we put this key and value into their buffers, so now the key buffer contains "pepperoni" as well, and the value buffer also contains "pepperoni". Then, to calculate the self-attention, we don't use just this single key and value, we use the contents of the KV cache, which now contains three tokens. As the query we use only one token, the word "pepperoni", but as keys and values we use the contents of the KV cache.

This results in exactly the last row we saw here, because as the query we have only the word "pepperoni" and as keys we have the tokens "I love pepperoni". When multiplied with the V sequence, which is three tokens because we also have the value cache, it produces exactly this one output embedding, which is exactly the one we need to predict the next token, which will be the token "pizza", I guess. And so on.
So this is the KV cache. It basically allows us, during inference, so during token generation, to avoid generating the embeddings of the whole input sequence and to generate only the last contextualized embedding, which is exactly the one we need to predict the next token. There is another thing we need to know about the KV cache, which is pre-filling. Here we started with a single token as the user's prompt, so we only used the word "I", but usually the prompt is longer than one token.

Suppose the user's prompt is made up of multiple tokens, say "I love". Because we already have access to all the tokens of the user's prompt (we are not generating them), we can instantly pre-fill the KV cache with the whole prompt: instead of first adding "I" and then adding "love", we add both of them in the same forward pass.

How do we do that? We take both tokens and convert them into embeddings, so we get two embeddings. We feed them to the language model as queries, keys and values; initially the KV cache is empty. This gives a Q sequence of two tokens, a K sequence of two tokens and a V sequence of two tokens. We put the keys and values into their respective buffers, the key buffer and the value buffer, which make up the KV cache, so now the key cache contains "I" and "love" and the value cache contains "I" and "love". Then we calculate the self-attention: two tokens for the queries, two for the keys and two for the values, since the KV cache contains two tokens, which results in a 2-by-2 matrix and therefore two output embeddings. Which one do we project into logits? Only the last one, because we are not interested in predicting the word "love"; we only want to know what comes after "love". So we take only the embedding corresponding to the position of the word "love", project it into logits, and apply the softmax to determine the next token. So only during this pre-filling phase do we allow the generation of multiple output embeddings, and then we discard the ones we don't need. Why do we do it this way? Because adding one token at a time would be too slow: if you have a lot of prompt tokens, you add them all at once to the KV cache, and then you use this pre-filled cache to generate one token at a time. The reason is that the GPU is very good at parallelizing computations, so doing all of this work in a single forward pass results in much lower wall-clock time than adding one token at a time. And this, guys, is the KV cache; a minimal sketch of it is below, and now we can finally code it.
So now we can finally code it Okay, let's code the next part. So we copy this part here and all of this And all of this actually let's copy it all So now that we know what is the kv cache We know that we have two parts to do when we work with the kv cache The one part is called pre-filling and one is token generation during the pre-filling.
We send all the prompt of the user to the kv cache To the model using as a query key and value and this will create the initial cache that will then be used by subsequent During token generation. So where we generate one token at a time Why do we do this two phase because we want the the prompt is already available to us We don't want to edit one token at a time while the token generation We want to generate one token at a time because we don't have these tokens so to create the attention mask for the for working with the kv cache basically, so when we are working with the pre-filling phase, we will have that the Number of queries key and value will be the number of the tokens inside of the prompt.
So we generate a mask that is sequence by sequence Because it will be used in the attention mask. So let's visualize it actually so suppose that we are doing the following so This suppose that we receive a prompt that is I love pepperoni and we want to generate the next token, which is pizza The attention calculation will result in the following attention score So it's a matrix that is three by three in which we want to mask out some interactions between tokens especially for each query cannot attend to future keys And the way we do that is we create an attention mask Of the same size of the attention matrix as you can see so three by three.
So sequence by sequence in which we Before we apply the softmax. We add this thing called mask to this matrix And this mask is made up of minus infinities for all the position in which we don't want any interaction to happen And this is what we are doing here.
So at the beginning we create We are inserting the prompt of the user and we should mask out future tokens, however in And we create a mask that is a token sequence by sequence So this is during the pre-filling so when the KB cache is not or the KB cache does not contain any item means that we are Doing it for the first time.
So we are pre-filling the prompt of the user Now we are not adding any minus infinity value to this KB is to this attention mask during the pre-filling. Why? For to understand that we need to understand how polygamma attends to the Image tokens and to the prompt of the user.
For that, let's open the PaliGemma paper, where we can see the attention mask. A prompt in PaliGemma is made up of the image tokens, which are 256 in the case of the smallest PaliGemma; then the prompt of the user, which is a beginning-of-sentence token plus the textual prompt (for example, "tell me where the photographer is in this picture"); and then a separator token, which is the newline token we saw before. As you can see, the attention mask here does not mask out anything for the part that corresponds to the prompt, meaning the image plus the textual prompt.

Why? This is quite different from what we usually do with language models. For the image tokens it's easy to see why we don't mask anything: each text token that we generate needs to access all the image tokens, so it is conditioned on all of them. That's why it's called conditional generation. And that's fine, because we saw that each image embedding encodes not only itself but also all the other image embeddings, and we want each text token to attend to the whole image. The real question is: why is the textual prompt not causal? Suppose the prompt is two tokens, for example "I love", and we want to generate the words "pepperoni" and "pizza" as the first and second output tokens you can see here. Why are we not applying any causal mask to the tokens of the textual prompt?

Because the textual prompt is usually very short and usually describes the task we want the vision-language model to perform; it's a choice the PaliGemma authors made. Since this prompt represents the task, we want all the tokens that will be generated to attend to all the tokens in the prompt, and moreover we want each token of the prompt to attend even to future tokens of the prompt itself. You can think of this axis as the queries and this one as the keys. When we do pre-filling, we have all the tokens of the prompt, the textual prompt that we send to the model plus the image tokens, and we do not need any mask there, because each token of the text prompt can attend even to future tokens of the text prompt. You can see that this is query number one of the text prompt, this is key number one, this is key number two, and query number one (the beginning-of-sentence token) can attend to key number two.

The reasoning of the PaliGemma authors is: since we are not generating this prefix, which is the prompt that tells the model what to do with the image, we do not need the model to be causal with respect to it. The only thing we are going to generate is the suffix, the output tokens predicted by the model using the textual prompt and the image, and that part does need to be causal. So the first token output by the model attends to all the previous keys, the three image tokens plus the four tokens of the text prompt; the next predicted token again accesses all the image tokens, the four text-prompt tokens, plus the last generated token; and the one after that accesses the three image tokens, the four text-prompt tokens, and the two tokens predicted before it. So it is causal only in the generated text, not in the prefix. This is different from normal language models, where even the prompt is pre-filled using the causal mask, because the prompt can be considered just a part of what the model would have generated if it had started from only the first token. But this is not the case in PaliGemma.

It's a choice the PaliGemma team made; it's not that a language model has to work this way, or that there is a clear advantage or disadvantage. The only advantage, if we want to call it that, is that the information about the prompt is replicated in each of these tokens, because each of them also includes information about the future tokens of the prompt. And the same thing was done when they trained the model: during training you don't mask out the future tokens inside the textual prompt, you only mask out what you expect the model to generate from the image tokens and the textual prompt.

So, to recap: let's go back to this image. What is the text prompt? When we run inference with a visual language model, we provide an image as the condition and then some text prompt, which describes what we want the model to do with this image, for example "tell us where the photographer is in this picture". The model then generates some tokens as output, telling us, in this case, where the photographer is. And when we train this model, we do not mask the tokens of the textual prompt; we mask, during training and also during inference of course (the model needs to work the same way in both), only what we expect the model to generate. The causality is only in the generated tokens, and it's a design choice. It doesn't have to be this way: normal language models mask all the tokens, with no special treatment of the prompt, because the prompt can be considered as something the model would have generated itself, even if it wasn't. So this is more a design choice than a technical necessity, made by the PaliGemma authors. In visual language models like PaliGemma the textual prompt is usually very short, it tells the model what to do with the image it is being fed, for example "localize where the cat is in this image", "extract all the numbers", "tell me where the photographer is", and so on, and the generated output is usually also short; models like PaliGemma are not used for generating very long content, although they can of course be fine-tuned to do that. A rough sketch of what this training-time mask looks like is below. So, let me delete this part, otherwise it remains here forever.
All right, so now we have seen how we generate the mask for the pre-filling: during pre-filling we do not mask out anything, because we mask neither the text prompt nor the image tokens. The interesting part is that when we generate text, one token at a time with the KV cache, which is this else branch here, we also do not mask out anything.

Why? Let's go back to the PaliGemma picture. When you generate the first token, it needs to access all the image tokens and all the text tokens, so we don't need to mask anything. When we generate the next token, as you can see, it needs to access all the image tokens and all the text tokens, plus the last generated token, so again nothing is masked. Then, for the token after that, we need to access all the previous tokens plus the two previously generated tokens, so again nothing is masked. Because we are generating one token at a time, each new token needs to access all the previous tokens, plus the image tokens, plus the textual prompt, so we never need to mask out anything.

You may be wondering why we never mask anything: it's because we are working with the KV cache, and with the KV cache we only generate one single row of the attention matrix at a time. As you can see, we always generate the last row, and the last row corresponds to the last token, which needs to access all the previous tokens, so there is nothing to mask.

However, during training you do need a mask, because the model generates all the contextualized embeddings in parallel and you want each contextualized embedding to be contextualized only on the previous tokens. So during training we would have a causal mask, but during inference, which is our case, we don't have any causal mask, at least when working with the KV cache and with models like PaliGemma. If you work with a normal language model like Llama, for example, when you do the pre-filling you actually do need a causal mask for the pre-filled part; but in the case of PaliGemma, because of the choices made by the PaliGemma team, we do not need to mask out anything. In the future I plan to make another video on how to fine-tune this model we have built, and there we will see that we need to introduce a mask, generated exactly as shown in the PaliGemma paper. Let me check if my camera is still working; sometimes I lose the connection, so I need to check every once in a while.

Okay, we have created this mask, which is filled with zeros; we would put minus infinities at all the positions where we want to mask something out, but we never mask out anything, so this tensor is always full of zeros. When we are pre-filling, we generate a sequence-by-sequence mask; when we are generating tokens, we only generate the last row of that matrix, so we have only one query (as you can see, we assert that the query length is one), and we have as many keys as there are in the KV cache, plus one, because before using the KV cache we add the current query token to it and then read it back before calculating the self-attention, as we saw before. And since in the attention computation there is one attention matrix for each head, we need to add the head dimension, which is why we unsqueeze a head dimension here. Here is a sketch of this logic.
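A hedged sketch of the mask logic just described; the function and argument names are mine and not necessarily the exact ones in our code.

```python
import torch

def build_attention_mask(q_len: int, kv_cache_len: int, dtype=torch.float32) -> torch.Tensor:
    """kv_cache_len is how many tokens are already stored in the KV cache (0 during pre-filling)."""
    if kv_cache_len == 0:
        # Pre-filling: (q_len, q_len) mask, and for PaliGemma nothing is masked out,
        # so it is just zeros (zeros get *added* to the attention scores, changing nothing).
        mask = torch.zeros((1, q_len, q_len), dtype=dtype)
    else:
        # Token generation: one query attending to all cached keys plus itself.
        assert q_len == 1
        mask = torch.zeros((1, q_len, kv_cache_len + q_len), dtype=dtype)
    # Add the head dimension: the same mask is broadcast to every attention head.
    return mask.unsqueeze(1)   # (batch, 1, q_len, kv_len)

prefill_mask = build_attention_mask(q_len=259, kv_cache_len=0)   # e.g. 256 image + 3 text tokens
gen_mask = build_attention_mask(q_len=1, kv_cache_len=259)       # one new token per step
```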
Now that we have generated the attention mask, let me check what else we need to do. We need to generate the positions of the tokens, which will be used by the rotary positional encodings. When we are in the pre-filling phase, we have n tokens that are part of the prompt of the user, the image tokens plus the text tokens, and we need to generate enough positions to apply the rotary positional encoding to all of them.

How many do we need? As many as there are tokens in the prompt, which is also indicated by the number of ones in the attention mask produced by the PaliGemma processor code: when you tokenize the text, it gives you the input IDs and another tensor of the same size made of all ones, indicating that nothing is masked, and counting the ones also tells you how many tokens there are in the input IDs. So that's what we do here: we generate enough positions.

When we are doing the pre-filling, suppose it is made up of 256 image tokens and then three tokens of the textual prompt: this generates the sequence 0, 1, 2, ..., 255, 256, 257, 258, which is then used to decide which rotary positional encoding to apply to each token. When we are doing token generation, however, we only have one single query to which we need to apply the positional encoding, so we generate only one position, the one corresponding to the last token. During token generation we have some tokens already saved in the KV cache, plus one new token, the last predicted token, which we use as the query; to know the position of this token we again use the attention mask, which is all ones, and how many ones there are depends on how many tokens are in the KV cache plus one, because of the new token we need to add to the KV cache before doing the self-attention. So what we do here is count how many ones there are in the attention mask (which already includes that plus one) and take that last number: this is how we generate the position IDs, sketched below. And then we return this stuff here; let me write the return.
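A sketch of how these position IDs can be derived from the attention mask of ones; the exact indexing convention in the real code may differ slightly, but the idea is the same.

```python
import torch

# Hypothetical prompt: 256 image tokens + 3 text tokens, all unmasked (all ones).
attention_mask = torch.ones(1, 259, dtype=torch.long)

# Pre-filling: one position per prompt token -> 0, 1, 2, ..., 258
prefill_position_ids = attention_mask.cumsum(-1) - 1

# Token generation: the mask now has one extra 1 for the new query token,
# and we only need the position of that last token.
attention_mask = torch.ones(1, 260, dtype=torch.long)       # 259 cached tokens + 1 new token
gen_position_ids = attention_mask.cumsum(-1)[:, -1:] - 1    # tensor([[259]])
```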
So we are counting how many ones there are in the KV cache Which is already plus one And then we take this last number And we this is how we generate the position IDs And then we return this stuff here, so let me return this stuff Okay, so we have implemented this method So what this does this method do this method basically takes as input the image features It takes as input the input IDs and the input embeddings What are the input embeddings are the image the embeddings of the image placeholder, which we will not use And then the image features our goal is to put all the image features in the right places in this input embeddings based on where Are these image embeddings placeholder positions?
And we did we do it here then Here actually then we create the attention mask, which is basically just made up of zeros which Do not confuse the zeros in the attention mask We are creating here with what we are probably commonly used to see in the attention mask So let me show you actually this one also So usually you are probably used to see the attention mask as a bunch of num ones and zero and the zero indicates which number Should be masked and the one which indicates what is the number that should not be masked This ones and zero is actually then converted into a number of in a series of minus infinities and zeros before Being added to the attention matrix Instead of creating a ones and zero which then converted into minus infinities and zeros We are already creating the mask that can be directly added to the attention mask So we are creating a bunch of zeros, which basically means that You add a bunch of zeros to this matrix So it's like you are not masking out anything If you want to mask out something then you need to add some minus infinities in this mask, but we never add any minus infinities So we are not masking out anything And this is our method that combines the image features with the text tokens Our next goal is to create the structure of the polygama Actually, we can create this polygama multimodal projector.
Yeah All right. So let's create this polygama multimodal projector. Let me put away this stuff here We just copy it. It's very simple. I just I don't even need to copy first the constructor and then So the polygama multimodal projector is just that linear layer that converts the size of the image features Extracted from the vision encoder into the same size of the embedding size that is used by the language model So it's just a linear layer that converts the hidden size of the vision model into the projection dimension, which is equal to the embedding size of the text text model here So this project projection dim is equal to the you can see it here is equal to the hidden size That is been then used by the language model So it's basically resizing the the embeddings so that they can be concatenated with the text tokens Let's go back here.
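As a reference, here is a minimal sketch of this projector (just one linear layer; the argument names are mine):

```python
import torch.nn as nn

class PaliGemmaMultiModalProjector(nn.Module):
    def __init__(self, vision_hidden_size: int, projection_dim: int):
        super().__init__()
        # Maps the vision encoder's hidden size to the language model's embedding size.
        self.linear = nn.Linear(vision_hidden_size, projection_dim, bias=True)

    def forward(self, image_features):
        # (batch, num_patches, vision_hidden_size) -> (batch, num_patches, projection_dim)
        return self.linear(image_features)
```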
In the forward method, as you can see, we just apply this linear layer. Our next step is to code the language model itself. The Gemma language model is a transformer model, so we create this GemmaForCausalLM, which takes as input the configuration of the Gemma model, plus the Gemma model itself, which we will create later. In Hugging Face, whenever you see something called "...ForCausalLM", it is a transformer model plus a language modeling head, which is the linear layer that projects each embedding into logits. So GemmaModel is the transformer model, and GemmaForCausalLM is the GemmaModel plus a linear layer; that's why we reuse that instance and add a linear layer.

The forward method will be very simple. We need to implement these two methods, which are used for weight tying: we saw before that weight tying means sharing the weights of the embedding layer with the logits layer. So when we tie the weights, we just copy them from the embeddings to the language modeling head, which is the linear layer that converts embeddings into logits. Then we have the forward method, which is also very simple, because it does nothing except send everything to the language model and then apply the language modeling head to convert the output into logits. As you can see here, we pass the attention mask, the position IDs, the input embeddings and the KV cache to this language model, which we will implement later. The output of this language model is a series of embeddings, but we do not want embeddings: we want logits. So we take the outputs, extract the hidden states (the series of embeddings), apply the language modeling head (the linear layer), make sure the result is in floating point, and return the logits; and if the user passed a KV cache, we also return the updated KV cache. That's it. A rough sketch of this wrapper is below.
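A rough sketch of this wrapper, assuming a GemmaModel class (built next) that returns the final hidden states; attribute names such as embed_tokens and the return format are my own simplifications, not necessarily identical to the code we write.

```python
import torch.nn as nn

class GemmaForCausalLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.model = GemmaModel(config)                       # the transformer itself (coded next)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def tie_weights(self):
        # Weight tying: the lm head reuses the embedding matrix.
        self.lm_head.weight = self.model.embed_tokens.weight

    def forward(self, attention_mask, position_ids, inputs_embeds, kv_cache=None):
        # The inner model returns contextualized embeddings; we turn them into logits.
        hidden_states = self.model(attention_mask, position_ids, inputs_embeds, kv_cache)
        logits = self.lm_head(hidden_states).float()
        output = {"logits": logits}
        if kv_cache is not None:
            output["kv_cache"] = kv_cache
        return output
```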
That's it Because here there is no logic the logic will be here in gamma model Yeah, so let's go to implement the gamma model, all right So what is a language model a language model is an embedding layer plus a series of transformer layers And then we have the language modeling head.
The language modeling head is already implemented here in gamma for causal language modeling So we just need to create the other part which is the embedding layer and the list of transformer layers Let's do that. So we create first the constructor. So this gamma model which takes the configuration some Information that it needs so the vocabulary size why we need a couple vocabulary size because we need to create the embeddings how many embeddings we have Depending on the number of tokens in our vocabulary each embedding vector will be of size a hidden size This indicates the position of the embedding token inside of the vocabulary And basically I think the embedding layer takes it as input so that it does not update the gradient for this token here And then we have a list of layers for the For our transformer These are called here are called gamma decoder layers.
So they are the transformer layers. We have how many of them we have Depending on this parameter num_hidden_layers. And then we have a final normalization, which is a rms normalization, which I will describe later What is it and why it's different from a layer normalization? We need to implement this method here get_input_embeddings, which is used by the language modeling head.
So as you can see we use it here We use it here to extract the initial embeddings From the language model which are then combined with the image features we saw before here and then send to the language model So the language model here is receiving not the input IDs, but it's receiving the embeddings already So the image embeddings plus the text embeddings Which is the same embeddings that we will receive here in the forward method of gamma model Now, let's make the forward method Which is also very simple because we do not implement much logic here So we receive the attention_mask, the position_ids, which are the position that we will apply for each token How to apply the positional encoding to each token We didn't talk about the positional encoding yet because we apply the rotary positional encoding in this case, which are applied During the calculation of the attention So they are not applied at the beginning like we saw before with the Sigleap or with the vanilla transformer But they are applied just before calculating the attention We have the input embeddings which we saw before are the image features plus the text tokens And in case we have the KB cache also the instance of the KB cache, which we didn't implement yet But we already know how it works so Let's do it.
So the first thing that it does it is Taking and applying some kind of normalization, which is the same reason we apply Normalization also to the input of the image features We want the kind of the magnitude of the numbers to remain the same even if the number of dimensions increases then this language model is made up of a series of layers of Transformer layers.
So what we do is the output of one layer becomes the input of the next one And that's what we are going to do here Oops, I've copied it So we take the decoder layer we send it the first hidden state which is the input of this forward after it's been normalized We send the attention mask.
We send the positional encodings the KB cache and it will return something which is Contextualized embeddings which become the input of the next layer So we replace basically these hidden states with the output of the first layer so that it becomes the input of the next layer And we do it for all the layers The output of the last layer we send it to a normalization Layer which is the rms normalization, which we didn't see yet, but we will talk shortly So I want to actually redraw what we are doing so far.
So we have arrived So for that, let's go back to the ipad All right, so What we are doing basically is this so we have created the Embeddings before we have merged them with the image tokens and the text tokens We did not apply any positional encodings because we are doing the rotary positional encodings Which are applied exactly when we calculate the attention So if we were to draw the the gamma architecture, it would be like this.
So we have the embeddings Then I remember there is some kind of normalization Doing but it's not a linear not a normalization layer. It's just we are normalizing the embedding So it's not a layer actually so we do not have to draw it Then we have a series of layers and we have n of them Each of these layers is made up of a normalization RMS normalization Then we have self-attention So attention then we have a Plus so a skip connection here Uh, I think I made it too small.
So let's make it a bigger This layer Then we take the output of this one and send it to another normalization, which is an again in rms normalization Then we send it to a feed forward network The output of this one is sent again to another Skip connection Then the output of the last layer will be sent to again another normalization, which is the rms normalization Then we send it to a linear layer for the logits Linear and let me shift it down and then we have the softmax so so far So far what we have made is basically we are now creating this structure here, but without coding the single block So we are just creating this Forward method that will run the output of the embeddings to each of this layer one after another and will apply the final normalization Rms normalization, which is this stuff here And then it will be sent to the linear layer when it will be sent to this linear layer With gamma for causal lm because as you can see gamma for causal lm will take the output of this Model, what is this model?
It's everything Except the linear layer and then we'll apply this linear layer called the language modeling head which will convert it into logits And after we will apply the softmax, but that is for sampling So now we need to create this decoder layer. So what is this decoder layer?
This decoder layer is this stuff here. We need to code the normalization. We need to code the attention mechanism We need to code the field forward network and of course all the skip connections. So let's do it All right. The first thing that we can implement actually very easily is the rms normalization.
So let's explore it So I have a slide ready there for that So as we saw before with layer normalization What we are doing is that we are normalizing each value using some statistic collected from the value from each item itself in the batch So each item in the batch suppose It's a batch of pictures and the first picture is that of the cat in the layer normalization What we are doing is for each dimension of this vector We calculate a statistic using this vector which is the mean and the standard deviation And then we normalize each value in this vector using these two statistics.
How do we normalize? Well, we recenter it around Here it's not written, but I can show you the formula here You basically subtract the mean that you calculated and you divide it by the standard deviation And the layer normalization actually works fine But recently in most language models, we are seeing another kind of normalization that is known as root mean square normalization Basically what we do with this normalization is that each of these features in this Each item of the batch We are normalizing it in such a way that it becomes like it's coming out from a distribution Gaussian distribution with a center of zero and a variance of one What they claim in the root mean square normalization paper is that they say that the success of the Layer normalization is not because of its recentering invariance, but because of its rescaling invariance which means that To actually reduce this internal covariate shift, which is the reason we use normalization The model does not need to see the values Centered around zero.
It just needs to see the values mostly surrounded around whatever mean they are centered upon So the values of this cat, for example, they do not need to be all around zero They could be all around 500 or all around minus 100 as long as they are more or less around 500 or more or less around minus 100 all of them That's the meaning of reducing the variance to one So we want most of the values to be around whatever mean it is And this is a hypothesis Made by this paper and it's actually verified because most language models right now They do not suffer from the internal covariance shift because they can be trained successfully very fast just like the layer normalization ones But by using this root mean square normalization, why it is advantageous Instead of layer normalization because instead of computing two statistics for the mean and the variance We only need to compute one statistic, which is this root mean square statistic Why we do not compute just the standard deviation like we do with the layer normalization because to compute the standard deviation You need to have the mean But we do not want to compute the mean because we do not want to recenter them So we do and because we don't compute the mean we cannot compute the the standard deviation So we replace this standard deviation with another statistic that allow us to Reduce the variance, which is this root mean square statistic Which is calculated as follows.
So we take each item in this vector So this item, this item, this item, this item, this item, this item We make the power of two of each of this item. We sum them up all together. We calculate The mean of this summation so divide by n basically Square root and this gives us the square root mean Square statistic for this item then we take each of this item and we divide it by this statistic Multiplied by a learnable parameter called gamma, which is one for each feature So basically with root mean square normalization, we are obtaining the same covariate, internal covariate shift I mean, it solves the same problem of the internal covariate shift as layer normalization, but by computing one less statistic So we compute less statistics.
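Written as a formula (following the description above; ε is the small constant added for numerical stability, discussed right below, and its exact placement can vary between implementations):

$$
\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \varepsilon},
\qquad
\bar{x}_i = \frac{x_i}{\mathrm{RMS}(x)}\,\gamma_i
$$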
So it is basically faster. Okay, let's implement it. Let me put away this stuff. We copy this class and put it here; it's very simple, and I'll explain it now. Let me copy the forward method too.

What we do with RMS normalization is: first, we create a weight parameter with one value for each feature of the vector to which we apply the normalization. How many dimensions does this vector have? The same as the tokens, because we are going to normalize tokens, so this dim will be the hidden dimension of our language model. Then we compute the root mean square statistic: we square each item and compute the mean of these squares, which is basically this term here. Then we take rsqrt, which is not the square root but one over the square root of its argument; so instead of dividing each item by the square root, we multiply it by one over the square root, which is exactly the same thing. Why do we have this plus self.eps in the argument of the square root? Because rsqrt is one over the square root of whatever is inside, and if this statistic came out very close to zero we would basically be dividing by zero, which would make the result very large. So, to avoid that division by zero, we add a very small number called eps to the denominator. It's the same kind of parameter we also pass to layer normalization, as you can see here: a very small number to avoid division by zero. So the forward method just performs this normalization and then multiplies each number by the gamma parameter, which is learnable, and returns the result. That's it, this is the RMS normalization.
This is normalization Now we can move to the next part, which is the coding of this decoder layers All right Let me check gamma model so we can create the decoder layer. So let's copy some code All right, so the decoder layer as we saw before it's this stuff here So we need to create something that manages all these blocks here So something that takes an input a list of embeddings apply a normalization then apply a transformer Attention, sorry Then it applies a skip connection Then the output is sent to another normalization then to a feedforward layer block then again another skip connection then produces some output So we will just create this simple block which is the same structure as the decoder layer that we have the encoder layer that We have created in cglib.
So it's the equivalent of This block here the encoder layer. It will be doing the same job So, let's do it So what we are doing is we are saving some stuff So the hidden size of the model then we are creating the attention Block, which we will code later the multi-layer perceptron, which is the feedforward network block The first normalization and the second normalization because in the decoder block we have two normalizations So as you can see here, we have one normalization here and one here So the forward method is the same very similar to the one we have coded for cglib so we take some hidden states, which is the Input to this layer the attention mask, which will be sent to the attention mechanism the position Ids which also will be sent to the attention mechanism because we are using the rotary positional encodings And the kb cache which also will be sent to the attention mechanism So let's actually let me just copy it and then I explain it because it's the same as the encoder So we take the input we apply the first normalization to this input which is This stuff here this normalization Then we send the output of the normalization This hidden state we send it to the self-attention block along with the attention mask the positional encodings and the kb cache And this will produce an output which will be then summed up with the skip connection here, which is this stuff here So we take the output which is hidden states plus this residual which we saved before to create the skip connection then we create another skip connection and we send the output of the of the Self-attention to the second normalization, which is this stuff here this normalization The output of the normalization is sent to the multi-layer perceptron, which is this one here And then we take the output of the multi-layer perceptron Which is the feed forward network plus the skip connection that we saved before which is this residual stuff here And that's this plus sign here and the output is then returned and this is the decoder layer Now we need to code the multi-layer perceptron and the self-attention Block, I believe the the faster stuff to do is the multi-layer perceptron.
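A sketch of this decoder layer, using the GemmaAttention, GemmaMLP and RMSNorm blocks we build in this video (the real attention module may return extra values, which this sketch ignores):

```python
import torch.nn as nn

class GemmaDecoderLayer(nn.Module):
    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.self_attn = GemmaAttention(config, layer_idx)
        self.mlp = GemmaMLP(config)
        self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(self, hidden_states, attention_mask, position_ids, kv_cache=None):
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states, attention_mask, position_ids, kv_cache)
        hidden_states = residual + hidden_states          # first skip connection

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        return residual + hidden_states                   # second skip connection
```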
So let's do that first So let me go there It's also very similar to the multi-layer perceptron that we have already coded for the Sigleap, but it's slightly different So the multi-layer perceptron here, which is also known as feed forward network is basically as we saw before in the sigleap It is something that two linear layers that first expands the embedding Vector applies some non-linearity and then reduces it back to the original size and this is what is done here But in this case, we also have another linear layer called the gate projection Which is used by the activation function that this gamma language model is using We saw that different language models have different activation functions, which is based mostly on heuristics on how they work So let's implement the forward method, which is very simple here and we will see why we need this gate projection I made a code to convert this very long.
I mean this very long this this this line into Series of steps so that you can see each single step being done independently but let me describe it what we are doing here basically is First we are applying the gate projection to the input to this feed forward network, which is a list of embeddings as we saw before And the function that we are using is the gelu function, which I believe is the same that we are using also for the sigleap Let me check Uh, yeah the same function But we also have this gate projection here So basically it's adding some learnable parameters before sending it to this activation function We multiply the output of this activation function with the up projection The up projection is basically the one that takes the embedding size from the original embedding to the intermediate size So it's expanded size And then the result of this multiplication, which is a vector Which is a tensor of size batch size sequence length and the intermediate size is then reduced back to the original size by this Down projection because with the up projection you are expanding and the down projection you are putting it back to the original size So the down projection will take the intermediate size back into the hidden size and this is the multi-layer perceptron of gamma It's slightly different than the other one because we have this gate projection Which is additional parameters basically And it's the same kind of gate projection that we also have if I remember correctly in lama in which we have this regular function With its own gate projection.
The gate projection is just learnable parameters applied before the non-linearity. We also said that the non-linearity is chosen based on heuristics about how well it works in a particular case, but also on some properties that we want from it with respect to the gradient: some activation functions let the gradient flow for negative values, others don't, etc. So it's all based on practical experience: someone tried a given activation, it worked better, and then everyone started using it. Okay, now we also have the multi-layer perceptron. Now comes the biggest part, but not the hardest, because we are already familiar with the attention mechanism. We need to code the attention mechanism, which will comprise the self-attention, the use of the KV cache, grouped query attention, which is something new, and the rotary positional encodings.
So it will be a bit of a learning experience. All right, let's start coding the next part, which is the Gemma attention. We start by creating the class; let me copy it, and I will go slowly because this one has a lot of innovations. Let's start with the constructor, which is our usual constructor: it takes the configuration of Gemma. We also take another parameter, which is the layer index, so the position of the layer in the transformer, because as you know Gemma is a decoder-only model made up of many layers, and each of these layers will have its own KV cache. To know which KV cache to use, since there is one cache per layer, we need to pass the layer index to each layer so it knows where to put its keys and values. Then we save some parameters: the attention dropout, which we will not use; the hidden size, which is the size of the embedding vector of each token; the number of attention heads for the queries; the head dimension, which is how many dimensions each head works with in multi-head attention, so a part of the entire embedding of each token; and how many heads we have for the keys and values. This number is different from the number of heads for the queries because we are going to use grouped query attention, so we can calculate how many groups we have; later I will explain how it works. Then we have the maximum positional embeddings, which is how many positions we can encode with the rotary positional encodings, and the base frequency of the rotary positional encodings.
Now we have some other stuff. First of all, we make sure that the hidden size is divisible by the number of heads, because as you know each head has to watch a part of the embedding of the entire token, so it must be divisible by the number of heads. Then we create our projections, which are the wq, wk and wv projections that we saw in multi-head attention, but in this case the number of output features is not the hidden size: it is calculated as the number of heads multiplied by the head dimension. Why is this different from the multi-head attention that we implemented for SigLIP? If we go back to SigLIP and look at the attention, you can see that each of the wq, wk and wv matrices is hidden size by hidden size (there it's called the embedding dimension, but it's the same thing), so the input is the size of the entire token and the output features are the same number of dimensions. Here, however, it's slightly different. Why? If we look at what numHeads is, numHeads is the number of heads for the queries, and in grouped query attention the number of heads for the queries is bigger than the number of heads for the keys and values. Later we will see why, but for now let's concentrate on the dimensions.
In this case the wq matrix, called q_proj in the code, has a certain number of output features. Suppose the number of heads is 8 and the hidden size is 1024. Then the wq matrix is 1024 by (8 multiplied by the head dimension), and the head dimension is how many dimensions each head watches, using the number of query heads as reference: 1024 divided by 8, which is 128. So it's 8 multiplied by 128, which means the wq matrix is actually 1024 by 1024. What changes in grouped query attention is the wk and wv projections: wk still has 1024 as the number of input features, because that's the hidden size, but the output features are the number of key-value heads multiplied by the head dimension. We can check in the configuration that the number of heads for the queries is 8 and the number of heads for the keys and values is only one, so actually this is not grouped query attention, it's multi-query attention. So suppose we have only one head here: one multiplied by 128 means wk is 1024 by 128, and wv has the same size, because the expression for wv is also the number of key-value heads multiplied by the head dimension. Then we have the output projection, which is hidden size by hidden size, because the number of heads multiplied by the head dimension (where the number of heads always refers to the query heads) is 8 times 128, so 1024 by 1024. As you can see, the difference with grouped query attention is that we have fewer heads for the keys and values, which results in a smaller projection for the embedding of each token when it is used as keys and values.
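As a rough sketch, the constructor with these projections could look like the following; the config field names (num_attention_heads, num_key_value_heads, head_dim, hidden_size) are assumptions that stand in for whatever the real configuration class uses:

```python
import torch.nn as nn

class GemmaAttention(nn.Module):
    # Minimal sketch of the projections used by grouped-query / multi-query attention.
    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.layer_idx = layer_idx                      # which KV-cache slot this layer uses
        self.num_heads = config.num_attention_heads     # e.g. 8 query heads
        self.num_kv_heads = config.num_key_value_heads  # e.g. 1 key/value head (multi-query)
        self.head_dim = config.head_dim                 # e.g. 1024 // 8 = 128
        self.hidden_size = config.hidden_size           # e.g. 1024
        assert self.hidden_size % self.num_heads == 0

        # Queries keep the full width: [hidden] -> [num_heads * head_dim]   (1024 -> 1024)
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        # Keys/values are compressed: [hidden] -> [num_kv_heads * head_dim] (1024 -> 128)
        self.k_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        # The output projection mixes the concatenated heads back to the hidden size.
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
```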
Let's see why. Let me open a new page and switch to the iPad. Okay, when we do normal multi-head attention, each token is divided into multiple groups of dimensions, one dedicated to each head. Suppose we have an initial token; let me use a pen and a smaller size. Imagine the token has 1024 dimensions in total: if we divide it into eight heads, each head manages 128 dimensions of this token, so dimensions 1 to 128, then the second head manages 129 to 256, etc., until the last one, which manages 897 up to 1024. So this is head number eight, this is head two and this is head one. When we do the product of the query with the transpose of the keys, each query does a dot product with each of the keys, but only in the part dedicated to that head, because each head works independently. So suppose this is our query; let me write it with a different color. This is our query, and then we have some keys. In normal multi-head attention we have the same number of heads for the queries and the keys, so suppose we have the same number of heads here too; let me copy this stuff over. What happens in normal multi-head attention is that each head does its own dot product. Head number one, for example, does the dot product of the first 128 dimensions of the query with each of the keys, and remember that we don't have just one key, we have multiple keys: the attention matrix is sequence by sequence, so each query attends to all the past keys. Here we can write key number one, key number two, key number three, and this is query number one, and we do it for all the queries, so each token attends to all the past tokens as keys, at least in language modeling. So head number one does the dot product of the first 128 dimensions between this query and this key, then between this query and this key, and then between this query and this key; in parallel, head number two does the same with the next group of 128 dimensions, so dimensions 129 to 256, for each of the keys, and so on, all in parallel. Each head works in parallel, and we do this for all the heads. The problem with multi-head attention was described in the multi-query attention paper; if you want, I can give you the reference to the paper.
It's the multi-query attention paper, this one here. In this paper Noam Shazeer described what the problem with multi-head attention is, at least from a computational point of view. He claims that with multi-head attention the bottleneck is not the number of computations we are doing, but rather the amount of data transfer happening inside the GPU, and for that we need to talk about how GPUs work. A GPU has a very big memory called the high-bandwidth memory, on the order of gigabytes or tens of gigabytes; I think the A100 goes up to 80 gigabytes. Then we have some smaller memory called local memory, which is on the order of tens of megabytes, so roughly three orders of magnitude smaller, and then we have the cores: there are many of them and they all work in parallel. When you do a matrix multiplication, what happens is this: you have the matrix you are trying to multiply in the high-bandwidth memory; the kernel that manages this multiplication, which is a CUDA kernel if you are using an NVIDIA GPU, copies, for example, the first part of the matrix from the high-bandwidth memory to the local memory, and each core works on a part of this big matrix to compute the multiplication in parallel. It's much easier to visualize with a summation: if you are summing two matrices, this matrix and this matrix, you get this matrix as output. If you divide it into four parts, the result of this part of the matrix only depends on these numbers and these numbers, so the first core can work on these two parts, the second core can work on these two parts and sum them to produce this one, the third core works on these two parts, giving this part, and the last core works on this part, giving this part of the matrix. As you can see, the matrix summation can be done in parallel by multiple cores, each working on a part of the matrix. What happens when we do multi-head attention?
Because the heads work in parallel, the first head copies the first 128 dimensions of the query to the local memory of the GPU, which are then accessed by the cores to compute the dot products. Meanwhile, the second head, at the same time, needs to copy the second 128 dimensions of each key token to the local memory, and also, for each query, the second 128 dimensions from the high-bandwidth memory to the local memory, so that the cores can work with them. Now, what does the multi-query attention paper say? It says that the bottleneck of the attention computation is not how many dot products we are doing, but how much time it takes to copy memory from the high-bandwidth memory to the local memory so that the cores can work with it. Why? Because the GPU has a lot of cores that are very fast at computation, but it is not as fast at copying data around, so memory copying is slow compared to how many computations it can perform. For example, let's open the A100 GPU datasheet: you can see that the A100 has 80 gigabytes of high-bandwidth memory and can do a certain number of teraflops if you are working in 32-bit. But as you can see, the GPU memory bandwidth is much slower than the number of operations it can do: teraflops means trillions of floating-point operations per second, so thousands of giga-operations per second, while here we only have around 2,000 gigabytes per second of memory transfer speed. So in a lot of computations we do on the GPU, the bottleneck is not how much compute we use but how much data transfer happens for that compute. As a matter of fact, Flash Attention exploits exactly this gap between computation and memory transfer: it reduces the memory transfer and redoes some computations, because recomputing something is faster than copying it again from the high-bandwidth memory. So basically we are willing to sacrifice computation to reduce data transfer: this is what Flash Attention does, and it's also one of the reasons we use gradient checkpointing. Gradient checkpointing means that during the backward pass we redo some computations instead of saving them, because if we saved them we would need to copy them again from the high-bandwidth memory to the local memory, and it's faster to redo them. So the wall-clock time, which is the total time to compute the attention, is bottlenecked not by the number of dot products but by how much data transfer happens. How can we reduce the data transfer in multi-head attention? One way is to use fewer heads for the keys. Imagine we use only one head for the keys instead of having multiple heads also for the keys and values.
So we don't have this part anymore: we only have multiple heads for the queries, and we have fewer heads, or just one head, for the keys. Imagine the extreme case in which we have only one head for the keys and values but multiple heads for the queries. What happens is that the first core copies the first 128 dimensions of the queries from the high-bandwidth memory to the local memory, plus the 128 dimensions of each token for the keys, and performs its computation. Meanwhile, the second head also needs to do its computation, in parallel: it needs to copy its own 128 dimensions of the query, but it does not need to copy the next group of 128 dimensions for each of the keys, because it can reuse the ones already copied for the keys. So each group of query heads shares some heads for the keys, so that they don't need to copy those dimensions again for different heads; they share the already copied ones. This is the extreme case of only one head for the keys, but we can also have groups of heads. For example, with eight query heads and four key heads, query heads one and two share this key head here, so the total amount of transfer for the keys is only this part, and query heads three and four share a different key head. As you can see, every two query heads share one key head, so these two heads don't need to copy 128 dimensions each, but 128 dimensions in total for both of them. This reduces data transfer, which speeds up the computation of the attention. And this is the reason why, in the attention code, the wk and wv projections have fewer parameters: we are compressing these tokens into smaller tokens, sized according to the number of key-value heads. For example, if we have only two heads for the keys, we compress these tokens into 256 dimensions, so that every four query heads share one key head; if we have four heads for the keys and values, this dimension becomes 512 and every two query heads share one key head. So the total data transfer is reduced and we speed up the computation of the attention. Of course, you may be wondering: doesn't this also reduce the quality of the model, since we have fewer parameters and less expressive power for the keys and values? It's true. If you look at the paper, they say that multi-query attention reduces the quality of the model, but not by much, so it's something we can afford to lose. Grouped query attention is a compromise; let's check the grouped query attention paper, which is this one. In multi-query attention you have one head for the keys and values, shared by all the query heads; in grouped query attention, a group of query heads shares one head for the keys and values. Multi-query attention, which uses only one head for the keys and values, reduces the quality quite a bit; a good compromise between full multi-head attention and multi-query attention is grouped query attention, which reduces the quality of the model only slightly but still gives you the computational advantage of reduced data transfer. Another very big advantage of grouped query attention is that you reduce the size of the KV cache, because, as you remember, we have one KV cache per layer, and in each KV cache we need to save every token; if we compress these tokens, the total amount of memory required for the KV cache shrinks. The KV cache is actually one of the bottlenecks in today's language models: we have these big language models with 70 billion parameters or more, and the problem with using them is often not even the GPU memory needed to store the model, but the memory needed to store this big KV cache, because you have to store every single token for every layer of the model, which grows very fast when you have a lot of tokens.
Now that we have seen how the group query attention works, we can proceed further Let's continue our journey So the next part that we need is this beautiful thing called the rotary positional Encodings that I will not explain right now. We I will explain them after Explaining completing the attention module for now, we just consider them as a black box that adds some information encodes the information of Position in the tokens and later we will see how it works Let's implement the forward method.
So the forward method is this one so basically it takes the hidden states, which is the input to the After the in the decoder layer is the output of the first RMS normalization Then we have the attention mask the position in the position that we need to apply to each token because we need to apply the positional Encodings and then the KB cache in case we are using it and now we will implement it So the computation of the attention is the same as before Let me copy a big part.
So like this The first thing we do is we extract the batch size and what how many What is the length of the queries? So what is the length of the input sequence because as you remember when we do token generation During the prefilling the QLAN will be all the inputs prompt But then during token generation the Q will only be one single token because we want to Generate all the last part of the attention matrix.
So the last row so we need only one query But how can we have all the keys to attend to because we have something called the KB cache which will store all the keys So what we are computing here is the same as before So we are converting the input sequence into query key and values and then we are splitting this Embeddings into groups of dimensions based on how many heads we have for the query key and values For the query, we will split it into numHeads number of groups Each number or each group will have headDim number of dimensions and for the keys and values We will have numKeyValueHeads number of groups and each group will have headDim number of dimensions to manage Then we do this transposition so I can show you again.
what this transposition does. Let's go back here. The first part, up to the transposition, is this: we multiply the input sequence with wq, wk and wv and split these embeddings into heads, so that each embedding becomes a list of groups, where each group manages some dimensions. What we end up with is basically a sequence of tokens, where each token is made up of groups and each group manages, for example, 128 dimensions. Then we use this transposition because we want the heads dimension first, so that we have a structure like this: instead of a sequence of tokens where each token has groups of dimensions, we want a list of groups where each group is a head; each head has a number of tokens equal to the sequence length, and each token is a mini-token made of the dimensions dedicated to that specific head. So head number one has 128 dimensions, head number two has the next group of 128 dimensions, etc., until the last one, which has the last group of 128 dimensions. This allows us to compute the multi-head attention for this sequence, this sequence, this sequence and this sequence all in parallel, and this is the meaning of this transposition. The next thing we do is apply the rotary positional encodings. We haven't talked about them yet and we will later, but for now keep in mind that we are not changing the shape of these queries, keys and values; we are just modifying them by adding some information that encodes their position, and this is done by the method apply_rotary_pos_emb. For now just think that into the queries and the keys we have encoded some information which the attention mechanism will leverage to relate tokens to each other differently based on their position; we will see that later.
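The shape bookkeeping in code looks roughly like this; a self-contained sketch with made-up sizes, not the exact class code:

```python
import torch

def split_into_heads(x: torch.Tensor, num_heads: int, head_dim: int) -> torch.Tensor:
    # x: [batch, seq_len, num_heads * head_dim] -> [batch, num_heads, seq_len, head_dim]
    bsz, seq_len, _ = x.size()
    return x.view(bsz, seq_len, num_heads, head_dim).transpose(1, 2)

# Example: 8 query heads, 1 key/value head, head_dim 128.
q = split_into_heads(torch.randn(1, 5, 8 * 128), num_heads=8, head_dim=128)  # [1, 8, 5, 128]
k = split_into_heads(torch.randn(1, 5, 1 * 128), num_heads=1, head_dim=128)  # [1, 1, 5, 128]
print(q.shape, k.shape)
```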
So Suppose that we have already encoded the positional Information. So now we need to as you remember when we do work with the KV cache we pass only one single token as input to the layers of the Transformer and this single token is added to the KV cache in the keys and the values cache of this Particular layer then we retrieve the content of this KV cache which includes the newly added the token and all the previously saved token and then we use this Output of this KV cache to calculate the attention.
So let's implement this KV cache so it's very simple because it's only one method to implement which basically will just take the Single token that we are sending in which is this key states will add it to the key cache will take this value states which is one single token add it to the value cache and then retrieve all the content of the cache as Output so all the past token it has seen plus the current one So let's implement it and we go to the beginning of the file here Class KV cache.
Let's do it like this So we create a constructor as you can see it is a kind of a buffer where that includes one buffer for each layer of the model one for the keys and one for the values We also have this helper method that allow that tells us how many items the KV cache currently stores So if this KV cache does not contain any item we say zero if it contains something then we return What is the number of items it stores which as you remember when we add the something to the KV cache we are adding This tensor here, which is the key value states and value states which are tensors of this shape So batch size and number of heads sequence length and head dimension Which means that the sequence length is the second last dimension.
So that's why We return the second last dimensions to retrieve the sequence lengths currently stored in the KV cache We then implement the update method which is also very simple and I added some comments to it to make it simple So basically it means that it this will add the content of this key states and value states to the KV cache of this layer And then it will return whatever is stored for this layer So if we have never added anything to the KV cache of this layer, then we create it.
So we basically append this tensors It means that we have nothing else to concatenate it with However, if we otherwise we are we already have some tokens in the key cache and the value cache of this particular layer Then we concatenate whatever is already present with the newly incoming token along which dimension along the sequence dimension and the sequence dimension We saw before is the dimension -2.
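Putting that together, here is a minimal sketch of the whole class, matching the behavior just described (the exact method names are the ones we have been assuming):

```python
import torch
from typing import List, Tuple

class KVCache:
    # One list entry per decoder layer, each holding [B, num_kv_heads, seq_len, head_dim].
    def __init__(self):
        self.key_cache: List[torch.Tensor] = []
        self.value_cache: List[torch.Tensor] = []

    def num_items(self) -> int:
        # Number of tokens currently stored; the sequence length is dimension -2.
        if len(self.key_cache) == 0:
            return 0
        return self.key_cache[0].shape[-2]

    def update(self, key_states: torch.Tensor, value_states: torch.Tensor,
               layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        if len(self.key_cache) <= layer_idx:
            # First time this layer stores something: nothing to concatenate with.
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:
            # Append the incoming token(s) along the sequence dimension (-2).
            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
        # Return everything stored so far for this layer (past tokens + current one).
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```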
So we concatenate along dimension -2; after concatenating, we retrieve the whole content of the K and V cache and return it for the current layer, and this is what is happening here: we add the incoming key_states and value_states to the KV cache, then we retrieve them and use them to compute the attention. Now, remember that when working with the KV cache there are two phases. One phase is called prefilling, in which we have the prompt; the prompt in our case is the image tokens plus the user prompt, so what the user wants the model to do with this image, and it is a list of tokens. So these key_states and value_states will be a list of tokens; they will all be added to the cache for the first time, because initially the cache is empty, and they will be retrieved here. When we do token generation, we take the last token output by the model and add it, one at a time, to the KV cache, but we always retrieve the full content of the KV cache to compute the attention, because each query needs to attend to all the past keys, which are then used to compute the weighted sum over the values. Okay, what is the next part of the computation of the attention?
Well, here we have this repeat. We need this method called repeat_kv, which basically repeats the key and value heads that are missing with respect to the query heads. Let me explain it with the iPad, because it's much easier to draw than to explain with words. So let's go here. What happens with this repeat method is the following: we project the token through wk and wv, which gives a smaller token, and this gives us some benefit from the KV cache point of view, for example. But to compute the attention, query heads need to share key heads: the first two query heads share one key head, then the next two query heads share another key head, and so on. What we do is simply repeat the keys and values, because we are working with a naive implementation of the attention that does not really benefit from this optimization. So we just repeat the missing heads, as you can see here: we take the heads that are missing and repeat them to match the heads of the query, so that it's as if each query head had its own key head. This is because we are not writing a custom CUDA kernel for the computation of the attention, so we repeat and basically pretend grouped query attention never happened. If you use Flash Attention, for example, it actually leverages the reduced number of key and value heads to optimize the computation; here we are kind of reversing the effect of grouped query attention when calculating the attention, because we don't have a custom kernel that can exploit it by not copying the missing heads. The repeat_kv function is very simple, so we can implement it as well: it just repeats the heads that are missing for the keys and values. Let's implement it here. If we have a tensor with shape batch, number of heads, sequence length, head dimension, and we only need to repeat it once, we just return it, because there is nothing to repeat; otherwise, we introduce a new dimension, which is how many times we want to repeat the heads, and then we do a reshaping that repeats the heads that number of times. The repetition is actually done by the expand method here: we introduce a new dimension, which is the number of repetitions, and then we expand it. This expansion repeats whatever comes after these two dimensions n_rep times, and then we remove this helper dimension, the n_rep dimension, that we created only to repeat the number of heads. How do we do that?
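Here is a minimal sketch of repeat_kv doing exactly that:

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # hidden_states: [batch, num_kv_heads, seq_len, head_dim]
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states  # nothing to repeat
    # Add a helper dimension and expand it: [B, num_kv_heads, n_rep, S, D]
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    # Merge the helper dimension into the heads dimension: [B, num_kv_heads * n_rep, S, D]
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# Example: 1 KV head repeated to match 8 query heads.
k = torch.randn(1, 1, 5, 128)
print(repeat_kv(k, n_rep=8).shape)  # torch.Size([1, 8, 5, 128])
```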
As you can see, we merge the helper dimension by multiplying the number of repetitions by the number of key-value heads, so at the output of this method the number of heads is the same as the number of query heads. Let's go back: now these key_states and value_states have the same number of heads as the query, so we can compute the attention like we have always been doing, query multiplied by the transpose of the keys, divided by the square root of the head dimension, etc. We also add the attention mask. So we compute the attention weights with the standard formula, query multiplied by the transpose of the keys divided by the square root of d_head, where d_head is the number of dimensions managed by each head, and then we add the attention mask right before applying the softmax. The attention mask in our case is always made of zeros, because we don't have any padding, so we don't need to mask anything; and also during prefilling we don't mask anything, because we always let the user prompt, so the text prompt, attend to future tokens as well. Why? Because the PaliGemma authors made this decision: they decided that the user prompt, or task prompt, does not need to be causal, since it will never be generated by the model anyway; it will always be set by the user. So we apply the softmax and then the dropout, but we never use dropout, so this part is very simple: we apply the softmax row by row, then the dropout, which is always zero (and as you know dropout is only applied during training), so just ignore it like it's not there. Then the attention weights are multiplied by the value states, which gives the weighted sum we saw before, so each token becomes an aggregation of previous tokens based on the scores defined in the attention matrix. If you want to visualize it again, let's go here: when we do the multiplication with V, this output token, let's say this one here, is a contextualized token, and it will include information about three tokens, "I", "love", "pepperoni". It is a weighted sum of these three tokens based on the following weights: the token "I" contributes 20% of the information, the token "love" contributes 40%, the token "pepperoni" contributes 40%, and the last token contributes no information because it has been masked out. This is what happens when you multiply by V: you are doing a weighted sum using the attention weights as weights. Then we check the output shape, and that's fine, and we transpose back, like we did before, so that the sequence length is again the second dimension and the number of heads the third dimension; then we concatenate all the heads together, just like we saw before, so each token is back to the hidden size, where this hidden size is the concatenation of the outputs of all the heads. But if you just concatenate the outputs of the heads, each embedding is simply the independent computation of each head stuck together, so we need some kind of mixing mechanism, and this mixing is given by wo, which mixes all these dimensions with each other, so that the result of each head is blended with the others through this wo projection. In this way the output token of this multi-head attention is not just a concatenation of multiple independent heads, but something that also mixes their results. Then we return the result of this multi-head attention. Now, one thing that we have treated as a black box so far is the rotary positional encoding: we said we are somehow encoding the positions into these queries and keys, and the attention mechanism leverages that.
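Here is a rough end-to-end sketch of that attention computation (scores, mask, softmax, weighted sum, head concatenation, output projection), assuming the repeat_kv helper sketched above; it is illustrative, not the exact code of the class:

```python
import math
import torch

def eager_attention(query, key, value, attn_mask, o_proj, n_rep):
    # query: [B, num_heads, q_len, head_dim]; key/value: [B, num_kv_heads, kv_len, head_dim]
    key = repeat_kv(key, n_rep)        # match the number of query heads
    value = repeat_kv(value, n_rep)
    head_dim = query.shape[-1]
    # Attention scores: Q K^T / sqrt(head_dim), shape [B, num_heads, q_len, kv_len]
    attn_weights = torch.matmul(query, key.transpose(2, 3)) / math.sqrt(head_dim)
    if attn_mask is not None:
        attn_weights = attn_weights + attn_mask         # mask added before the softmax
    attn_weights = torch.softmax(attn_weights, dim=-1)  # row by row
    # Weighted sum over the values: [B, num_heads, q_len, head_dim]
    attn_output = torch.matmul(attn_weights, value)
    # Back to [B, q_len, num_heads * head_dim], then mix the heads with the output projection.
    bsz, num_heads, q_len, _ = attn_output.shape
    attn_output = attn_output.transpose(1, 2).contiguous().view(bsz, q_len, num_heads * head_dim)
    return o_proj(attn_output)
```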
Now it's time to expand on the rotary positional encodings and understand how they work. All right, let's talk about positional encodings. Traditionally we are used to positional encodings applied directly at the entrance of the transformer: we have our token ids, which indicate the position of each token in the vocabulary, we convert them into embeddings using the embedding layer, and then we add to these embeddings some other vectors that encode the position information of each token. Otherwise the model has no notion of position: as you saw before, each head just does a dot product of two tokens, and if the position information is not encoded in these two tokens, the dot product can only access the embeddings, so it has no notion of which token comes first and which comes later. To encode this information, traditionally we add a positional encoding to the embedding of each token, so that the embeddings themselves carry the position. In the original transformer paper they proposed the sinusoidal positional encodings, also known as absolute positional encodings, because they encode the absolute position inside each token: token number one gets a vector added that encodes position number one, token number five in the sentence gets position number five added to it, etc. What we use in most language models nowadays is the rotary positional encodings, which belong to the family of relative positional encodings, and they work as follows. Let's open the paper. They were introduced in the paper called "RoFormer: Enhanced Transformer with Rotary Position Embedding". The idea is that we do not add them directly to the embedding of each token; instead, they modify the attention mechanism in such a way that it takes the positions of the tokens into consideration, relating them differently based on their position. Let's see how they did it. In the paper they say: we have this multi-head attention mechanism that uses the dot product to relate tokens to each other; can we find an encoding of the embedding vectors of the tokens such that, when we do the dot product, which is an inner product (this sign here means the inner product), the result depends only on the right things? So: can we find an encoding f_q for the query and f_k for the keys that injects the position information into the embedding x_m for the query at position m and x_n for the key at position n, such that the dot product, this function g, depends only on the embedding of the first token, the embedding of the second token, and the relative distance m minus n between them? That's why they are called relative positional encodings: the attention mechanism is modified so that the dot product depends only on the two embeddings and their relative distance. So how do we encode this information inside the embeddings? They propose the solution first for the 2D case. Imagine we have an embedding vector made up of only two dimensions. To encode the position in this two-dimensional vector, we use a rotation matrix: if you have ever rotated a vector in 2D space, you multiply the vector by a matrix where the argument of the cosine and the sine is a multiple of an angle that defines by how much you want to rotate. So if we multiply the two dimensions of this vector by this rotation matrix, we are rotating the vector by an angle m times theta. The output of this operation is a 2D vector which encodes the position m, and when we do the dot product of two vectors encoded like this, the dot product is guaranteed to be a function of the embedding of the first vector, the embedding of the second vector, and the difference of the positions that were encoded into them. But usually an embedding is not a 2D vector: it's a multi-dimensional vector, maybe 1000 or 2000 dimensions. So they extend the 2D case to the general case, and in the general case, instead of using this 2D rotation matrix, we need a big rotation matrix for a d-dimensional vector.
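As a quick sanity check of that 2D claim, here is a tiny numerical demo (a sketch for intuition, not part of the model code): rotating a query by m times theta and a key by n times theta makes their dot product depend only on m minus n.

```python
import math
import torch

def rotate_2d(v: torch.Tensor, angle: float) -> torch.Tensor:
    # Standard 2D rotation of vector v by `angle` radians.
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, -s], [s, c]])
    return rot @ v

theta = 0.5
q = torch.tensor([1.0, 2.0])
k = torch.tensor([0.5, -1.0])

# Position pairs (2, 5) and (7, 10) have the same relative distance (3),
# so the dot products of the rotated vectors are identical.
d1 = rotate_2d(q, 2 * theta) @ rotate_2d(k, 5 * theta)
d2 = rotate_2d(q, 7 * theta) @ rotate_2d(k, 10 * theta)
print(d1.item(), d2.item())  # same value, up to floating-point error
```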
Here is the d-dimensional case. If you look at this matrix, you can see that it is a sparse matrix, which means it is mostly made up of zeros and only some elements are non-zero. If we encode the position using this matrix, we get a computation that satisfies the property we want: the dot product will only depend on the embedding of the first token, the embedding of the second token and the relative distance of the two positions that were encoded into them. But we would be doing a lot of unnecessary computations, because a lot of zeros would be multiplied by other elements, which just gives zero. In a sparse matrix, where most of the elements are zeros and only some are non-zero, you are doing many computations uselessly, because you already know in advance that they will be zero. So is there a better way to compute this encoding that avoids these unnecessary computations, given that we already know where the zeros are? Yes, there is: they propose another, more computationally efficient realization of this transformation. It says that if you want to encode the position information inside your embedding, you take the d-dimensional vector, where d can be 1000, 2000, whatever it is, suppose in our case it's 1024, and you multiply it element-wise by a vector constructed as follows, where the first element is cos(m theta_1), the second element is also cos(m theta_1), then cos(m theta_2), cos(m theta_2), and so on. Here m is the position you want to encode into this vector, and the theta_1, theta_2, ... are computed using the following formula, which they show here: theta_i = 10000^(-2(i-1)/d), where i goes from 1 to d/2. Let's go back. So what we are doing is multiplying each dimension of this vector by a cosine, where the argument of the cosine is a multiple of a base theta times the position of the token we want to encode, plus the dimensions of this vector, but rearranged and with changed signs, multiplied element-wise by the sine of the same arguments that we use for the cosine. If you encode your vectors like this, then when you do the dot product of two vectors encoded this way, the number that comes out is guaranteed to depend only on the embedding of the first vector (the information that was there before adding the positional encoding), the embedding of the second vector, and their relative distance. They also say that the rotary positional encodings have a decaying effect based on the distance between two tokens. As we know, the dot product is converted into a score by the softmax, so it tells us how intense the relationship between two tokens is: the bigger the dot product, the more that token contributes to the output contextualized embedding, as we saw before. With the rotary positional encodings, the dot product is modified in such a way that it tends to be high when two tokens are close, and as they move apart, so as the distance between the two tokens grows, the dot product decays, decreasing in magnitude; the paper gives an upper bound that decays with the relative distance between the two tokens. So, to recap: to encode the positional information of a token using rotary positional encodings, we take the vector of the token, we multiply it element-wise by the cosine vector constructed as above, plus the vector of the token itself but with its dimensions rearranged (first the second dimension with its sign changed, then the first dimension, then the fourth dimension with its sign changed, then the third dimension, and so on), multiplied element-wise by the sine vector built from the same theta values, calculated according to the formula above. Each of these sines and cosines works on an argument that is a multiple of the base theta times the position we want to encode into this token. The rotary positional encoding paper also explains what this encoding means: as you can see from this matrix, each pair of dimensions is rotated by the same angle. We have a token made up of many dimensions, and each pair of dimensions is treated like a 2D vector that gets rotated by an angle which is a multiple of the base angle, where the multiple depends on the position you want to encode. This is the meaning of the rotary positional encoding. To recap again: the rotary positional encodings modify the attention mechanism in such a way that the attention score depends on the relative distance between two tokens, and they also prove in the paper that this attention score decays as the distance between the tokens grows. Okay, now that we have seen how it works,
let's code it. In the code that we are going to write, you will see that I am going to use the HuggingFace implementation of the rotary positional encodings, and we will see that the rotary positional encoding implemented in the HuggingFace library is slightly different from the formula you see here. But according to the authors it results in the same effect, so they do it this way, and I will also share the post in which they explain why. It's a slight difference, and it results in a slightly different calculation, but the idea and the effect are the same. So let's do it. All right, let's implement this rotary positional encoding. The first thing we need to create is the GemmaRotaryEmbedding class; we can do it here, no problem. We pass some parameters: dim is the head dimension, because the rotary positional encodings modify the attention mechanism, the attention mechanism is performed independently for each attention head, and so each head has its own positional encoding applied to its tokens. So dim is set to the head dimension, the number of dimensions managed by each head in the multi-head attention. Then we have max_position_embeddings, which tells us the maximum number of positions we can encode; in the Gemma configuration it is set to 8192, and even though here it is initialized to a smaller default, it will be overwritten. Then we have the base parameter theta, which is set to 10,000, as in the original paper; let me show you from the paper: as you can see, it's 10000 to the power of minus 2i/d. Then we have this inverse frequency. The inverse frequency is just the formula you can see here, 10000 to the power of minus 2i/d, where i goes from 1 to d/2. So the formula we use to calculate it is this one: 10000 to the power of minus something, and when you have a negative power it means one over the same thing with the positive power, so that's why in the code we have 1 over 10000 to the positive power. Let me write it: when you have x to the power of minus 3, it means 1 over x to the power of 3. That's why you have 1 over 10000 to the power of something. And what is this something we are raising 10000 to? It's a list of numbers that goes from 0 to dim/2, playing the role of the i in 2i/d, divided by d, where d is the number of dimensions of the vector to which we apply the rotary positional encodings, according to this formula. In our case d is equal to the head dimension, because each head has the positional encodings applied to it. We use torch.arange to generate this list of numbers: basically it goes from 0 to dim, skipping every 2.
So basically it's a 0 to dim by skipping every 2 What else we need to do here I believe we need to go let me check Okay, so now we can implement the forward method of this So to calculate the rotary positional encodings we need to generate so now let me check the go back to the paper and then explain the Forward method so to calculate to apply the rotary positional encodings.
We need the vector itself Oops The vector itself and then we need to multiply each dimensions by some cosine and each dimensions Rotated and with its change its sign changed with some signs computed as follows so given some positions we can for each position m compute the cosine and the sine that will be Needed to multiply by these vectors and this is what we do in the forward method here We actually extract the cosines and the sines that will be applied to each tokens Depending on the positions of these tokens.
So for each token, we will have a different position So this m parameter indicates the position of the token So for each m we can compute the cosines and the sines and this is what we do in the forward method here So we take the inverse frequency we add another the Another dimension, which is I believe it's for the batch dimension And then we Disable the auto cast so the auto cast in torch is for mixed precision so I don't want to go too much into the detail of this stuff, but Mixed precision is basically when you train a when you train a model You don't have to work with the floating point 32 numbers always because the most modern gpus They also support working with the 16 bit numbers Which makes computations faster and also reduces the memory of these computations.
Of course you lose a little bit of precision, but for many operations you don't need that much precision. Automatic mixed precision, I think it's called, handles this for you: it uses the smaller precision when computing certain operations and the higher precision, so 32-bit, for others, such that we never lose much quality in the model. Here, for the rotary positional encodings, we want to retain the full precision, so we disable this autocast. Okay, so we are basically multiplying each frequency by each position that we want to encode, because as you can see from the paper we need to multiply this m by the base frequency. We already have the base frequencies in this inv_freq_expanded, so we multiply it by each m, and in this way we compute the arguments of the cosines and sines. Then we concatenate these arguments. Why? Because we have them for dim divided by two, so for half the vector, but we need them for the entire vector, and we concatenate them here. Now, this is actually different from what the paper does: in the paper we need to repeat each argument twice, once for each dimension of a pair, so for every two dimensions we need the same argument. What we are doing here with the concatenation is taking theta_1, theta_2, theta_3, theta_4 and then repeating them again, theta_1, theta_2, theta_3, theta_4, instead of doing theta_1, theta_1, theta_2, theta_2, theta_3, theta_3. The overall number of arguments we produce is the same, but instead of being interleaved like in the paper, they are repeated in two blocks. Why are we doing this? It's a very long story, but basically it looks like when HuggingFace converted the weights of models like LLaMA from the original pretrained checkpoint into the HuggingFace format, they permuted the query and key projections, so each dimension was permuted, and then, to accommodate this permutation, they compute the rotary positional encodings differently. The overall effect of this computation is the same as in the original paper, but they do this second permutation because one permutation was already done when converting the original pretrained model to the HuggingFace format. This issue is discussed in the HuggingFace transformers repository by a user who asked why the positional encodings are done differently than in the paper, and the HuggingFace authors explained that when they converted the weights from the original model they permuted the dimensions of wq and wk, which are the projection matrices used to compute the queries and the keys, exactly the tensors to which we apply the rotary positional encodings. So we need to do another permutation to counteract the effect of the first one, and that's why the computation we are doing does not exactly reflect the paper. Let's go forward. We have created the arguments of the cosine and the sine, so now we compute the cosine and the sine with these arguments: when you call the cosine function on a tensor, it computes the cosine of each element of the tensor, and the same goes for the sine. The output of this forward method is therefore the two things we need, in the paper, to apply the rotary positional encoding to each vector, and we have computed the cosine and the sine for each position m in our sequence. Let me delete this stuff, otherwise it stays in my notes forever. Let's go forward: now we need to implement another method called apply_rotary_pos_emb, which we include here and which I also copied from HuggingFace. What it does, basically, is add another dimension, the head dimension, to these cosines and sines that we precomputed. Where did we precompute them? We computed them here: as you can see, we extract the cosines and the sines using the rotary positional encoding class that we created before; value_states is not really used, it only serves to extract the data type of the resulting tensor, and the position ids tell us which positions to encode, so the m parameter of each of the arguments of the cosine and the sine. We compute the cosines and the sines and then use them to apply the rotary positional encoding to the queries and the keys, which gives us the output queries and keys with the rotary positional encoding applied.
So now we are implementing this method here Which will encode the queries While multiplying the dimension of the vector query with the cosines, which is this part of the formula So as you can see the vector multiplied by the cosine And then the rotated vector so with its dimensions changed and the signs changed multiplied by the sign Which is this part of the formula here.
We need to implement this method here rotate half Which is again not equal to what is in the paper because we need to change the we need to permute the dimensions because the original vectors so the q and k are permuted by This query projection and this key projection This rotate half method basically will take the first part of the Embedding and then it will take the second part of the embedding with its sign changed.
And it concatenates them. This is different from the paper: in the paper we would build [-x2, x1, -x4, x3, ...], pairing adjacent dimensions. Here, instead, imagine the token embedding is made up of 1024 dimensions: what we are doing is taking dimensions 513 up to 1024 with their sign flipped, followed by dimensions 1 up to 512. So it is different from the formula in the paper, but it works out to the same result because of the permutation that was applied to the wq and wk projections. Okay, now we have also implemented the rotary positional encodings, which inject the position information right before the attention, so that the attention mechanism reflects this encoded information inside each token through the dot product.
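For reference, here is roughly what these two HuggingFace-style functions look like; this is a sketch, and the unsqueeze_dim argument follows the HuggingFace convention of adding the head dimension:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Instead of pairing adjacent dimensions as in the paper, take the two halves:
    # [x_1 .. x_{d/2}, x_{d/2+1} .. x_d] -> [-x_{d/2+1} .. -x_d, x_1 .. x_{d/2}]
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim: int = 1):
    # Add the head dimension so cos/sin broadcast over every attention head
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    # x_rotated = x * cos + rotate_half(x) * sin, applied to both queries and keys
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```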
What else do we need to build here? I believe we have everything. Let me do a very simple check. I think we have everything. Guys, now we can proceed to the inference code: we need to use these classes that we have built to actually run inference on something. Let's do it. All right guys, let's go to the inference code.
Let's create a new file called inference.py. I have also prepared the test image that I will be using to run inference with the language model: I will ask the model what this building is, and it should tell me the name of the building.
You can prepare any image that you like. So let's start writing some code in inference.py. I will copy a large amount of code, because there is not much machine learning here. Basically I am using a library called fire, so let's import some stuff first: PIL for image loading, torch, and fire. Fire is a library that lets you pass command-line arguments to a script as parameters of a function, so it automatically parses the command-line arguments for you. What I need to pass on the command line is: the model path (so where the weights of the model are), the prompt that we will use to condition the model, the image that we will use as a condition for this prompt, the maximum number of tokens to generate, the temperature that we want to apply (we will see later what it is), the top p (also later), do sample if we don't want to use the greedy strategy, and a flag in case we don't want to use CUDA or MPS (if you are on a MacBook), which forces the CPU as the device for the computation in the neural network. The first thing this method does is print which device we are using. Then it loads the model with a load_hf_model function that we will implement later: given the path and the device, it loads the HuggingFace weights by copying each tensor into the right position, and because we kept the parameter names the same as in the HuggingFace model, we don't need any name conversion. Then we take the input and process it with the PaliGemma processor, which takes the tokenizer, the prompt and the image, and transforms them into the input for our Gemma model, which will then decode it; we will do that in the test_inference method. For now we are just creating the PaliGemma processor and the model itself using this load_hf_model function.
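As a rough sketch, the skeleton of inference.py can look something like the following; load_hf_model, PaliGemmaProcessor and test_inference are the pieces we build in this video, and the parameter names and config attributes here are assumptions, not the exact code:

```python
import fire
import torch

def main(
    model_path: str = None,
    prompt: str = None,
    image_file_path: str = None,
    max_tokens_to_generate: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.9,
    do_sample: bool = False,
    only_cpu: bool = False,
):
    # Pick the device: CUDA if available, otherwise MPS on Apple silicon, otherwise CPU
    device = "cpu"
    if not only_cpu:
        if torch.cuda.is_available():
            device = "cuda"
        elif torch.backends.mps.is_available():
            device = "mps"
    print(f"Device in use: {device}")

    # load_hf_model, PaliGemmaProcessor and test_inference are defined elsewhere in this project
    model, tokenizer = load_hf_model(model_path, device)
    model = model.to(device).eval()

    # The processor needs the tokenizer plus a couple of values from the vision config;
    # the exact attribute names below are assumptions
    num_image_tokens = model.config.vision_config.num_image_tokens
    image_size = model.config.vision_config.image_size
    processor = PaliGemmaProcessor(tokenizer, num_image_tokens, image_size)

    test_inference(
        model, processor, device, prompt, image_file_path,
        max_tokens_to_generate, temperature, top_p, do_sample,
    )

if __name__ == "__main__":
    fire.Fire(main)
```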
We said we would create load_hf_model later; actually, let's do it now. So let's create a new file called utils.py, and this utils file needs to have the following code. It imports some stuff and then it implements the load_hf_model function. It loads the tokenizer, which, as I said, will be the HuggingFace one, so we will not be coding the tokenizer. But the weights of the model we can load ourselves: if you go to the repository of the model on HuggingFace, you will see that the model is stored as a list of safetensors files, and each of these safetensors files is actually a dictionary that contains part of the weights of the model. You can click on this icon here and it will show you what each of them contains. As you can see, this one contains the multi-modal projector weight and bias; this one contains the vision tower embeddings, encoder layer one, layer two, layer three, etc. for all the layers, and for each layer it contains the wq, wk and wv projections with their weights and biases, the weight and bias of the layer normalization, etc. Each file is a dictionary holding some part of the weights of the model. So what I am doing here is: I find all the safetensors files, I load each of them into a dictionary, and then I use this combined dictionary to load the state dict of our neural network. I also create the model using the config.json file that is present in the repository of the HuggingFace model (every HuggingFace model has this config.json): we build the configuration of our model from this file. Then I call tie_weights, which copies the weights of the embedding layer to the language modeling head, the linear layer that projects the embeddings into logits. Finally we return the model and the tokenizer.
So here there is no machine learning: I am just loading the weights of the model from the safetensors files, creating the model using the configuration saved in config.json, loading the state dict (which means loading the weights into this model class), tying the weights, and returning the model and the tokenizer.
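Here is a sketch of what this loading code amounts to, assuming the PaliGemmaConfig and PaliGemmaForConditionalGeneration classes we built earlier; the class names and the config layout are assumptions based on the description above:

```python
import glob, json, os
from safetensors import safe_open
from transformers import AutoTokenizer

def load_hf_model(model_path: str, device: str):
    # The tokenizer is the only component we reuse from HuggingFace as-is
    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="right")

    # Each .safetensors file is a dictionary of tensors; merge them all into one dict
    tensors = {}
    for file in glob.glob(os.path.join(model_path, "*.safetensors")):
        with safe_open(file, framework="pt", device="cpu") as f:
            for key in f.keys():
                tensors[key] = f.get_tensor(key)

    # Build the model from config.json, then copy the weights in by name
    with open(os.path.join(model_path, "config.json"), "r") as f:
        config = PaliGemmaConfig(**json.load(f))
    model = PaliGemmaForConditionalGeneration(config).to(device)
    model.load_state_dict(tensors, strict=False)

    # Share the embedding weights with the language modeling head
    model.tie_weights()
    return model, tokenizer
```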
So we have the model and the tokenizer. We have created the processor So we have initialized it then we need to launch the inference Let's see how the inference works. So let's go back to here This test inference is also not so hard, but we need to do some Explanation on some parts So what we are doing is first of all, we take this Inputs so the image and the prompt Which is a text and we pass it to the processor and the processor will give us As you can see from the processing polygamma, it will return us the pixel values And the input ids and the attention mask So we get this These values from the processor.
So we need to create this function, which is a simple helper that gets the output from the processor. We load the image and we create the prompt; since the processor expects the text as a list and the image as a list (even if we only ever work with one of each, so lists of size one), we pass them that way. We take the output of the processor, which is the input ids, the attention mask and the pixel values of the image, and we move each of them to the right device. move_inputs_to_device is also a simple function that moves each tensor to the device specified by the device parameter and then returns it. So now we have the input ids, the attention mask and the pixel values.
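A sketch of these two helpers, with the processor call signature as an assumption, could look like this:

```python
from PIL import Image
import torch

def move_inputs_to_device(model_inputs: dict, device: str) -> dict:
    # Move every tensor returned by the processor onto the chosen device
    return {k: v.to(device) for k, v in model_inputs.items()}

def get_model_inputs(processor, prompt: str, image_file_path: str, device: str) -> dict:
    image = Image.open(image_file_path)
    # The processor expects a list of prompts and a list of images,
    # even though we only ever pass one of each
    model_inputs = processor(text=[prompt], images=[image])
    return move_inputs_to_device(model_inputs, device)
```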
We create a KV cache, which is initially empty, and then, based on how many tokens we want to generate, we launch the inference loop. At the beginning the input ids only include the prompt, so the image tokens and the text tokens, without of course any output token, because we need to generate the output. So what we are doing at the first iteration of this for loop is the prefilling: the KV cache is empty, the input ids contain the image placeholder tokens and the text tokens, the pixel values contain the image loaded as a numpy array, and the attention mask is just a list of ones, because we are never working with padding. Now the model itself, so the PaliGemma model, which is this class here, will merge the image features: it runs the pixel values through the image encoder, which returns some image features, and we replace the image placeholder tokens with the image features extracted from the image encoder.
So now we have a sequence of embeddings where the first ones are the image embeddings, followed by the text embeddings, and we send it to the language model for decoding. Let's go back to the inference. The first iteration of this for loop is the prefilling, which means that the queries, keys and values have the same sequence length and contain the tokens of the prompt. The output of the prefilling is a sequence of embeddings, which we project into logits, but we only take the last logit to predict the next token. That's why we take the logits and keep only the last position along the sequence dimension. Now, what do we do with these logits?
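Before answering that, here is a structural sketch of the loop up to this point; the KVCache class and the dictionary returned by the model come from the code we wrote earlier, so treat the exact keys and keyword arguments as assumptions:

```python
# Structural sketch of the first part of test_inference (names are assumptions)
model_inputs = get_model_inputs(processor, prompt, image_file_path, device)
input_ids = model_inputs["input_ids"]
attention_mask = model_inputs["attention_mask"]
pixel_values = model_inputs["pixel_values"]

kv_cache = KVCache()          # empty: the first iteration below is the prefilling
generated_tokens = []

for _ in range(max_tokens_to_generate):
    outputs = model(
        input_ids=input_ids,            # whole prompt (image placeholders + text) on the first pass
        pixel_values=pixel_values,      # the raw image, turned into image features inside the model
        attention_mask=attention_mask,  # all ones: we never use padding
        kv_cache=kv_cache,
    )
    kv_cache = outputs["kv_cache"]
    # The logits of the last position are the prediction for the next token
    next_token_logits = outputs["logits"][:, -1, :]
```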
Now let's go to the iPad, because I want to explain how top p works. Let's open a new page. When you generate logits, they basically correspond to a kind of distribution after you apply the softmax. So the logits are a vector.
Let me draw it: it is a vector whose number of dimensions is equal to the vocabulary size, so you have one number for each token in the vocabulary, and it is an indication by the model of what it thinks the next token should be.
To understand what the next token is, we apply the softmax, which converts each of these numbers into a probability score, so something that sums up to one and is always non-negative. We could, for example, take the highest one as the next token; another way is to use sampling.
So this is a list of numbers, one for each position in the vocabulary. For example, for the token "hello" the model gives some score, for the token "pizza" another score, for the token "car" yet another score, and so on. With sampling, we sort all of these numbers in decreasing order and then take the top ones whose probabilities sum up to a given value: that is what top p does, for example with a top p of 0.9.
Suppose that to the token "hello" the model has assigned a probability of, let's say, 0.2, to "pizza" 0.5, to "car" 0.1, then to some other token 0.05, and to another one 0.1 (I am not sure these sum up to one, but okay), and then some more tokens. We sort them in decreasing order, which means we sort them like this.
So the first one should be "pizza" with 0.5, then we have "hello" with 0.2, then "car" with 0.1, then something else with 0.1, then something else with 0.05, and so on. Now, with a top p of 0.9, well, 0.9 is a bit too much for this small example, so let's say a top p of 0.7. We will sample from this distribution by considering only the tokens whose cumulative probability reaches this value: we take the tokens, in order, summing up their probability scores.
They sum up to this amount, and then we sample from them, a kind of weighted sampling: for example, with 0.7 we would consider only these two tokens. We then rescale these numbers so that they sum up to one again: dividing each by their sum, 0.7, the 0.5 becomes roughly 0.71 and the 0.2 becomes roughly 0.29. Then we sample from this reduced distribution, so roughly 71% of the time we will choose this token and 29% of the time we will choose the other one.
This is the meaning of top p: among all the tokens, we sort them, we keep only those whose cumulative probability score reaches the top p value, and then we sample from them as if they were a distribution of their own. Before sampling we renormalize them, because they need to sum up to one to be a valid distribution. This is what we do with top p; with greedy, instead, we just take the highest one and that's it.
With top p we are actually sampling from this distribution, but we are not considering every token, because for some of them the model is basically saying "don't use this token": the probability score assigned to them is very, very low, so why should we even consider them?
That's why we use top p: we only consider the most likely tokens chosen by the model, so we don't introduce noise from tokens the model considers very unlikely. So what we are doing here is sampling with top p if we decided to sample; otherwise we just take the token with the highest probability score, which is the greedy strategy. There is also this thing called temperature.
So what is temperature? Temperature basically means that we divide the logits before applying the softmax, as you can see here. Before the softmax these numbers are not probability scores, so they don't sum up to one: for example, one may be 10, another 7, another 5, another 2, another 1, another 0.1, and so on. When we apply a temperature greater than one, we make the differences between them smaller. So if the model is giving us a distribution telling us that this token is likely, this one is much more likely, and these are less likely, what we are doing with the temperature is reducing the gap between these peaks, so that when we sample we are more likely to choose more diverse tokens. With the temperature, the "hello" token, instead of being chosen roughly 29% of the time, might be chosen more often.
Let's say it goes to 33% of the time, while the other one drops to 67%. So basically we are introducing some noise into the choice we make, but restricted to the top tokens selected by the top p (0.7 in our example). I know it's a little difficult to visualize, but with the temperature we are making it more likely to pick more diverse tokens, because we are reducing the gaps between the probability scores of the tokens.
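As a tiny self-contained illustration of this effect (not code from the video), you can see how dividing the logits by a larger temperature flattens the softmax distribution:

```python
import torch

logits = torch.tensor([10.0, 7.0, 5.0, 2.0, 1.0, 0.1])

for temperature in (1.0, 2.0, 5.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    # A higher temperature flattens the distribution (smaller gaps between tokens)
    print(temperature, [round(p, 3) for p in probs.tolist()])
```

With a temperature below one the effect is the opposite: the gaps get larger and sampling behaves closer to greedy.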
Then we do the sampling with top p, which, as we saw before, is a simple method: we sort in descending order and then sample from the distribution. So let's actually implement it, one piece at a time. sample_top_p we can put here: as you can see, we sort in descending order, we compute the cumulative sum, and we keep only the tokens whose cumulative sum stays within the p parameter; we mask out all the others and normalize again so that they sum up to one, since we removed some tokens from the distribution. Then we sample from this distribution using multinomial and take the token chosen by this sampling operation. So we have applied the top p, and now we know what the next token is: we take it and add it to the generated tokens array. If the next token corresponds to the stop token, which is the end-of-sequence token, we stop the generation; otherwise we keep generating. Then, as you can see, for the next iteration the input ids become just this one token, because we are using the KV cache: at each inference step we use as query only the last predicted token. This is what we are doing here.
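For reference, a minimal sample_top_p along the lines of what we just described could look like this; the exact handling of the cumulative-sum boundary may differ slightly from the version used in the video:

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    # probs: (batch, vocab_size), already softmax-ed
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # Mask out tokens whose cumulative probability (excluding themselves) exceeds p
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    # Re-normalize so the surviving tokens form a distribution again
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    # Sample one token and map it back to the original vocabulary index
    next_token = torch.multinomial(probs_sort, num_samples=1)
    next_token = torch.gather(probs_idx, -1, next_token)
    return next_token
```

Note that the renormalization here is a plain division by the sum of the surviving probabilities, not another softmax.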
So at the second iteration of this for loop, our input ids become a single token. At the first iteration we do the prefilling, so the input ids are all the tokens of the prompt, the image tokens and the text tokens describing what we want to do with this image; at the second iteration the input ids are only one token. How can the model work with only one token? Because the model always has access to all the previous keys and values, which have been saved in the KV cache.
When we calculate the attention, the model adds this single token to the KV cache, retrieves whatever is inside the KV cache, and uses it to calculate the attention. This way we generate tokens. We keep growing the attention mask by one element at each step, because we want to attend to all the past tokens in the KV cache and we don't have any padding. Usually you are used to thinking of padding as something added on the right, but padding can also be done on the left; here we don't have any padding tokens on the left either, so the attention mask is always made up of ones, and in my implementation I am never working with padding.
I am not never working with the paddings We generate these tokens we concatenate them together because we save them into an array So we need to generate a tensor which is then sent to the tokenizer for decoding and then we print print the output of the model And now we can finally run the generation.
And now we can finally run the generation, so the inference. I will copy the launch script that I have already prepared, and I have already saved the weights of the model locally. If you want to run this code, you need to clone the repository of this model locally and set the path to where you saved it. You give the prompt (my prompt is "this building is", and the model should complete it by telling me what this building is) and the image file, which is this building here.
It's a building in Xi'an, China. Then we set the temperature and the top p, and we do not sample: I want the greedy strategy, and I also want to use CUDA. We run the script like this, so now let's run it. I hope there are no problems.
So let's launch the inference and see. All right guys, after I launched the inference my computer went a little crazy, so I had to switch back to using the CPU, and then it worked; I don't know why, but sometimes CUDA doesn't work on my machine and it freezes the whole computer. If you run the inference using the code that we have written, it should give this output: "this building is the oldest clock tower in the world". I actually don't know whether it is the oldest clock tower in the world, but this building is the Zhonglou, the Bell Tower of Xi'an.
So it's a very famous building, and it looks like the output is correct. Thank you guys for watching this video. I know it has been a very, very long journey, and I had to do a lot of explanations and sometimes improvise, so it is possible that there are some imprecisions in the way I explained things, because I am not reading from a transcript for everything I talked about. Sometimes I just look at the code and try to come up with the right words to explain it, and of course you cannot always find the right words immediately. Hopefully at least 90% of the content is correct, and the other 10% may have some noise; I will try to clarify anything I did not explain correctly in the comments or in the description of the video. Thank you guys for watching this video.
Please share it with your friends, like it if you liked it, and subscribe to my channel. A lot of people have asked me what is the best way to support me economically, but, thank god, I don't need any economic support for now; if I ever needed it, I would be the first one to ask. If you want to help someone economically, there are many people in the world you can help, for example people in war areas in Palestine and Ukraine.
You can help them economically. As for me, I just need you guys to follow me and share my videos: this is the best way to help me out. Also, I work at a company called Writer, and my team is currently hiring. We are looking for amazing researchers, and you can find more information in the description of the video. We train our own models.
We have plenty of GPUs, so if you are a researcher working on language models, or on any area of machine learning, feel free to send your resume. Thank you guys and have a nice day.