Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation
Chapters
0:00 Introduction
5:52 Contrastive Learning and CLIP
16:50 Numerical stability of the Softmax
23:00 SigLip
26:30 Why a Contrastive Vision Encoder?
29:13 Vision Transformer
35:38 Coding SigLip
54:25 Batch Normalization, Layer Normalization
65:28 Coding SigLip (Encoder)
76:12 Coding SigLip (FFN)
80:45 Multi-Head Attention (Coding + Explanation)
135:40 Coding SigLip
138:30 PaliGemma Architecture review
141:19 PaliGemma input processor
160:56 Coding Gemma
163:44 Weight tying
166:20 Coding Gemma
188:54 KV-Cache (Explanation)
213:35 Coding Gemma
232:05 Image features projection
233:17 Coding Gemma
242:45 RMS Normalization
249:50 Gemma Decoder Layer
252:44 Gemma FFN (MLP)
256:2 Multi-Head Attention (Coding)
258:30 Grouped Query Attention
278:35 Multi-Head Attention (Coding)
283:26 KV-Cache (Coding)
287:44 Multi-Head Attention (Coding)
296:00 Rotary Positional Embedding
323:40 Inference code
332:50 Top-P Sampling
340:40 Inference code
343:40 Conclusion
00:00:00.000 |
Hello guys, welcome back to my channel. Today we are going to code a visual language model from scratch 00:00:04.720 |
Now, first of all, what do I mean by a visual language model? 00:00:08.000 |
And what do I mean by coding it from scratch? 00:00:10.240 |
The visual language model that we will be coding is called PaliGemma, and it's a visual language model whose weights came out 00:00:18.960 |
some time ago, while the paper came out around two weeks ago 00:00:22.960 |
So we will be coding it from scratch, meaning that we will be coding from scratch the vision encoder 00:00:29.360 |
(you can see it here), the linear projection, which is just a linear layer, 00:00:35.680 |
the language model, which is a transformer, how to combine the embeddings of the image tokens with the text tokens 00:00:42.160 |
And of course how to generate the output using this conditioning. So first of all, what is a visual language model? 00:00:48.160 |
First of all, well visual language model is a language model that can extract information from an image 00:00:52.960 |
So if we have an image like this, for example and a prompt like this, for example, where is the photographer resting? 00:00:59.120 |
The visual language model can understand where this photographer is resting by looking at the image 00:01:04.640 |
And generating a response in this case. The response is in a hammock under a tree on a tropical beach 00:01:10.080 |
The topics of today basically are first of all, we will be talking about the vision transformer 00:01:15.760 |
Which is the vision encoder that we'll be using to extract information from this image 00:01:19.760 |
But this vision transformer has been trained in a particular way called contrastive learning 00:01:25.280 |
So we will be talking a lot about contrastive learning, because I want to review not only what contrastive learning is but also the models built on it: 00:01:32.400 |
the first well-known model is CLIP, which was then turned into SigLIP by Google 00:01:40.480 |
Then we will be coding the language model itself 00:01:43.040 |
So the Gemma language model, and how to combine the embeddings of the vision model and the language model 00:01:52.640 |
And we will be talking about the KV-Cache, because we want to run inference 00:01:58.320 |
in an optimized way, and the best way of course is to use the KV-Cache 00:02:05.200 |
Not only will we be coding it, I will also explain step by step how it works 00:02:08.800 |
The rotary positional encodings because we need them for the language model and the normalization layers because we have them in the vision model 00:02:15.760 |
And also the language model. We will be seeing what is the batch normalization, the layer normalization and the RMS normalization 00:02:21.520 |
I will be explaining all the math behind them 00:02:23.520 |
In this video i'm also using a slightly different approach at teaching let's say 00:02:28.640 |
Which is by drawing: I will be drawing every single tensor operation that we'll be doing, especially in the attention 00:02:34.800 |
Mechanism because I want people to not only look at the code and hope they get something 00:02:41.920 |
But actually I want to show each single tensor how it's changing by drawing it from scratch 00:02:48.640 |
I think this helps better visualize what happens in the transformer model, especially during the attention mechanism 00:02:54.340 |
So we know what each view operation each reshape operation that we are doing to each tensor and also the matrix 00:03:01.360 |
Multiplications that we are doing so we can visualize what happens to the tensors itself 00:03:05.680 |
What are the prerequisites for watching this video? 00:03:09.120 |
Well, you have a basic knowledge about the transformer. You don't have to be a master about it 00:03:14.880 |
It's better if you have watched my previous video on it 00:03:16.960 |
Which will give you the background knowledge to understand this video and you have a basic knowledge of neural networks 00:03:22.320 |
So at least you know, what is a loss function, you know, what is a linear layer? 00:03:25.440 |
And at least you know, what is backpropagation you don't need to know how it works or the mathematics behind it 00:03:32.560 |
But at least you know that we train models using backpropagation 00:03:35.460 |
Having said that guys, let's jump to work. So the first part I will be explaining is the vision transformer 00:03:44.160 |
So for this vision encoder we will be seeing what the contrastive part is about 00:03:47.840 |
and we will be coding it and then we will move on to how to combine the 00:03:52.960 |
Embeddings of the image tokens and the text tokens. The only part that we will not be coding is the tokenizer 00:03:59.700 |
Because I believe it's a separate topic that deserves its own video. So hopefully I will make another video about it 00:04:08.160 |
All right guys, before we go deep into each of these topics, let me give you a little 00:04:14.800 |
speech, actually. We will be exploring a lot of topics, like a lot of topics 00:04:20.800 |
We will be reviewing for example each of the single 00:04:23.600 |
Operations that we do in the attention mechanism and we will be looking at it from the code point of view 00:04:28.880 |
But also from the concept point of view and from the tensor operations point of view 00:04:34.640 |
There may be some topics that you are already familiar with and that's perfectly fine 00:04:39.120 |
There are some others that you are not familiar with and that's also perfectly fine because I will be explaining each topic multiple times 00:04:48.320 |
For example, we will be implementing the attention mechanism at least twice 00:04:50.960 |
So if you don't understand it the first time along with the code, then you will have another time to 00:04:56.080 |
Understand it and with a different explanation 00:04:59.520 |
And the same more or less goes for all the other topics. For example, we will be first introducing the 00:05:04.880 |
Normalization in one part and then I will review again the normalization 00:05:09.140 |
The positional encoding done in one way and then we will see another type of positional encoding 00:05:13.760 |
So don't worry if you don't understand everything at the beginning because I will be reviewing anyway each topic multiple times 00:05:23.600 |
So if there is some topic that I couldn't explain because of lack of time, 00:05:27.200 |
(for example, I will not be explaining how convolutions work because there are plenty of videos on how convolutions work) 00:05:32.480 |
you can pause the video, watch a five-minute video on how a convolution works, and then come back to this video 00:05:40.560 |
The second thing is: always write down all the code that I will be showing you, so write it 00:05:46.400 |
Line by line character by character because that's the best way to learn. So now let's get started 00:05:52.880 |
Let's start with the first part. So the first part we will be talking about is this contrastive vision encoder 00:05:58.400 |
Which is something that takes as input an image and converts it into an embedding, 00:06:03.700 |
actually a series of embeddings, as we will see, one for each 00:06:07.360 |
block of pixels of this image. So basically our image will be 00:06:12.320 |
split into blocks of pixels like this, into a grid, and each cell of this grid will be converted into an embedding, as you can see here 00:06:29.040 |
These image embeddings will then be combined with the text tokens embeddings; as you know, each token is converted into what is known as an embedding, 00:06:35.040 |
which is a vector of a fixed size. They will be concatenated and sent to the transformer, which will basically attend to these 00:06:41.520 |
Image tokens as a condition to generate the text. So this is called conditional generation 00:06:48.800 |
But okay, we will explore all this stuff here 00:06:51.760 |
Let's talk about this vision encoder now the vision encoder 00:06:55.200 |
First we need to understand why it's called a contrastive vision encoder, and to understand why it's contrastive 00:07:02.160 |
We need to understand what is contrastive learning 00:07:04.240 |
So let's go back to another slide, which is this one 00:07:13.600 |
Imagine for now, we will consider the image encoder as a black box and later 00:07:17.840 |
We will transform this black box into something more concrete 00:07:23.600 |
You go to the internet and when you go on wikipedia 00:07:26.260 |
You see an image and when you see an image there is always a description of what is inside that image 00:07:31.680 |
If you use a crawler you can crawl all of these images with the corresponding descriptions 00:07:37.460 |
And this will produce a dataset of images along with their descriptions 00:07:42.560 |
Now, for now, imagine we have a text encoder, which usually is a transformer model 00:07:50.400 |
And then we have an image encoder which most of the cases it's a vision transformer 00:07:58.560 |
So it's something that takes as input an image 00:08:01.940 |
and produces an embedding representation of this image 00:08:07.040 |
And if you feed a list of images, it produces a list of embeddings one corresponding to each image. What is this embedding? 00:08:13.920 |
It's a vector that captures most of the information of this image 00:08:17.600 |
And we do the same with this text encoder. So the text encoder is a transformer model that produces a series of embeddings. We will 00:08:27.120 |
But imagine you have this text encoder that given a text produces a single embedding of a single text 00:08:33.040 |
But if you feed it a list of text it will produce a series of embeddings each corresponding to one single text 00:08:42.240 |
The data set that we were talking about before which is the data set of images along with the corresponding descriptions 00:08:48.420 |
So imagine we feed this data set of images along with the corresponding description to the image encoder and respectively to the text encoder 00:08:57.520 |
It will produce a list of image embeddings and a list of text embeddings 00:09:02.580 |
Now, what do we want these embeddings to be? Of course, we want the embedding 00:09:08.980 |
Of the first image to be representative of that image 00:09:12.740 |
So we want this embedding to capture most of the information of that image 00:09:16.500 |
and of course, we want the embedding of the text number one to be 00:09:20.180 |
A vector that captures most of the information about that text 00:09:26.560 |
Moreover with contrastive learning we don't want only to capture information about the image or the text 00:09:33.200 |
But we also want some properties, and the property that we want from these embeddings is this: when you do the dot product of the embedding of an image with the 00:09:45.520 |
embedding of the corresponding text, it should give a high value for this dot product 00:09:51.840 |
And when you do the dot product of an image with a text that is not the corresponding one 00:09:56.880 |
It should produce a low number for this dot product 00:09:59.520 |
So basically with contrastive learning what we do we take a list of images 00:10:04.320 |
We take a list of text which is the corresponding text one for each of these images 00:10:08.880 |
So imagine that the image number one correspond to the text number one the image number two correspond to the text number two, etc 00:10:16.400 |
We encode them into a list of embeddings and then we want to train 00:10:20.800 |
This model so this text encoder and this image encoder to produce embeddings in such a way 00:10:26.880 |
That when the dot product of the image with its corresponding text is done 00:10:31.600 |
It should produce a high value and when you do the dot product of an image with a not corresponding text 00:10:36.960 |
For example i2 with text3 it should produce a low value 00:10:42.640 |
What we can do is basically we take this text embeddings, which is a list of embeddings 00:10:47.520 |
We take this image embeddings, which is a list of vectors 00:10:50.660 |
We do all the possible combinations of dot products 00:10:53.680 |
So the image number one with the text number one, image number one with the text number two, image number one with the text number three, etc. 00:11:00.480 |
Then we do all of them also for the text number one: 00:11:03.520 |
so the text number one with the image number one, text number one with the image number two, text number one with the image number three, etc. 00:11:10.240 |
And then we want to find a loss function that forces 00:11:13.520 |
These dot products to be high so that each text with its corresponding image to be high 00:11:18.880 |
While all the other possible combinations to be low in value 00:11:22.560 |
And we do that basically by using what is known as a cross entropy loss. So 00:11:29.120 |
To understand why we use cross entropy loss. We need to explore how language models are trained and we will do that very briefly 00:11:38.160 |
To not get us confused. So when we train a language model, we do so using what is known as the next token prediction task 00:11:45.680 |
Imagine we want to train a language model on the following sentence. So I 00:11:58.480 |
How do we train such a language model? Well, we give a prompt to this language model for now 00:12:15.760 |
The language model will produce a series of embeddings 00:12:18.580 |
Which are then converted into logits. So what are the logits? The logits are a vector, where each number indicates 00:12:27.200 |
what score the language model has assigned to each token being the next token, 00:12:32.560 |
among all the tokens in the vocabulary. So for example, imagine this first number here corresponds to the token "hello", 00:12:43.120 |
the second number here corresponds to the token, let's say, "pizza", 00:12:46.640 |
the third corresponds to the token "car", the fourth to some other token, etc. 00:12:54.800 |
Which one we want to be the next token? Of course, we know that the next token is a pizza 00:12:59.680 |
So we want the token number pizza to be high and all the other tokens to be low in value 00:13:04.480 |
So we use the cross entropy loss basically to make sure that the next token is pizza. So how do we do that? Basically we 00:13:13.040 |
Language model will output a list of numbers and we force the language model 00:13:17.200 |
To produce the following output. So pizza should be one and all the others should be zero 00:13:28.880 |
So basically what the cross entropy loss does is it takes a vector and converts it into a distribution 00:13:34.900 |
with the softmax function, and then we compare it with a label and we force the output to be equal to the label, 00:13:45.200 |
so that after the training the model learns to generate a distribution in which pizza is given a high number and all the others 00:13:52.320 |
Are given a low number and this is exactly the same that we do here for contrastive learning 00:13:59.680 |
To force for example in this column here only this number to have a high value and all the others to have a low value 00:14:08.480 |
Only this number to have a high value and all the other number in this 00:14:11.920 |
Row to have a low value and for example for this row 00:14:14.560 |
We want the second item to have a high value and all the others to have a low value, etc, etc 00:14:22.480 |
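To make the next-token idea above concrete, here is a minimal sketch (not the video's code; the vocabulary and logit values are made up just for illustration) of how cross entropy is applied to a model's output scores:

```python
# A minimal sketch of next-token prediction with cross-entropy (toy values, not the video's code).
import torch
import torch.nn.functional as F

vocab = ["hello", "pizza", "car", "dog"]          # toy vocabulary
logits = torch.tensor([[1.2, 3.5, -0.7, 0.1]])    # scores the model assigned to each token (batch of 1)
label = torch.tensor([vocab.index("pizza")])      # we want "pizza" to be the next token

# cross_entropy applies the softmax internally and compares the result with the label index
loss = F.cross_entropy(logits, label)
print(loss)  # the lower the loss, the higher the probability assigned to "pizza"
```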
Now here is the pseudocode that they show in the 00:14:27.520 |
CLIP paper on how to implement the CLIP training with the contrastive loss 00:14:31.840 |
So basically we have a list of images and a list of text 00:14:35.360 |
We encode them and they will become a list of vectors called image vectors and text vectors here 00:14:44.720 |
We normalize them; later we will see why we normalize stuff. 00:14:49.040 |
But okay, it's to make sure that we reduce the internal covariate shift; for now ignore it 00:14:53.680 |
Anyway, we normalize them later. We will talk about normalization 00:14:56.900 |
We calculate all the possible dot products between these embeddings 00:15:01.520 |
So the text embeddings and the image embeddings, so we basically generate this grid here 00:15:08.720 |
We generate the labels the labels are what well for the first row 00:15:13.280 |
We want the label the first item to be maximum for the second row the second item for the third row the third item 00:15:22.800 |
This is done with the function arange, which basically generates the numbers between zero and, in this case, n minus one 00:15:29.680 |
So for the row number zero, we want the item number zero to be maximum for the row number one 00:15:35.600 |
We want the item number one, etc, etc until the row number n minus one 00:15:38.880 |
We want the n minus one item to be the maximum one 00:15:42.480 |
Then we calculate the cross entropy loss between what is the output of the model 00:15:45.920 |
So what are the numbers assigned by the model to each of these dot products and what we want? 00:15:50.560 |
The maximum to be among these numbers. This is the labels 00:15:54.240 |
And we do it by rows and by columns, as you can see here. This gives us two 00:16:03.200 |
losses, and we compute the average, so we compute the average loss between all the rows and all the columns 00:16:10.480 |
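Here is a rough PyTorch sketch of the CLIP-style contrastive loss just described; it is an assumption-laden simplification (for instance, it omits the learned temperature scaling that CLIP uses), not the paper's exact pseudocode:

```python
# Rough sketch of a CLIP-style contrastive loss (simplified; no learned temperature).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # image_emb, text_emb: [n, d], one embedding per image / per text, produced by the encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # all possible dot products: logits[i, j] = image i . text j
    logits = image_emb @ text_emb.t()                 # [n, n]

    # the i-th image corresponds to the i-th text, so the "correct class" for row i is i
    labels = torch.arange(logits.size(0))

    loss_images = F.cross_entropy(logits, labels)     # softmax over each row (image vs all texts)
    loss_texts = F.cross_entropy(logits.t(), labels)  # softmax over each column (text vs all images)
    return (loss_images + loss_texts) / 2
```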
And this is how we do contrastive learning. Now, let's explore. What is the problem with CLIP? 00:16:20.560 |
Well, the problem with CLIP is very simple is that we are using the cross entropy loss 00:16:25.760 |
And the cross entropy loss basically needs to do a comparison between two distributions 00:16:32.160 |
So in language model we compare the output logits which are transformed into distribution 00:16:38.080 |
With the label so which item of this distribution we want to be the maximum one and we do the same here 00:16:45.600 |
We convert it into a distribution and we do it through a function called the softmax function 00:16:50.960 |
So the softmax function basically it is a function that takes as input a vector and converts it into a distribution 00:16:57.860 |
What does it mean? It means that when you have a vector like this, for example, it will be a list of numbers 00:17:04.960 |
To be a distribution each of these numbers needs to be non-negative. So it needs to be 00:17:09.760 |
Greater than or equal to zero and plus all of these numbers needs to sum up to one 00:17:18.320 |
The model will predict some arbitrary numbers: we cannot force the sum of these numbers to be one, and we cannot force the numbers to be non-negative. 00:17:18.320 |
So we apply to the output of the model this function called the softmax 00:17:31.440 |
Which transforms them into a distribution and then we can compare it with the labels 00:17:35.040 |
So our label in the case for example for the first 00:17:40.160 |
So we want the first item to be zero the second item to be one and this one to be zero this one to be zero 00:17:45.200 |
This one to be zero this one to be zero, but we need to apply the softmax to the output of the model 00:17:57.920 |
This is the expression of the softmax: basically we take the output of the model and we 00:18:03.120 |
exponentiate each item in the output vector, which could be a row or a column 00:18:08.240 |
And after exponentiating we also divide them with the sum of all the other items 00:18:17.760 |
So which means that we need to calculate first of all for each row the exponential of the item 00:18:23.840 |
And then we need to divide by the sum of all the exponentials of all the other items including itself 00:18:28.800 |
The the problem is that we are using this exponential. The exponential is basically a function that grows very fast 00:18:41.680 |
And this is a problem for computers because in computers we store numbers using a fixed representation 00:18:48.480 |
Which could be 16 bit or 32 bit which means that we cannot represent up to infinity 00:18:53.520 |
But we can represent each number up to 2 to the power of n minus 1 basically if you don't have negative numbers 00:18:59.520 |
So if the exponential is too big then our numbers will grow too much and it may not be represented by 32 bit 00:19:07.440 |
And that's a problem. So we need to make this softmax function numerically stable 00:19:13.520 |
So whenever you heard the term numerical stability in terms of computer science 00:19:17.360 |
It means that we want to make sure that the number can be represented within 32 bits or 16 bits or whatever 00:19:28.640 |
Well, the trick is this. In the softmax, each item is exponentiated and divided by 00:19:41.680 |
this denominator, which is known as the normalization constant, and which is the sum of the 00:19:47.360 |
exponentials of all the items in the vector 00:19:52.320 |
So in a fraction you can multiply the numerator and the denominator by the same number without changing the fraction 00:19:59.840 |
Each number can be written as the exponentials of the logarithm of the number 00:20:06.160 |
And this is because the exponential and the log are inverse functions 00:20:10.400 |
So we can write c as follows. So the exponential of the log of c 00:20:14.480 |
By using the properties of the exponential, which say that 00:20:21.280 |
the product of two exponentials is equal to the exponential of the sum of the arguments, 00:20:28.400 |
we can then bring this exponential inside the summation because of the distributive property of the product with respect to the sum, and apply again the 00:20:37.920 |
rule above, which is that the product of two exponentials is equal to the exponential of the sum of the arguments 00:20:43.620 |
Now what we notice is that if we subtract something from this exponential 00:20:52.480 |
We can make the argument of the exponential smaller which may make it numerically stable 00:20:58.320 |
So what we choose as this log of c, basically we choose the 00:21:02.700 |
Negative maximum number in the array that we are normalizing using the softmax 00:21:07.440 |
This way basically the argument of the exponential will decrease and it will be less likely that this exponential will overflow 00:21:22.460 |
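To summarize the derivation above in one place, here is the standard safe-softmax rewriting, with x_i denoting the elements of the vector being normalized (this is just a restatement of the steps described in the transcript):

```latex
\mathrm{softmax}(x)_i
  = \frac{e^{x_i}}{\sum_j e^{x_j}}
  = \frac{c\, e^{x_i}}{c \sum_j e^{x_j}}
  = \frac{e^{x_i + \log c}}{\sum_j e^{x_j + \log c}},
\qquad \text{choosing } \log c = -\max_k x_k
\;\Rightarrow\;
\mathrm{softmax}(x)_i = \frac{e^{x_i - \max_k x_k}}{\sum_j e^{x_j - \max_k x_k}}
```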
Now this basically means that to calculate the cross entropy loss over each of these rows and columns, 00:21:32.940 |
first of all the model needs to output a list of 00:21:36.460 |
Text embeddings and a list of image embeddings as you can see then we do all the possible dot products 00:21:45.260 |
We need to find the maximum value in this column so that we can subtract it before calculating the softmax 00:21:51.120 |
Then we need to apply the exponential to each of these items 00:21:54.780 |
then we sum up all of this exponential to calculate the 00:21:59.160 |
Normalization constant then we divide each of these numbers by this normalization constant 00:22:03.800 |
So as you can see, applying the cross entropy loss involves a lot of computations, and 00:22:09.960 |
also it forces you to always have full rows or columns in memory. Imagine you want to parallelize this operation across devices: 00:22:21.640 |
this device here needs to have the whole row in its memory, because it needs to calculate this normalization constant, 00:22:27.960 |
so it needs to have access to all of this row; and if you want to parallelize by columns, each device needs to have a whole column 00:22:35.800 |
in its memory, because you need to calculate first of all the maximum item, then you need to calculate this normalization constant, 00:22:41.960 |
then you need to normalize them, so dividing by this normalization constant 00:22:47.400 |
But also it makes it difficult to parallelize because at any moment each device needs to have at least one full row or one full 00:22:53.960 |
Column, which does not allow us to go to very big batch size 00:22:57.880 |
And this is a problem. So if you look at the SigLIP paper, they note that, 00:23:04.360 |
due to the asymmetry of the softmax loss, the normalization is also independently performed two times 00:23:10.600 |
So first of all to make the softmax numerically stable, we need to go through each single vector calculate the maximum 00:23:19.160 |
but then we also need to calculate the softmax by rows and then by columns why because this 00:23:25.800 |
Matrix here is not symmetric. So as you can see 00:23:28.920 |
This is image number one with all the text and this is 00:23:32.840 |
Text number one with all the images and this item here is not equal to this item here 00:23:37.480 |
Because this is image number one with the text number two, and this is image number two with the text number one 00:23:43.640 |
Because it's not symmetric means that you need to calculate the softmax for each single rows 00:23:48.040 |
And then you need to calculate it for each single column and then you can calculate the loss 00:23:52.840 |
So the problem with CLIP is that it's very computationally expensive to calculate this contrastive loss; 00:24:00.680 |
that's why in the SigLIP paper they propose to replace the cross entropy loss with the sigmoid loss. 00:24:13.160 |
Again, we have an image encoder that converts a list of images into a list of embeddings, one for each image 00:24:19.880 |
Then we have list of text which convert each text into a list of embedding one for each text 00:24:29.320 |
We calculate this all the possible dot products 00:24:31.880 |
So the image number one with the text number one image number two with text number two and also image number one with text 00:24:37.160 |
Number two text number three text four text five blah blah. So all the possible dot products between all these embeddings 00:24:43.100 |
Then, instead of treating the loss as a distribution over a row or a column, 00:24:55.160 |
where we say in this column I want this item to be maximum, or in this row I want this item to be maximum, 00:25:06.040 |
we treat it as a binary classification task using the sigmoid loss, 00:25:09.720 |
In which each of these dot products is treated independently from each other 00:25:15.400 |
So this is considered a single binary classification task in which we say okay this item here should be one 00:25:21.880 |
This item here should be zero. This item here should be zero. This item here should be zero independently of what are the other items 00:25:29.400 |
This one here should be zero. This one should be here zero, etc, etc, and we can do that with the sigmoid function 00:25:35.480 |
So as you can see, this is the mathematical expression of the sigmoid function 00:25:38.920 |
It takes as input this value called z which will be the dot product of our vectors 00:25:45.000 |
And the output of the sigmoid is this stuff here, which is a number between zero and one 00:25:51.160 |
So what we can do is we take each of these dot products. We run it through a sigmoid 00:25:55.900 |
And then we force the label to be one for corresponding 00:26:01.240 |
text and images and zero for non-corresponding ones. So each of these dot products now becomes an independent binary classification task. This allows us to 00:26:13.240 |
grow the batch size to millions of items and also to parallelize, because we can put this block here into one device 00:26:20.600 |
And it can calculate it independently from this other device because they do not need to calculate any normalization 00:26:27.400 |
Constant for each item or the maximum item in each row or column because each of them is independent from the others 00:26:34.360 |
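Here is a rough sketch of this pairwise sigmoid loss idea; it is a simplification (for example, it ignores the learnable temperature and bias terms that the SigLIP paper uses), not the paper's exact formulation:

```python
# Rough sketch of a SigLIP-style pairwise sigmoid loss (simplified: no learnable temperature/bias).
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # image_emb, text_emb: [n, d]
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t()                      # [n, n] dot products

    # label is 1 on the diagonal (matching pairs), 0 elsewhere; every entry is an
    # independent binary classification, so no row/column normalization is needed
    labels = torch.eye(logits.size(0), device=logits.device)
    return F.binary_cross_entropy_with_logits(logits, labels)
```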
Now you may be wondering why we are even using a contrastive vision encoder. 00:26:41.960 |
Why can't we just use an ordinary vision encoder that just takes an image and extracts some kind of embeddings that capture the information? 00:26:54.280 |
Well, we want these embeddings to not only capture information about the image, but we want these embeddings to be a 00:27:01.720 |
Good representation that can be then contrasted or can be used along with text embeddings 00:27:09.100 |
And this is exactly what we do in a vision language model. We extract some 00:27:16.020 |
embeddings, which are vectors representing, as we will see later, a patch of the image. 00:27:21.480 |
So you need to think of this image as being divided into a grid, and this first cell corresponds to the first embedding. 00:27:28.920 |
So we produce in this case, for example, nine embeddings, which are nine vectors. 00:27:33.800 |
Each of them represents information about a patch of the image, 00:27:42.200 |
and we want these embeddings not only to represent the information of these patches, but also to be able to be contrasted with the text, 00:27:48.520 |
Which is what we do in a visual language model 00:27:50.360 |
So we have some prompt and we kind of contrast it with the image embeddings to produce an output 00:27:57.560 |
It is not really a contrastive learning in this case because we are using it as a condition 00:28:02.600 |
We will see later how these things are merged 00:28:04.920 |
But we want a vision encoder that is already trained to be used with text, because it has a better 00:28:11.880 |
representation of the image for being used along with the text. That's why we use a contrastive vision encoder. 00:28:18.360 |
Also, we use them because they are cheaper to train: 00:28:21.880 |
basically, to train a contrastive vision encoder 00:28:26.600 |
You just need to crawl billions of images from the internet 00:28:30.360 |
Each of them already has a kind of a description because you can for example in wikipedia 00:28:35.480 |
You always have the description of each image, but also the internet when you have an image you always have the html alt text 00:28:44.040 |
Which is the alternative text that is displayed when the image is not shown 00:28:47.320 |
So you always have access to some kind of description 00:28:49.980 |
Now, of course, this data may be noisy because we crawl stuff from the internet, 00:28:55.400 |
Which means that this stuff may not always be correct 00:28:58.280 |
So sometimes you see a picture but the description displayed is not correct or maybe the crawler didn't get the correct information 00:29:04.920 |
But because we train it on billions and billions and billions of images eventually it learns a good representation of this image 00:29:13.880 |
So this vision encoder that we will be using is basically a vision transformer. So now let's talk about the vision transformer 00:29:24.600 |
So the vision transformer is basically a transformer that was introduced in the paper "An Image is Worth 16x16 Words", 00:29:32.680 |
in which basically they train a transformer as follows. We will explore later the 00:29:45.640 |
attention mechanism, but for now I just need you to remember that the transformer model is a sequence-to-sequence model, 00:29:52.520 |
which means that you feed it a sequence of embeddings and it outputs a sequence of embeddings. 00:30:00.180 |
What we do to encode an image with the vision transformer is we take an image and we 00:30:07.240 |
Split it into patches and in this case, for example, we can split into 16 patches 00:30:13.000 |
So this is the first group of pixels. This is the second group of pixels 00:30:17.160 |
This is the group of pixels on the bottom right of the image, this one is on the top right, etc., etc. Then we extract 00:30:26.280 |
information about each patch using a convolution. 00:30:29.020 |
So when you run a convolution you can extract information about a group of pixels from the image 00:30:36.120 |
And then for example, this one will produce this output 00:30:39.640 |
This one the convolution of this patch will produce this output. The convolution of this patch will produce this output, etc, etc 00:30:46.520 |
And then we flatten them, so we lose the positional information: 00:30:50.300 |
we don't care anymore if this patch number four was on the top right or the bottom left. 00:31:00.200 |
We lose the two-dimensionality in this case, basically, so we transform it into a sequence of patches. 00:31:08.760 |
Then we add this position information so we say that okay, this is the patch number one 00:31:16.680 |
This patch basically the embedding of this patch that will be the result of this convolution will be a vector 00:31:22.600 |
We add to this vector another vector that tells the model 00:31:27.800 |
Hey, this is the patch number one and this is the patch number two, and this is the patch number three, etc, etc 00:31:32.920 |
So we do that by adding, so this plus operation you can see here. Unlike in the 00:31:38.920 |
vanilla transformer, or the transformer model that we see for language models, 00:31:42.600 |
these positional encodings are not calculated using sinusoidal functions; they are learned. 00:31:48.040 |
So they are vectors that get added always so the positional encoding number one always gets added to the top left 00:31:55.720 |
Patch the positional number two always gets added to the second patch from the top left, etc, etc 00:32:02.040 |
The positional encoding number 16 always gets added to the bottom right patch, so that the model 00:32:09.560 |
kind of has access to the 2D structure of the image. 00:32:13.800 |
So the model will basically learn that the patch number 16 is always on the bottom right and this one is always on the top left, 00:32:22.200 |
So this is a series of embeddings, because the sum of two embeddings is still an embedding. 00:32:30.760 |
Let's consider it as a black box and later when we code it, we will explore each layer of this transformer 00:32:35.500 |
The transformer what it does it does the contextualization of these embeddings 00:32:40.680 |
So at input we have this each series of embeddings each of them representing one single patch 00:32:47.640 |
The output of the transformer through the attention mechanism will be a series of embeddings again 00:32:52.920 |
But each of these embeddings is not only capturing information about itself, but also about other patches 00:33:02.440 |
In language models, in the attention mechanism, we use what is known as the causal mask. So this first 00:33:08.280 |
embedding should be capturing information only about itself, the second one only about itself and the previous one, the third one 00:33:17.240 |
about itself and the two previous ones, the fourth one about itself and the three previous ones, etc. 00:33:23.000 |
This is what we do with the language models. With visual language models, in the... 00:33:27.880 |
sorry, not with visual language models, but with the vision transformers, 00:33:35.720 |
we do not make the model autoregressive: we don't want these patches to only encode information about the previous patches, because in an image 00:33:43.240 |
there is no autoregressiveness. So it's not like the patch number 16 of an image 00:33:48.920 |
It depends only on the previous patches and the patch number one does not depend on any others 00:33:53.960 |
Because imagine you have an image in which the sun is here or the light source is here 00:34:00.360 |
then this part here will be illuminated, but also this other part here. 00:34:05.320 |
So the illumination here depends on what is coming after in the image. 00:34:10.680 |
So in the image, we don't have this autoregressive relationship, 00:34:15.400 |
while in text we do, because we write the text from left to right or from right to left; 00:34:21.080 |
But anyway, each word that we write depends on what we have written previously 00:34:25.400 |
But this doesn't happen with images. So basically these contextualized embeddings 00:34:30.460 |
capture information about themselves, but also about all the other embeddings. 00:34:37.800 |
We use these contextualized embeddings to capture information about each patch, 00:34:43.080 |
but also about how it is present in the image. That's why we want them to be contextualized. 00:34:47.740 |
So we want each patch to include information about its position, which is given by the positional encoding, and about the rest of the image, which is given 00:34:58.600 |
by contextualizing them. So when we code it, this will be more clear. For now, I just want you to get an 00:35:05.400 |
Idea of what we are going to code. So we are going to code a model that will take an image will apply a convolution 00:35:13.020 |
To extract a series of embeddings. You can see here. We will add a positional encoding to these ones 00:35:19.560 |
Which are learned we will apply the attention mechanism 00:35:23.480 |
Which is will be a series of layer actually of the transferable model that will contextualize these embeddings 00:35:29.080 |
And then we will use this contextualized embedding as input to the language model for decoding the output of the language model 00:35:40.920 |
I will be using a slightly different approach, which is: I will not be typing the code live; 00:35:45.560 |
I will be copying each line and explaining it step by step, because I want this video to be more about explanation than just 00:35:52.040 |
Coding because I want to use the code for explaining what happens under the code under the hood 00:35:58.280 |
So let's create our first file, which is the modeling 00:36:07.560 |
And let's start by importing stuff which we need I don't need copilot 00:36:14.060 |
And then we create our first class, which is the SigLip config 00:36:19.100 |
So, what is this? Basically we will be using this vision encoder, and this vision encoder will have some 00:36:27.700 |
configuration. Why do we need a configuration class? Because PaliGemma comes in different sizes, 00:36:39.540 |
which means that each of these PaliGemma models has a different configuration for its vision encoder. 00:36:48.420 |
The hidden size basically is the size of the embedding vector of this vision transformer that we are going to code. 00:36:57.700 |
The intermediate size is the size of the linear layer that we use in the feed-forward network. 00:37:02.340 |
The number of hidden layers is the number of layers of this vision transformer. 00:37:06.820 |
The number of attention heads is the number of attention heads in the multi-head attention. 00:37:10.500 |
The number of channels is how many channels each image has, which is three for RGB. 00:37:15.080 |
The image size: PaliGemma comes in, if I remember correctly, three sizes, so 224, 448 and 896. 00:37:26.180 |
The default configuration that we put here is for PaliGemma 224, 00:37:29.960 |
which of course supports images of size 224. So if you provide any image, it first gets resized into that size. 00:37:39.840 |
The patch size is the size of each patch. So what is this number? 00:37:42.980 |
Each image will be divided into patches, and each patch will be 16 by 16 pixels. 00:37:52.260 |
Then we have a parameter for the layer normalization; we will see it later. 00:37:54.420 |
The attention dropout is another parameter that we will not be using in the attention calculation; 00:37:58.900 |
basically, it's a dropout that we use in the attention, but we will not be using it. 00:38:02.660 |
And the number of image tokens indicates how many output embeddings this vision transformer will output. 00:38:17.460 |
Now, we saw before that an image encoder is something that converts an image into one single embedding 00:38:24.340 |
So that represents all the information about that image 00:38:27.140 |
but in the case of the vision transformer, we can use all the outputs of the vision transformer, because as we saw before 00:38:33.940 |
Vision transformer is a transformer model. So which takes as input 00:38:38.180 |
A list of embeddings and it outputs a contextualized embedding 00:38:42.820 |
So each of these contextualized embedding will be the tokens of our image 00:38:46.740 |
so it will not be one single embedding that represents the whole image, but 00:38:49.940 |
a list of embeddings, each representing a patch of the image, but also containing information about the other patches through the attention mechanism. 00:38:57.460 |
But we will see this later. So for now this class is very basic: it's just the configuration of our SigLip model. 00:39:03.380 |
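As a reference, here is a minimal sketch of what such a configuration class might look like; the field names mirror the values just described, but treat the exact names and defaults as assumptions rather than the video's verbatim code.

```python
# Minimal sketch of a SigLip vision configuration class (field names/defaults are assumptions).
class SiglipVisionConfig:
    def __init__(
        self,
        hidden_size=768,            # size of the embedding vector of each patch
        intermediate_size=3072,     # size of the linear layer in the feed-forward network
        num_hidden_layers=12,       # number of transformer layers
        num_attention_heads=12,     # heads in the multi-head attention
        num_channels=3,             # RGB
        image_size=224,             # input images are resized to image_size x image_size
        patch_size=16,              # each patch is patch_size x patch_size pixels
        layer_norm_eps=1e-6,        # epsilon parameter for layer normalization
        attention_dropout=0.0,      # dropout in attention (unused here)
        num_image_tokens=None,      # how many image embeddings the vision tower outputs
        **kwargs,
    ):
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.image_size = image_size
        self.patch_size = patch_size
        self.layer_norm_eps = layer_norm_eps
        self.attention_dropout = attention_dropout
        self.num_image_tokens = num_image_tokens
```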
Now let's start by coding the structure of this vision transformer. So let me copy this stuff here 00:39:16.260 |
I am copying the code because I have already written it before, and I want to explain it instead of 00:39:21.780 |
typing it live, because this also allows me to copy the comments and to avoid any mistakes while coding. 00:39:29.220 |
But I recommend that you code it from scratch. So you take this video and you just type whatever I am pasting here 00:39:37.460 |
This is the best way to learn because it's like when you study a mathematical proof 00:39:42.500 |
You should not just watch the proof on the piece of paper 00:39:45.860 |
Because even if it you think it makes sense to you 00:39:49.460 |
It doesn't actually because when you write it by hand, so when you code each of these lines by hand 00:39:55.300 |
Your mind will think why am I typing this? Why am I writing this? Why am I multiplying this number by this number? Why am I? 00:40:03.380 |
Calling this function so you question yourself when typing 00:40:08.180 |
That's why I recommend that you type this code while I am pasting it 00:40:12.420 |
I do it by pasting otherwise this video will be 20 hours 00:40:17.140 |
The first thing that we do is we create this vision 00:40:19.140 |
Model, this vision model is made up of a transformer and it has a configuration 00:40:23.380 |
So basically what we are doing is we take the pixel values of our image, which will be loaded with NumPy. 00:40:29.300 |
So when you load an image with NumPy, it gets converted into an array that is channels by height by width. 00:40:35.540 |
But we can have a batch of images. That's why we have a batch size here. So the batch dimension 00:40:41.940 |
And our vision transformer will convert this into a tensor of shape batch size by num patches, 00:40:47.140 |
which is the num image tokens we have here, by embed dim, so each 00:40:51.300 |
vector will be of a fixed dimension called embed dim here. 00:40:56.340 |
So basically our vision model will take an image as you can see a batch of images and it will give us a batch of 00:41:04.100 |
List of embeddings one list of embeddings for each image where each embedding is a vector of size embeddim 00:41:11.480 |
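Here is a rough sketch of what this top-level wrapper might look like; it assumes the SiglipVisionConfig sketched earlier and the SiglipVisionTransformer module that we are about to build (shapes in the comments follow the description above):

```python
import torch
from torch import nn

# Rough sketch of the top-level vision model wrapper (assumes SiglipVisionConfig and
# SiglipVisionTransformer defined elsewhere in this walkthrough).
class SiglipVisionModel(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.vision_model = SiglipVisionTransformer(config)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: [batch_size, channels, height, width]
        # returns:      [batch_size, num_patches, embed_dim]
        return self.vision_model(pixel_values=pixel_values)
```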
Okay. Now let's code the vision transformer, which is very simple also 00:41:27.400 |
Where we pass the configuration we save this embeddim, which is the hidden size 00:41:31.560 |
We saw before which is the size of this embedding vector 00:41:34.360 |
We first need to extract the embeddings from this 00:41:40.180 |
We need to extract the patches from this image, which will be done with this layer that we will call SiglipVisionEmbeddings. 00:41:46.680 |
Then we will run it through a list of layers of the transformer 00:41:51.060 |
Which is this SigLip encoder because it reminds the encoder of the transformer 00:41:55.380 |
Which is a series of layers of transformer and then we will have a layer normalization and we will see later how layer normalization works 00:42:07.060 |
So the forward method is basically we take these 00:42:09.700 |
pixel values, which is a batch of images, and we convert them into embeddings, which 00:42:16.100 |
basically means that we are extracting the patches from these images. So let's visualize it here. 00:42:25.540 |
Image embeddings we are taking these images. We will run a convolution here to extract patches 00:42:32.260 |
Then we will flatten these patches and add the positional encodings 00:42:35.960 |
And this stuff here will be done by this SiglipVisionEmbeddings layer. Then we take these 00:42:44.420 |
patches plus the positional encodings and we run them through this encoder, which is a list of layers of the transformer. 00:42:51.300 |
So this stuff here is our encoder. What is the encoder? 00:42:54.340 |
Well, the encoder is a list of layers of the transformer 00:42:57.860 |
So you can think of it as being a list of these layers here. Actually these layers here 00:43:02.820 |
one after another which includes a multi-head attention, a 00:43:07.300 |
normalization, a feed-forward network and the normalization 00:43:10.440 |
In the case of the vision transformer, the normalization is done before the feed-forward and before the multi-head attention, but that's the only difference. 00:43:17.940 |
So this part here, this series of layers, is what here 00:43:24.100 |
we call the encoder, because it resembles the encoder side of the transformer. 00:43:28.200 |
And then we have a layer normalization. So now let's go to code this vision embeddings 00:43:34.500 |
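Before moving on to the embeddings, here is a sketch of how the pieces just listed (embeddings, encoder, final layer norm) might fit together in this wrapper; SiglipVisionEmbeddings and SiglipEncoder are assumed to be the modules built in the rest of the walkthrough:

```python
import torch
from torch import nn

# Sketch of the vision transformer wrapper: patch embeddings -> encoder layers -> final layer norm.
class SiglipVisionTransformer(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        embed_dim = config.hidden_size
        self.embeddings = SiglipVisionEmbeddings(config)   # image -> patch embeddings + positions
        self.encoder = SiglipEncoder(config)                # stack of transformer layers
        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: [batch_size, channels, height, width]
        hidden_states = self.embeddings(pixel_values)       # [batch_size, num_patches, embed_dim]
        hidden_states = self.encoder(inputs_embeds=hidden_states)
        return self.post_layernorm(hidden_states)           # [batch_size, num_patches, embed_dim]
```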
So we want to extract information about these patches 00:43:37.880 |
Let's do it. Where are the vision embeddings? Here. Okay 00:43:56.100 |
Taking again the configuration because each of these models needs to have access to the configuration because they need to extract different 00:44:01.860 |
Information from this configuration. So we have the embedding size, which is the size of the embedding vector, which is the hidden size 00:44:10.980 |
And the patch size is how big the patch that we want to get from this image is. So basically we are talking about 00:44:20.900 |
blocks of pixels; in this case the patch size, I remember, is 16, 00:44:23.940 |
which means that each patch we are going to take here is going to be 16 by 16 pixels. 00:44:32.000 |
How do we extract these patches? We do that through a 2D convolution, which takes as input 00:44:38.740 |
the number of channels of the image, so three channels, RGB, and it produces output channels equal to the embedding size. 00:44:49.620 |
Then the kernel size: as you remember, the convolution works like this, so let's use the iPad actually to draw. 00:44:56.020 |
The convolution works like this. So we have an image 00:44:58.900 |
Which is made up of let's say pixels. So suppose this is the grid of pixels 00:45:09.780 |
Basically the convolution works like this imagine the kernel size is three by three 00:45:16.020 |
So we take a three by three group of pixels. We apply this convolution kernel 00:45:21.220 |
So if you are not familiar with how convolutions work, I will not be reviewing that here 00:45:26.100 |
But basically it means that we have a matrix here 00:45:28.260 |
You multiply each number of this matrix by the value of the pixel on which it is applied, and summing them up it will produce one output feature. 00:45:39.700 |
And then you slide this kernel to the next group of pixel then you slide it again 00:45:44.900 |
Slide it again, etc, etc, and it will produce many features in the output features 00:45:49.700 |
However at as input we have three channels which you can think of it as three 00:45:55.700 |
Parallel images one that is only red one that is only green and one that is only blue 00:46:01.460 |
We run this kernel on all of these channels, and it will produce a number of output features 00:46:09.920 |
depending on how many output channels we want. So for each output channel 00:46:15.440 |
we actually have three kernels, one for each of these input channels. 00:46:27.440 |
The stride is how much we move the kernel from one group of pixels to the next, and we are using a stride that is equal to the patch size, 00:46:34.240 |
which is equal to the kernel size. So which means that we take the first, oops, 00:46:40.400 |
we take the first group of, let's say, three by three pixels, 00:46:43.440 |
then we skip three pixels, so we slide it to the next group of three by three, so there is no overlap. 00:46:54.400 |
Then we slide it to this group of pixels here so that there is no overlap. So basically what we are taking is a 00:46:59.280 |
list of features, each extracted from an independent patch of this image that we run the kernel on. 00:47:07.840 |
And the padding "valid" means that there is no padding added. 00:47:11.200 |
So basically this patch embedding is extracting information from our image patch by patch 00:47:18.000 |
Where there is no overlap between these patches. How many patches do we have? 00:47:21.920 |
Well, it's the size of the image, which is 224 in the base version of PaliGemma. 00:47:31.200 |
So it is the image size, which is the number of pixels per side, divided by how big each patch is, and then to the power of two, because we have 00:47:38.000 |
this image along two dimensions. So we run the patch over both dimensions; the patch 00:47:41.840 |
is a square, so it's 16 by 16 or 3 by 3 or whatever the patch size is. 00:47:55.360 |
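As a concrete check of this formula with the default values above:

```python
# num_patches = (image_size // patch_size) ** 2
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2   # (224 // 16) ** 2 = 14 ** 2 = 196
print(num_patches)  # 196 patches, i.e. 196 image embeddings
```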
The number of positions is equal to the number of patches that we have, because we need to encode information about where each patch came from. 00:48:01.280 |
So how many positional encodings we need equal to the number of patches that we have 00:48:06.080 |
And what is each of this positional encoding? It's a vector. It's a vector of the same size of the patch 00:48:11.920 |
So it's equal to embeddings. You can see here 00:48:14.480 |
And it's a learned embedding. So it's a positional encoding that is a learned 00:48:20.160 |
embedding. How many do we have? We have num positions of them, each of them with this size here. 00:48:26.320 |
And we will see later that each of them is added to the information extracted from the convolution 00:48:32.160 |
So that each convolution output encodes information about where it came from in the image 00:48:40.800 |
We also save the position ids in the module, which are just a list of numbers, and we will use them later. 00:48:47.440 |
So this is just a range of numbers between zero and num positions minus one. 00:49:00.320 |
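Collecting these pieces, a sketch of the embeddings module's constructor might look like this (its forward pass is discussed right after); the exact attribute names are assumptions based on the description above:

```python
import torch
from torch import nn

# Sketch of the patch + position embeddings module described above
# (assumes the SiglipVisionConfig sketched earlier).
class SiglipVisionEmbeddings(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        # Extracts one embed_dim-dimensional feature per non-overlapping patch
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,     # stride == kernel size -> no overlap between patches
            padding="valid",            # no padding added
        )

        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches
        # One learned positional vector per patch
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        # [0, 1, ..., num_positions - 1], saved in the module so we can reuse it later
        self.register_buffer(
            "position_ids",
            torch.arange(self.num_positions).expand((1, -1)),
            persistent=False,
        )
```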
Again, I copy and paste the code because I can copy all the comments without typing them one by one; otherwise, it would take me forever. 00:49:06.000 |
So what we do now is okay. We had our image which is a pixel values here 00:49:10.640 |
The pixel values come from NumPy; we will see later how we load the image, 00:49:15.760 |
but basically you have to think that you load the image with NumPy, and NumPy gives us a 00:49:20.880 |
Batch of images, which is a channel height and width. It's a tensor with three channels and with the height of the image and the width of the image 00:49:31.840 |
Height and width are the same, because we resize each image to the input size expected by the model. 00:49:38.320 |
So, in the case where we are using the smallest PaliGemma, we will resize each image to 224 by 224. 00:49:47.040 |
We extract the patch embeddings through this convolution, as you can see here. 00:49:51.520 |
So this will basically take our image, which is a batch of images, and convert it into patch embeddings, 00:50:00.400 |
so each image will become a list of embeddings of size embed dim. 00:50:06.420 |
How many patches do we have? Well, the number of patches 00:50:10.400 |
along the height times the number of patches along the width. 00:50:14.720 |
In this case, it will always be the same so you can think of it as a number of patches a total number of patches 00:50:20.720 |
Each of patches with the dimension embedding dimension 00:50:26.900 |
And as we saw before we flatten these ones, so we extract them here. Let me delete it 00:50:38.960 |
So we run the convolution and then we flatten them here 00:50:43.440 |
So basically the convolution will give us 1 2 3 4 5 6 up to 16 or whatever the number of patches is 00:50:49.920 |
and then we convert it into a tensor where the 00:50:55.120 |
So the first patch is here and the last patch is the last element of this tensor and this is what we do here 00:51:00.880 |
Here, because the output of the convolution is a 2D grid, but we don't want a 2D grid: 00:51:07.520 |
we only want one long one-dimensional list of patches, and this is done by this flatten method here. 00:51:13.520 |
Then we transpose because we want the number of patches to come before the embedding dimension 00:51:19.300 |
because as input to the transformer we need to give a sequence of embeddings. 00:51:24.480 |
So that's why we want this num_patches dimension to come before, so that it becomes a batch 00:51:29.600 |
of sequences of embeddings, where each embedding is a vector of size embed dim. Then to 00:51:37.360 |
each of these embeddings we add the positional encodings. Which positional encodings? Well, the ones from the position embedding layer. 00:51:46.140 |
But which embeddings do we want to extract? All of them, so from 0 to 15 in our drawing. 00:51:53.440 |
Where is this information, 0 to 15? It is in this self.position_ids, which is a range. 00:52:00.080 |
So as you remember, arange just generates a list of numbers between 0 and the argument minus 1. 00:52:06.960 |
So we extract all the positional encodings from this position embedding 00:52:12.240 |
Layer, which is this embedding layer here. We add it to the embeddings 00:52:16.880 |
So what we are doing basically is we flatten this embedding 00:52:20.320 |
We did that before then we add a positional encoding vector extracted from the positional encoding layer 00:52:25.600 |
And these positional encodings are learned. Why learned? Because this embedding layer here is a list of learnable embeddings, 00:52:34.800 |
so that when the model is trained these embeddings will change according to the needs of the model. 00:52:42.640 |
So it's not like we are explicitly telling the model "this is position number one, this is position number two": 00:52:48.000 |
we just add another embedding to each patch embedding, 00:52:54.480 |
And then the model will learn to modify this positional embedding vector in such a way that they should encode the position 00:53:01.820 |
Information because each of this position embedding is always added to the same patch 00:53:07.020 |
So the first patch always receives the position number zero the second patch always the position number one 00:53:11.580 |
We hope that the model actually tries to change this position embedding in such a way that they encode the positional information 00:53:17.580 |
And actually it does, because the model actually learns to relate the 00:53:23.580 |
patches with each other by using their positional information. 00:53:27.660 |
And the only way for the model to do that is to change this position embedding in such a way that they encode the position information 00:53:33.840 |
If you remember from the vanilla transformer, we use the sinusoidal functions 00:53:38.300 |
So if you want to look at the original transformer if you remember 00:53:45.740 |
Where is it here? So we create this position encoding using sinusoidal functions 00:53:52.780 |
So instead of learning them we actually pre-compute them and then we force the model to learn the pattern 00:53:58.780 |
Encoded by these sinusoidal functions in this case. We are not forcing the model to learn any pattern 00:54:04.060 |
We want the model to create the pattern that is most useful for the model itself 00:54:08.220 |
So we hope that the model will train this embedding layer in such a way that it creates some pattern that is useful for encoding the positional information. 00:54:24.540 |
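To tie the last few steps together (convolution, flatten, transpose, add learned positions), here is a sketch of the forward method of the embeddings module sketched earlier; the shapes in the comments follow the walkthrough above:

```python
import torch

# Sketch of the forward pass of SiglipVisionEmbeddings (continuing the constructor sketched above).
def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: [batch_size, channels, height, width]
    patch_embeds = self.patch_embedding(pixel_values)   # [batch_size, embed_dim, num_patches_h, num_patches_w]
    embeddings = patch_embeds.flatten(2)                # [batch_size, embed_dim, num_patches]
    embeddings = embeddings.transpose(1, 2)             # [batch_size, num_patches, embed_dim]
    # add the learned positional embedding of each patch (position_ids = [0 .. num_positions - 1])
    embeddings = embeddings + self.position_embedding(self.position_ids)
    return embeddings
```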
Now we skipped before the normalization layer. So let's go actually to 00:54:29.020 |
Understand what is normalization and how it works so that we always don't leave anything behind that is not explained 00:54:36.620 |
All right. Let's talk about normalization. So imagine we have a list of linear layers 00:54:42.460 |
Now a linear layer is defined by two parameters 00:54:46.700 |
One is called the input features and one is called the output features 00:54:50.220 |
Imagine we have input feature is equal to four and output feature is equal to four 00:54:54.300 |
Actually, there is another parameter called bias 00:54:56.860 |
So it indicates if the linear layer also has a bias term and suppose that it's true 00:55:02.540 |
To the input of the linear layer usually we have a batch of items and each item is made up of features 00:55:11.260 |
Suppose that for now as input there is only one item and it's made up of four features 00:55:15.820 |
And as you can see the input features are four 00:55:18.380 |
What will happen with four output features is this: you can think of the linear layer 00:55:24.220 |
as a number of neurons, where the number of neurons is equal to the number of output features of this linear layer. 00:55:41.100 |
How many weights does it have? Well equal to the number of input features that this layer accepts 00:55:49.980 |
What each neuron will do it will do the dot product of the incoming vector 00:55:55.100 |
So the input vector x multiply dot product with the weight vector of this neuron plus the bias term 00:56:05.740 |
And this basically dot product plus this bias will produce one output feature 00:56:10.540 |
Because we have four neurons. We will have four output features 00:56:14.380 |
So each neuron will do the same job, but each neuron will have its own weight vector and its own bias number 00:56:20.540 |
So this one here will have its own weight vector different from the other ones and its own bias term here 00:56:28.860 |
Now imagine instead a linear layer that takes as input four features and produces two output features 00:56:34.140 |
So you can think of it as a linear layer with the two neurons 00:56:38.140 |
where the first neuron has a weight vector made up of four numbers because 00:56:43.740 |
The incoming vector has four features and then one bias term here 00:56:47.740 |
It will produce an output vector of two items 00:56:51.420 |
The first item will be this number here and the second item 00:56:54.860 |
The second dimension will be the dot product of the weight vector of this second neuron with the input vector 00:57:06.460 |
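As a quick sanity check of that picture, here is a tiny sketch (with made-up numbers) showing that each output feature of an nn.Linear is the dot product of the input with one neuron's weight vector plus its bias:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(in_features=4, out_features=2, bias=True)  # 2 "neurons", 4 weights each
x = torch.tensor([1.0, 2.0, 1.5, 0.5])                        # one item with 4 features

out = linear(x)  # shape: [2]

# Each output feature = dot product of x with one neuron's weight vector, plus its bias.
manual_out_0 = x @ linear.weight[0] + linear.bias[0]
manual_out_1 = x @ linear.weight[1] + linear.bias[1]
print(torch.allclose(out, torch.stack([manual_out_0, manual_out_1])))  # True
```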
Now, what is the problem with the linear layers, but actually with all layers in general? 00:57:12.140 |
The problem is called the covariate shift. The problem is that 00:57:18.860 |
if the input vector changes from one batch to another in magnitude, 00:57:24.240 |
then the output of the layer will also change in magnitude a lot, depending on what is the incoming vector 00:57:32.860 |
So for example, imagine this the first input vector is all the numbers are more or less around one and two 00:57:45.980 |
Then if the next vector that is coming to this layer is 00:57:49.660 |
Much different in magnitude from the first one then the output will also be much different in magnitude 00:57:58.220 |
So the problem is that if the input of a layer changes, then the output of this layer will also change a lot 00:58:04.140 |
So if the input changes drastically the output will also change a lot drastically 00:58:10.940 |
And because the loss of a model during training depends on the output, the loss will also change a lot, and the loss 00:58:17.820 |
then determines the gradient during backpropagation 00:58:21.200 |
It means that if the loss changes a lot then also the gradient will change a lot and if the gradient changes a lot 00:58:27.020 |
Then because the gradient determines how we update the weights of the model during training then also the update of these weights will also change a lot 00:58:36.300 |
basically what happens is that if the distribution of the 00:58:41.340 |
dimensions of the vector that is coming to the input of a layer 00:58:45.660 |
changes drastically from one batch to the next, 00:58:49.260 |
Then the output of the model will also change and then the loss will change then the gradient will change then the update of the weights 00:58:55.500 |
Will change so what we will see that the loss will oscillate a lot 00:58:59.020 |
And also the weights will try to keep up with this changing input distribution 00:59:03.840 |
Which basically will result in a model that trains slowly. So here I have made a simple recap 00:59:14.700 |
So a big change in the input of a layer will result in a big change in the output of a layer which will result 00:59:20.540 |
In a big change in the loss of the model, which will result in a big change in the gradient 00:59:25.840 |
during backpropagation, which will result in a big change in the weights of the network 00:59:31.580 |
And the result of this is that the network will learn very slowly, because the network will spend most of its 00:59:37.020 |
effort trying to keep up with this distribution change in the input 00:59:50.300 |
So the the first solution to this problem was batch normalization, which was introduced in this paper 00:59:55.660 |
And with batch normalization what we do basically is that we have usually not a single item as input 01:00:01.740 |
We have a batch of items: suppose that we are training an image classification model 01:00:10.460 |
For example the image of a cat the image of a dog of a zebra of a tree of a stone etc, etc 01:00:16.220 |
So you can think these are the dimensions of the vector that represent the cat 01:00:20.220 |
These are the dimensions of the vector that represent the dog. These are the dimensions of the vector that represent the zebra etc, etc 01:00:25.820 |
So what we do with batch normalization is that we calculate a statistic. 01:00:35.100 |
Which statistic do we calculate? The mean and the variance. And then we 01:00:42.680 |
normalize each item by subtracting the mean and dividing by the standard deviation, so that each dimension is distributed 01:00:54.380 |
according to a Gaussian with mean zero and variance of one 01:01:05.420 |
Because the image of a cat is much different from the image of the zebra 01:01:10.380 |
Because the color distribution is different. The rgb distribution is different. So the pixel intensity is much different from each other 01:01:16.780 |
What will happen is that the model will not see this change in magnitude 01:01:23.100 |
And also will not see a change in distribution because all of these items will be distributed according to a mean of zero and the variance 01:01:31.420 |
So what will happen is that the model will oscillate less in the output. So it will oscillate less in the loss 01:01:44.300 |
So the training will be more stable, and it will converge faster basically this way. So 01:01:54.860 |
Why do we need normalization? Because the input of the model, imagine you are training 01:02:00.860 |
an image classification model, depends on the image, and the images can be much different from each other 01:02:07.580 |
If the image changes a lot, we don't want the model to feel this change in magnitude of the input 01:02:13.500 |
We want the distribution of the inputs to remain constant, let's say, 01:02:17.340 |
so that the model doesn't oscillate, so that this doesn't force the model to just keep up with 01:02:24.560 |
this change in distribution. How do we do that? We try to keep the distributions 01:02:29.520 |
constant, so we always try to have the input features distributed according to a fixed distribution, 01:02:35.100 |
which has mean 0 and variance 1, and we do that with this formula here, which comes from probability and statistics: basically each 01:02:42.060 |
distribution, if you subtract its mean and divide by the standard deviation, will result in a Gaussian distribution of mean 0 and variance 1 01:02:49.980 |
Of course, this is valid also only for Gaussian distributions 01:02:58.220 |
And this will basically result in a more stable training 01:03:02.060 |
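A small sketch of the idea behind batch normalization (the real nn.BatchNorm1d also keeps running statistics and has learnable scale and shift parameters, which are omitted here):

```python
import torch

torch.manual_seed(0)
batch = torch.randn(8, 4) * 50 + 10   # 8 items, 4 features, arbitrary scale and shift

# Batch normalization: mean/variance are computed per feature ACROSS the batch dimension,
# so feature i of every item is mixed with feature i of all the other items in the batch.
mean = batch.mean(dim=0, keepdim=True)                      # [1, 4]
std = batch.std(dim=0, unbiased=False, keepdim=True)        # [1, 4]
normalized = (batch - mean) / (std + 1e-5)

print(normalized.mean(dim=0))                  # ~0 for each feature
print(normalized.std(dim=0, unbiased=False))   # ~1 for each feature
```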
Now, batch normalization actually worked fine. However, it has a problem. The problem is that 01:03:07.580 |
with batch normalization each of these statistics, so the mu and the sigma, are calculated 01:03:13.840 |
Along the batch dimension. So we calculate the mu and the sigma for the dimension number one of each of these vectors 01:03:21.820 |
Along the batch dimension. So basically to calculate this mean we are summing up the first dimension of each of these vectors 01:03:29.420 |
And divided by the number of items that we have 01:03:31.740 |
So we are mixing the features of different items 01:03:35.820 |
So we are mixing the dimension number one of the cat with the dimension number one of the dog 01:03:42.940 |
so basically, to have good results, we need to use a big batch, because 01:03:47.660 |
If we use for example a cat and the dog it will result in one mean 01:03:52.780 |
But imagine in the next batch, we have the cat and the zebra it will result in a completely different mean 01:03:58.620 |
And then the next supposing the next batch we have a cat and the tree maybe it results in another different mean 01:04:04.700 |
So we will still have this problem of covariate shift, because the mean is changing a lot between each iteration 01:04:11.120 |
So the only solution to this actually is to use a very big batch size 01:04:15.340 |
So we are forced to use a big batch size in order to alleviate this problem 01:04:19.660 |
Of kind of mixing the dimensions along the batch dimension 01:04:25.980 |
To solve this, we introduce layer normalization. With layer normalization, 01:04:28.860 |
what we do is, instead of calculating the statistics along the batch dimension, we calculate them along the feature dimension of each item. 01:04:36.220 |
So the mu and the sigma that will be used to standardize the cat will only be 01:04:41.900 |
dependent on the dimensions of the cat, not on whatever the cat comes with 01:04:48.300 |
So we are still doing each item minus its mean divided by the standard deviation, 01:04:55.580 |
but instead of this standard deviation and this mean coming from the first dimension of each item along the batch, they are computed using 01:05:03.180 |
all the dimensions of each item, independently from the others 01:05:07.420 |
So it doesn't matter which other item the cat comes with, it will always result in more or less the same mu and sigma 01:05:17.660 |
And this makes the training even more stable because we are not forced to use a big batch size 01:05:27.120 |
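And the layer normalization counterpart, where the statistics come only from the dimensions of each item itself, so the result does not depend on which other items are in the batch (again a sketch; nn.LayerNorm also has a learnable scale and bias, disabled here for the comparison):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = torch.randn(8, 4) * 50 + 10   # 8 items, 4 features each

# Layer normalization: mean/variance computed per item, along its own feature dimension.
mean = batch.mean(dim=-1, keepdim=True)                     # [8, 1]
var = batch.var(dim=-1, unbiased=False, keepdim=True)       # [8, 1]
manual = (batch - mean) / torch.sqrt(var + 1e-5)

layer_norm = nn.LayerNorm(4, elementwise_affine=False)      # same formula, no learnable params
print(torch.allclose(manual, layer_norm(batch), atol=1e-5))  # True
```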
Okay, we have seen what is normalization, now we should implement this thing called the encoder, so this SigLIP encoder 01:05:36.700 |
Now the encoder is made up of multiple layers of the transformer model 01:05:41.980 |
And the architecture more or less if you look at the vision transformer paper, it is like this 01:05:47.580 |
So I changed it a little bit because I wanted to use the exact names that we will be using 01:05:53.660 |
So what we have so far is this thing called the SigLIP vision embeddings, which is 01:06:00.540 |
taking some patches of this image using a convolution; each 01:06:05.740 |
output of this convolution is used as an embedding, it's a vector 01:06:10.380 |
And this embedding vector is added to another 01:06:14.300 |
Vector called the positional encoding which is learned and then we feed this stuff to this thing called the encoder 01:06:21.260 |
So we convert it into embeddings at the positional encoding then we feed it to the encoder 01:06:25.340 |
And at the input of the encoder you need to think that we have 01:06:28.620 |
These layers repeated n times here. It's written l times 01:06:33.340 |
One after another such that the output of one becomes the input of the next layer 01:06:38.780 |
the thing that you need to understand about the transformer is 01:06:42.460 |
I repeat it is that the transformer is a sequence-to-sequence model that converts a sequence of embeddings into contextualized embeddings 01:06:51.280 |
What does it mean? It means that at the input you have a list of 01:06:54.560 |
Here embeddings each representing a patch of the image as an independent patch 01:07:01.520 |
So this embedding here only captures information about the first group of pixels 01:07:06.000 |
This embedding here captures all information about the second group of pixels, etc, etc, etc 01:07:13.760 |
Through the attention mechanism, these embeddings become contextualized at the output of the transformer, and we will see this in detail, 01:07:23.600 |
such that this embedding here at the output of the transformer, the first embedding, 01:07:28.240 |
includes information not only about the first patch but also about the other patches 01:07:36.080 |
And so is the second the third the fourth and the last one 01:07:40.320 |
So they become contextualized in the sense that they capture information about the context in which they appear 01:07:46.400 |
Which is different from language models in which each token captures information about the previous tokens in the case of the vision transformer 01:07:54.560 |
Each patch includes information about all the other patches 01:08:01.440 |
Now let's see what each layer of this encoder is made up of. So we have, let's say, the input of the encoder 01:08:07.360 |
And we will have the first layer of this encoder 01:08:10.480 |
The first thing that we do is we apply a layer normalization and we saw how it works and why we use it 01:08:15.840 |
First of all, the input of this layer normalization 01:08:18.800 |
is saved for a skip connection that we use later 01:08:23.680 |
Then the output of this layer normalization is sent to the self-attention mechanism 01:08:28.260 |
It's this one here and this self-attention mechanism takes the output of the layer normalization as a query key and values 01:08:37.520 |
It calculates the attention just like the usual formula 01:08:40.000 |
So softmax of the query multiplied by the transpose of the keys, divided by the square root of d_k (the head dimension), multiplied by V, etc. etc. 01:08:46.000 |
The output of this self-attention is then summed up with this skip connection here 01:08:51.920 |
Then the output of this summation is sent to this layer normalization along with the skip connection that is used later 01:08:58.480 |
Then the output of the normalization is sent to this multi-layer perceptron, which is a list of linear layers 01:09:03.840 |
We will see later and then we do another summation here with the skip connection plus the output of the multi-layer perceptron 01:09:10.180 |
And then we do another layer like this, and another and another, and the output of the last layer is the output of our vision encoder 01:09:20.380 |
So, to recap: the vision transformer takes as input an image converted into patches. The patches are then fed to this 01:09:28.160 |
encoder, which is a list of layers, and the output is a contextualized sequence of patch embeddings 01:09:33.860 |
So let's code this encoder, which is basically this structure here 01:09:39.120 |
And we will code each part of this structure and while coding each part we will go inside on how it works 01:09:46.880 |
So the normalization we already know how it works, but we still have to explore what is this stuff here called the self-attention 01:09:52.580 |
What is this stuff here called multi-layer perceptron? 01:09:56.240 |
I believe it's convenient for us to go first through multi-layer perceptron and then we go to the self-attention 01:10:02.080 |
I think because the self-attention is a little longer to do. So let me do the simple part first 01:10:17.520 |
So the encoder layer: the constructor takes the configuration, 01:10:22.240 |
we save some stuff, which is the hidden size, and then we have a block called the self-attention block. 01:10:31.200 |
A note about the naming I'm using: I am using the same names as the HuggingFace implementation, 01:10:38.560 |
for one simple reason, which is that I want to be able to load the pre-trained weights from HuggingFace 01:10:44.240 |
So the pre-trained weights for PaliGemma are available on the HuggingFace hub, 01:10:51.680 |
but each of these pre-trained models has this dictionary of weights, 01:10:57.040 |
where the dictionary tells you where to load each of these weights 01:11:01.520 |
And if the names do not match you need to create some conversion script 01:11:04.720 |
So I didn't want to do that, and also it would just complicate the code uselessly. This way we can 01:11:12.240 |
load basically the pre-trained weights from HuggingFace 01:11:17.440 |
Also because my code is based on the HuggingFace implementation 01:11:20.480 |
So to create my code I use the HuggingFace implementation, but simplified a lot a lot a lot 01:11:25.680 |
For example, I remade my own KVCache. I did a lot of 01:11:29.040 |
Modifications to simplify it but it's based on the HuggingFace implementation 01:11:36.080 |
So we have this thing called the self-attention then we have a layer normalization. So we saw it's 01:11:40.400 |
Where is it? And we have this layer normalization here 01:11:43.360 |
Then we have this multi-layer perceptron, which is this stuff here. And then we have another layer normalization, which is this stuff here 01:11:49.920 |
So we have two layer normalization. So now let's implement the forward method 01:11:54.480 |
And the forward method I will copy it line by line so we can understand 01:11:58.960 |
Okay this forward method. Now. The first thing we do is we save a residual connection, which is 01:12:05.680 |
We basically save the input that we feed to this 01:12:09.260 |
Encoder because we need to reuse it later. So we are saving this skip connection because we will need to use it here later 01:12:14.860 |
Then we run it through the layer normalization the input 01:12:19.500 |
And it's done here. So the layer normalization does not change the shape of the input 01:12:25.020 |
It's just normalizing each of these dimensions such that they all come out 01:12:30.700 |
as if they came from a Gaussian of mean zero and variance of one 01:12:36.860 |
Then we apply this magic thing that we will explore later called the self-attention, and the self-attention returns a 01:12:44.700 |
tensor, but as we saw before, the attention mechanism is something that takes as input 01:12:50.140 |
Embeddings and gives you contextualized embeddings. So it does not change the shape of these embeddings 01:12:55.600 |
But we will implement it later. So for now just think of it as a black box that you feed in 01:13:00.700 |
Embeddings and it gives you contextualized embeddings 01:13:03.980 |
Then we have a residual connection and we can see that here. So this residual connection 01:13:14.060 |
So we are taking what we saved before with the output of the self-attention 01:13:18.300 |
So what we saved before is this residual stuff here plus the output of the self-attention, which is this hidden states here 01:13:23.740 |
This the result of the summation is saved again because there is another skip connection 01:13:31.580 |
I don't know why my alt tab is not working. So 01:13:36.380 |
This stuff here. So we save it because later we need to use it here for the skip connection 01:13:40.860 |
Then we do another layer normalization, which also does not change the shape of the input 01:13:52.060 |
And then we have this thing called the multilayer perceptron. Now the multilayer perceptron is something that 01:13:57.820 |
It's not easy to explain what it is used for, but basically 01:14:01.100 |
the multilayer perceptron, as we will see later, is a series of linear layers that 01:14:13.500 |
transforms each token independently from the others 01:14:17.820 |
So while in the self-attention there is kind of a mixing of the patches incoming so that you get contextualized 01:14:24.380 |
In the multilayer perceptron, there is no mixing between these let's call them tokens or patches 01:14:32.560 |
And the multilayer perceptron, first of all, adds parameters to the model, so the model has more 01:14:40.060 |
degrees of freedom to learn whatever it's trying to learn 01:14:46.380 |
Another objective of the multilayer perceptron is that it allows to prepare, 01:14:50.220 |
let's say, the sequence of patches for the next layer. So if the next layer expects these patches to be somehow 01:14:57.980 |
different, the multilayer perceptron allows to transform them 01:15:02.300 |
Also, it adds a non-linearity. So the multilayer perceptron also includes a non-linearity which adds 01:15:08.060 |
Which basically allow as you know non-linearities allow you to model more complex transformations 01:15:15.900 |
So if you just create a list of linear layers without any non-linearities, you cannot model complex functions, you cannot, for example, 01:15:24.300 |
map non-linearly separable data; but by adding 01:15:29.900 |
non-linear transformations you add complexity to the model, so the model is able to model complex transformations 01:15:38.400 |
So the multilayer perceptron just adds parameters and this non-linearity, which is helpful 01:15:45.420 |
to allow the model to learn whatever complexity it needs 01:15:52.620 |
After the multilayer perceptron, I guess we have a 01:15:57.740 |
Yeah, we have another skip connection and then we return the output of this skip connection here 01:16:04.140 |
and also the skip connection does not change the shape of the tensor 01:16:10.880 |
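Putting the walkthrough above together, this is roughly what the encoder layer's forward pass looks like. This is a condensed sketch using the HuggingFace-style sub-module names mentioned earlier (self_attn, layer_norm1, mlp, layer_norm2); the real method also handles the attention weights that self_attn returns.

```python
# Sketch of one encoder layer: pre-norm, attention, residual, pre-norm, MLP, residual.
# Shapes never change: [batch_size, num_patches, embed_dim] in and out.
def forward(self, hidden_states):
    residual = hidden_states                          # save for the first skip connection
    hidden_states = self.layer_norm1(hidden_states)
    hidden_states, _ = self.self_attn(hidden_states)  # contextualizes the patches
    hidden_states = residual + hidden_states          # first skip connection

    residual = hidden_states                          # save for the second skip connection
    hidden_states = self.layer_norm2(hidden_states)
    hidden_states = self.mlp(hidden_states)           # per-patch transformation, no mixing
    hidden_states = residual + hidden_states          # second skip connection
    return hidden_states
```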
Now, let's code first this multilayer perceptron. It's the easiest stuff to do 01:16:18.300 |
Let's go here. I I will also always copy first the 01:16:21.980 |
Constructor and then the forward method so we can explore a little bit the structure and then we explore the logic 01:16:27.660 |
So this multilayer perceptron just like in the vanilla transformer is made up of two layers 01:16:36.780 |
So the first layer takes each of the embeddings which are we we can also call them tokens or patches 01:16:43.820 |
Because most of the time we are dealing with language models and expands them 01:16:47.980 |
So each of these vectors which is of size hidden size is expanded into this thing called intermediate size 01:16:55.180 |
Usually it's chosen as three times the hidden size or four times the hidden size 01:17:00.380 |
I remember in the vanilla transformer it was four times the hidden size 01:17:03.260 |
Then we apply a non-linearity to this expanded tensor and then we compress it back to the hidden size dimension 01:17:17.420 |
So the first thing we do is we convert each of these embedded dimensions into intermediate sizes 01:17:26.060 |
Each image is made up of num_patches number of patches each of this patch is represented by a vector of size embedding dimension 01:17:33.420 |
With the first fully connected layer, we are expanding each of these patches into the intermediate size and then we apply 01:17:42.460 |
A non-linear transformation in this case. It's the gelu function now 01:17:46.380 |
You may be wondering why we are using the GELU function, or the SwiGLU function, or whatever non-linearity there is 01:17:58.540 |
There is no rule of thumb for choosing the non-linearity to use for a specific case 01:18:07.820 |
The heuristic is that initially, when the transformer was introduced, it used the ReLU function as non-linearity, 01:18:16.540 |
but then people explored other non-linearities and they saw that they work better 01:18:21.500 |
Now non-linearity is actually there is also some logic behind the choice of a non-linearity 01:18:25.980 |
So because the non-linearity define also the flow of the gradient 01:18:29.820 |
So for example, take the ReLU function; if you look at the graph of the ReLU function, let me draw it actually 01:18:36.940 |
The graph of the ReLU function is something like this. So 01:18:43.020 |
basically anything that is negative is zero. Let me use another color 01:18:49.100 |
Anything that is negative becomes zero, basically, and everything else is forwarded without any scaling 01:18:56.880 |
So this means that if the input of the ReLU function is negative the output will be zero, and actually for any 01:19:06.220 |
negative input there will be no gradient, because the gradient will be multiplied by zero. So it will not flow 01:19:10.860 |
That's why for example, we introduced the leaky relu and other like 01:19:18.060 |
Functions that allow also a little bit of gradient flow from the negative side 01:19:27.020 |
So the non-linearity defines how the gradient will flow during backpropagation. So having a non-linearity 01:19:27.020 |
that allows the gradient to flow back even when the input is negative 01:19:37.980 |
means that the model is not forced to always have positive activations in order to get some 01:19:46.860 |
feedback from the loss function to optimize its weights 01:19:49.900 |
And why are we using the GELU here? Because people have tried it and it probably works better 01:19:56.780 |
compared to the ReLU function for the same class of 01:20:00.140 |
applications. So in the vision transformer you see the GELU function, but 01:20:05.020 |
in LLaMA, for example, they use the SwiGLU function; in other scenarios 01:20:08.300 |
They use other functions and it's mostly based on heuristics on how they work in practice 01:20:13.980 |
also, because a model is usually made up of billions and billions 01:20:19.180 |
of parameters, it's not easy to find a regularity to understand why a 01:20:24.860 |
specific non-linearity works better than another one 01:20:30.380 |
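As a tiny illustration of the difference discussed above: ReLU hard-zeroes negative inputs, while GELU lets a small, smooth amount of signal (and therefore gradient) through:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 1.0, 3.0])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 3.0000])
print(F.gelu(x))  # roughly tensor([-0.0040, -0.1587, -0.0460, 0.0000, 0.8413, 2.9960])
```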
Now, okay, then we apply the second linear layer 01:20:33.980 |
Which is basically recompressing back this intermediate state into the embedding size and then we return it 01:20:47.340 |
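In code, that feed-forward block (expand, non-linearity, compress back) looks roughly like this; the sizes here are illustrative, and the real class reads hidden_size and intermediate_size from the configuration:

```python
import torch.nn as nn

# Sketch of the MLP block: each patch/token is transformed independently of the others.
class SiglipMLP(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)   # expand
        self.fc2 = nn.Linear(intermediate_size, hidden_size)   # compress back

    def forward(self, hidden_states):                          # [B, num_patches, hidden_size]
        hidden_states = self.fc1(hidden_states)                # [B, num_patches, intermediate_size]
        hidden_states = nn.functional.gelu(hidden_states, approximate="tanh")
        return self.fc2(hidden_states)                         # [B, num_patches, hidden_size]
```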
Next, we are going to code the attention mechanism for the vision transformer, and we will see that it's 01:20:53.340 |
different from the one of language models, because we don't have any causal mask or attention mask 01:20:59.980 |
All right guys, so we have seen the multilayer perceptron now 01:21:04.460 |
Let's go to the multi-head attention and for that 01:21:07.180 |
I want to use the slides because I believe it's a little faster to explain on the slides and then we proceed with the code 01:21:13.420 |
So what is the multi-head attention? The multi-head attention is a way of contextualizing stuff 01:21:19.420 |
Which means that you start with a sequence of for example patches and you can think we have for example 01:21:26.140 |
Four patches each of this patch is represented by a single vector of 1024 dimensions 01:21:32.620 |
So you need to think of this as a vector of 1024 dimensions, so you need to think there are 1024 numbers here 01:21:40.700 |
Then we have the patch number two the patch number three and the patch number four 01:21:44.700 |
Each of this patch was extracted from a group of pixels from the initial image and it's only representing information about the patch 01:21:51.980 |
It was extracted from so the part of the image it came from 01:21:56.300 |
With the multi-head attention mechanism, what we are doing is contextualizing these patches 01:22:03.820 |
Which means that the output of the multi-head attention is a tensor of the same size 01:22:08.300 |
As the input so this is a tensor of size 4 by 1024 01:22:12.480 |
the output will be a tensor of size 4 by 1024, but where each of these 01:22:19.260 |
Embeddings now does not capture information only about itself, but also about the other patches 01:22:27.820 |
This is for vision transformer for the language models we want something slightly different 01:22:34.220 |
So for language models, we do have an input sequence, which is a sequence of tokens each token representing one single 01:22:41.020 |
I don't want to use the term word because it's wrong but 01:22:44.780 |
In my videos, I always make the simplification that each token is a word and each word is a token 01:22:49.740 |
But this is not actually the case in tokenizers. Usually a token can be just any sequence of characters; 01:22:56.320 |
it does not necessarily need to be a word 01:23:01.660 |
But for us, let's treat them as words. It just simplifies the explanation 01:23:07.340 |
We have a list of tokens. Each token is represented as an embedding. Let's say of 1024 dimensions 01:23:17.400 |
1024 numbers for this one 1024 numbers for this one, etc, etc 01:23:21.720 |
The multi-head attention in the case of language models 01:23:25.480 |
What we want is we want to contextualize each token with the all the tokens that come before it 01:23:31.640 |
So the output of the multi-head attention in the case of language models 01:23:35.560 |
And this is this would be known as the self-attention mechanism with causal mask 01:23:43.160 |
Is a sequence with the same shape as the input sequence 01:23:47.320 |
So this vector this matrix here is a 4 by 1024. So the output will be 4 by 1024 01:23:53.180 |
And each of these tokens is not capturing information only about itself 01:24:00.120 |
But also about all the past tokens now the word I does not have any past token 01:24:04.920 |
So it will only capture information about itself 01:24:07.720 |
But the word love will capture information also about the token I because it comes before it and the word 01:24:13.160 |
Pepperoni will capture information about I and love because they come before it etc, etc until the last token which capture information about all the sentence 01:24:21.080 |
Why do we want to do this in language models? 01:24:25.160 |
Let me give you a little understanding of why we do it in this way with language models and why the transformer is made this way 01:24:35.480 |
This is going a little off topic with respect to the vision transformer 01:24:38.600 |
But I think if you understand this then you will understand the big part of the transformer and why it even exists 01:24:48.600 |
Now, with language models, you need to think that a language model is 01:24:53.640 |
something that we train on what is known as the next token prediction task 01:24:59.480 |
Which means that given a prompt the language model try to understand what is the next token that completes this prompt 01:25:05.560 |
How do we generate text with the language model? We start with some tokens, which are the prompt we generate the next token 01:25:11.480 |
We put it back into the prompt and we ask again the language model 01:25:14.120 |
What is the next token the language model gives us the next token? 01:25:16.680 |
Then we put it back into the prompt and then we ask again. What is the next token etc, etc 01:25:20.280 |
So we need to train a language model to train a language model 01:25:24.600 |
We need to train a model to predict the next token given the past tokens 01:25:29.320 |
And the transformer allow us to do that in parallel when training 01:25:35.000 |
Which means that we start with an input that is a series of embeddings 01:25:39.340 |
Which are uncontextualized so we start with this one and each of these actually is one single token. So this is only I this is only love 01:25:54.600 |
The output of the transformer, of the self-attention mechanism, will be a series of embeddings that are 01:26:04.420 |
contextualized in such a way that each token captures information not only about itself, but also about all the past tokens 01:26:11.240 |
How do we train and the transformer can do it in parallel? 01:26:14.840 |
So the self-attention mechanism will take this as input and generate this output in parallel 01:26:19.800 |
So it's not will generate one token at a time, but it will generate all of them in the in parallel using this multi-head attention 01:26:31.340 |
As we saw before the language model is something that given a prompt needs to predict the output. So what we want is that 01:26:43.020 |
This sentence here. We feed it to the transformer the transformer will transform it into a sequence of embeddings 01:26:49.340 |
Contextualized embedding and then we need some labels to train this language model 01:26:54.060 |
So what will the labels be? Well, we want that whenever the 01:27:12.620 |
language model sees the sequence "I love" it should predict the word "pepperoni" 01:27:16.640 |
Whenever it sees the sequence "I love pepperoni" it should predict "pizza" 01:27:26.620 |
Whenever it sees the sequence I love pepperoni pizza 01:27:29.820 |
It should predict the token end of sentence, which is a special token telling hey, I'm done with the generation 01:27:36.000 |
Because the transformer can generate all of these contextualized embeddings in parallel 01:27:41.820 |
we can also calculate the loss for each of these predictions in parallel and 01:27:46.300 |
with backpropagation update the weights of the model to teach it, in parallel, 01:27:53.020 |
how it should predict each of these tokens given 01:27:56.780 |
the previous tokens. So when we are given a sentence and we train the language model, the language model can learn, 01:28:05.820 |
with only one forward pass, how to predict the next token inside of this sentence given the previous tokens as context 01:28:13.180 |
In only one single pass of the transformer. That's why the transformer is so powerful because this contextualization happens in parallel 01:28:19.900 |
So we can calculate the output in parallel for each position 01:28:22.540 |
And because we know already know what is the label because the label is just the next token given the previous tokens 01:28:28.220 |
we can calculate the loss in parallel for each positions and the model will learn in parallel how to 01:28:33.180 |
Generate exactly this sentence in in one pass only 01:28:37.820 |
so the model will not learn to generate one token at a time given the previous but 01:28:43.100 |
All the sentence in one pass and that's why it's so powerful 01:28:49.740 |
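A sketch of how this parallel training objective is typically computed: one forward pass gives logits for every position, and the label for position t is simply the token at position t+1. Shapes and sizes below are made up for illustration.

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 5, 100
logits = torch.randn(batch_size, seq_len, vocab_size)      # one forward pass: a prediction per position
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Shift: position t must predict token t+1, so drop the last logit and the first label.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)   # [B*(T-1), vocab]
shift_labels = input_ids[:, 1:].reshape(-1)                # [B*(T-1)]

loss = F.cross_entropy(shift_logits, shift_labels)         # all positions trained in parallel
print(loss)
```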
Okay, so we have seen what is the difference between the vision transformer and the language model 01:28:54.220 |
So in the vision transformer, we want to contextualize tokens or patches 01:28:57.980 |
In such a way that they capture information about all the other patches 01:29:02.220 |
But in the language model, we want each token to only capture information about itself and the previous tokens 01:29:10.300 |
We start with of course an input sequence. Our goal is to create an output sequence that is contextualized 01:29:16.380 |
And there are many intermediate steps. So now we will see what are these intermediate steps one at a time 01:29:23.340 |
Let's start by creating the class of this this attention mechanism and we will create it. Let's create it here 01:29:29.900 |
Okay, so in the input we have the configuration of the model we save some stuff that we will need later 01:29:37.580 |
So the hidden size the number of attention heads because we are dealing with multi-head attention 01:29:43.660 |
Head dimension we will see later what is it and why it's used 01:29:47.020 |
The scale: if you remember, the formula for the attention is 01:29:51.260 |
the queries multiplied by the transpose of the keys, divided by the square root of d_k, the head dimension 01:29:57.340 |
And this scale is one over the square root of the head dimension, 01:29:59.740 |
so the stuff that we need to divide the query multiplied by the keys with 01:30:05.100 |
Then we have this dropout which is zero. I never saw it used in 01:30:10.780 |
In PaliGemma, but I believe there are other SigLIP models that use it, so they put it here 01:30:15.580 |
But it you can think of it like non-existent for now 01:30:19.180 |
and then we have these three linear layers called Wk, Wq and Wv, which are 01:30:25.580 |
Parameter matrices that are also present in the vanilla transformer 01:30:31.260 |
And then we have this output projection which in the paper of the transformer is called the wo matrix and we will see later 01:30:39.580 |
Let's start by implementing the forward. So the forward method is this one 01:30:46.060 |
Well, the input of the forward method of this attention mechanism is basically what 01:30:50.540 |
Is the output of the layer normalization in this encoder layer class 01:30:55.580 |
So the output of the layer normalization is fed to this self-attention mechanism 01:31:00.000 |
So it is something of this shape: batch size by num_patches by embedding dimension 01:31:08.380 |
What does it mean? It means that we have a batch of images, each made up of num_patches patches, 01:31:18.780 |
and each of these patches is represented by a vector of size embed_dim 01:31:24.700 |
You can think of it as a vector of 1024 dimensions. I don't remember the exact number of dimensions right now 01:31:30.940 |
You can also think of this num_patches as a sequence length 01:31:35.740 |
So before we saw that a language model is made up of a sequence of tokens; here you can think of it as a sequence of 01:31:40.940 |
patches, where the sequence length is this num_patches here 01:31:45.020 |
The first thing that we do in the self-attention mechanism is we take the input and we run it through three 01:31:52.060 |
Transformations one is called wq one is called wk and one is called wv and after we run it through these 01:31:58.140 |
Transformations the output will become query key and values 01:32:07.900 |
So we take the input sequence, which is this hidden states and we run it through wq here. It's called the qproj 01:32:14.620 |
Wk here is called the kproj w here is called vproj 01:32:19.020 |
The shape of the tensor does not change. Basically. These are parameter matrices 01:32:24.960 |
So they just add parameters to our self-attention that transform the input sequence so that they become query key and value 01:32:33.100 |
So it's the query key and value is just a transformation of the input sequence. However 01:32:37.740 |
In this case each token still is independent from the other 01:32:42.140 |
So there has been no contextualization happening with the linear layers. So linear layers always treat each token 01:32:47.500 |
Independently from the others just like the multi-layer perceptron each token in the multi-layer perceptron is expanded and then reduced 01:32:54.300 |
Here, it's not even not expanded nor reduced. It's just transformed because the size is from embedding dimension to embedding dimension 01:33:01.980 |
So it's just a transformation of the single token 01:33:04.780 |
Why we want to do it? Because the self-attention mechanism needs to see the same sequence in three different ways as query key and value 01:33:14.620 |
Later, we will see why they are called query key and values 01:33:17.820 |
The second thing we do is basically we split this each of these tokens into smaller tokens 01:33:28.540 |
How many smaller tokens based on how many heads we have and now we see why so let me do something strange 01:33:35.420 |
Which is i'm not copying the entire line. I'm copying a part of it 01:33:42.140 |
Which is a tensor of batch size numpatches embedding dimension and we are splitting the embeddim dimension into smaller 01:33:51.100 |
Called head dimension. How many of this head dimension we have? We have numheads 01:33:56.560 |
Okay, let me copy it all otherwise, I think it's going to be confusing. Sorry 01:34:02.080 |
We also have this transposition later. We will see how it works. We will visualize the tensor operations 01:34:09.040 |
We do it for the query the key and value, let's do it and then we see what is it about 01:34:24.000 |
So at the input of this vision transformer, we have a sequence of patches; you can think of it as a 01:34:37.600 |
sequence of tokens in case we are working with a language model, and each token is represented by a 1024-dimensional vector 01:34:44.720 |
The first thing that we do is we convert this input sequence, 01:34:48.640 |
which we will call x, into query, key and value, and we do it through three transformations, called Wq, Wk and Wv 01:34:59.380 |
Now if you look at the shape of the input sequence here, it's 4 by 1024 01:35:04.820 |
So here you can see the input sequence is 4 by 1024 01:35:08.260 |
Where 4 is representing the sequence dimension 01:35:12.320 |
So how many tokens or how many patches you have and the hidden size represents how many what is the size of this embedding vector? 01:35:19.760 |
We multiply it each of these with wq wk and wv 01:35:25.040 |
Now if you look at the dimensions here wq wk wv they are 01:35:29.360 |
The size is embedding dimension to embedding dimension. However here I have represented it as 01:35:35.040 |
embedding dimension to 8 multiplied by 128 so 01:35:40.800 |
The overall size is the same. So it's 1024 by 1024 01:35:44.340 |
However, i'm splitting this second 1024 into eight groups and later we will see why 01:35:54.640 |
So we are doing a matrix multiplication between this tensor here, 4 by 01:36:02.560 |
1024, and this other tensor, which is also 1024 by 1024 01:36:08.880 |
However in which the second dimension is split into sub 01:36:12.080 |
Groups, how many eight groups because eight is the number of heads we are going to work with 01:36:23.760 |
It will result in this output here. So basically it's a 4 by 01:36:27.680 |
1024 multiplied by a 1024 by 8 by 128; this inner 1024 dimension here cancels out, as you can see, 01:36:34.480 |
and then we have the second dimension that remains, so in the matrix multiplication the inner dimensions cancel out and the outer dimensions remain 01:36:44.880 |
If you are confused by this, you can think of it like this: it's still like a 1024 by 01:36:53.360 |
1024, nothing has changed. I'm just grouping the dimensions, so that's why it's possible 01:37:01.920 |
But it this grouping is helpful. And now we will see why 01:37:05.840 |
Let's visualize this tensor operation at the max matrix level 01:37:10.000 |
So when we do query this x multiplied by wq we have nx which is a 4 by 1024 01:37:22.480 |
And we are multiplying by a very big matrix, which is 1024 by 8 by 128. How to visualize this matrix? 01:37:30.080 |
Well, this is a wq. So it's a parameter matrix 01:37:33.360 |
It's also wq and wv. So they all have the same dimensions 01:37:37.780 |
You can visualize this like this. You can think of it as a matrix made up of 01:37:49.600 |
How many smaller vectors? 8 of them and each of these smaller vectors is made up of 128 dimensions 01:37:58.180 |
The overall size of this matrix is still 1024 by 1024 01:38:02.660 |
But each of these let's say these vectors are split into 8 groups 01:38:08.020 |
So that the output is also a matrix in which each of the 01:38:14.740 |
tokens is split into multiple subgroups. So it's a matrix that has 4 rows, 01:38:21.060 |
so as you can see, 4 is the number of rows, 01:38:24.900 |
and each row contains 8 groups of smaller embeddings, and each of these smaller embeddings is made up of 128 dimensions 01:38:36.260 |
With multi-head attention, basically, what do we want to do? 01:38:40.500 |
The multi-head attention is a way to relate tokens with each other 01:38:44.980 |
We don't want to relate tokens to each other by watching the full embedding of each token; we want to split this work among multiple heads, 01:38:56.420 |
such that each head works with a smaller part of the embedding of each token 01:39:02.020 |
So the head number 1 will only watch the first 128 dimensions of each token in the entire sequence 01:39:11.300 |
The head number 2 will watch the next group of 128 dimensions of each token, and so on 01:39:23.060 |
So this head will learn to relate all these tokens by only watching this part of the embedding of this each token 01:39:28.340 |
This head will learn to relate tokens by only watching this part of the embedding of each token 01:39:34.020 |
And this last head will learn to relate tokens by only watching the 01:39:34.020 |
last 128 dimensions of the embedding of each token. Why? 01:39:49.620 |
Because in many languages a word may have different meanings depending on the context in which it appears 01:39:56.180 |
If we don't have multi-head attention, because the multi-head attention, we will see it later, is based on what is known as the dot product between 01:40:11.300 |
tokens, then there is only one way of calculating the dot product between two tokens, 01:40:16.180 |
which is the full embedding of the first token with the full embedding of the second 01:40:21.060 |
So there is only one way of relating two tokens with each other 01:40:28.420 |
But by splitting each embedding into groups, each dedicated to one head (so this is head 1, head 2 and head 8, and all the intermediate heads are here), 01:40:36.820 |
each head will learn to relate tokens to each other differently, because each head is watching different parts of the embedding of each token 01:40:44.020 |
And this is useful for language modeling, for example, because in language modeling 01:40:51.380 |
Each word may have different meaning depending on the context in which appears 01:40:55.460 |
So it may be a noun in some context. It may be a verb in some other context or an adverb in some other context, etc 01:41:02.980 |
So we hope that this head here, for example learns to relate this token as a verb 01:41:08.020 |
This head here will learn to relate this token as a noun and this head here 01:41:12.820 |
Maybe will learn to relate this token as an adverb or some other property that this token has 01:41:17.700 |
And this multi-head attention also has another advantage 01:41:21.320 |
Because the multi-head attention is based on dot products between tokens 01:41:24.980 |
This head here will do the dot product of this first 128 dimensions of this token with the first 128 dimensions of this token 01:41:33.140 |
And this head because it watches this part of the token embedding and this other head watches this part of the 01:41:40.340 |
Embedding they can work independently from each other 01:41:44.020 |
And so because they can work independently from each other this computation can be parallelized 01:41:48.920 |
That's why in the attention is all you need paper when they talk about the multi-head attention. They make this 01:41:58.260 |
Drawing with multiple drawings behind you can see here with the head dimension appearing here, which means that each of this head 01:42:05.380 |
Is computing this scale dot product attention in parallel 01:42:10.120 |
With the other heads because each of them is working with a different part of the embedding of each token 01:42:15.860 |
So they can work independently from each other 01:42:17.860 |
And this is what we are doing here. So we group 01:42:22.100 |
the embedding of each token into multiple subgroups, 01:42:27.560 |
each dedicated to one head, because we want this multi-head attention to happen in parallel 01:42:33.500 |
Because each head is working with a different part of the embedding of each token, 01:42:40.600 |
the computation is much faster, because we can compute all this stuff in parallel 01:42:48.440 |
So we have taken our input sequence now here for the drawing. I have chosen a 4 by 1024 01:42:56.840 |
Depending on how many patches we have so numPatches by embedDimension 01:43:00.860 |
We have multiplied each of them by the Q K and V 01:43:05.000 |
And then we split them here as you can see in the 01:43:09.240 |
In multiple heads, so we add this head dimension here in my slide 01:43:15.560 |
I just pretend I am multiplying directly with a 01:43:19.080 |
Parameter matrix that is already split into multiple heads 01:43:23.240 |
Why am I doing differently here than compared to the code because we will be it will be useful for this 01:43:29.240 |
Visualizing it this way is will be useful for when we will be 01:43:32.920 |
Talking about the language model and especially we will be talking about grouped query attention 01:43:36.920 |
Because with grouped query attention, we will see that the number of heads for the query 01:43:40.600 |
Is much bigger than the number of heads for the keys and the values 01:43:45.240 |
So here in the vision transformer the number of heads of the query key and values is the same 01:43:49.560 |
So we don't use the grouped query attention and that's why 01:43:52.680 |
We use the same number of heads for the query key and values 01:43:55.480 |
Then we do this transposition and now we see what is this transposition 01:43:59.720 |
So when you do this multiplication here, so you multiply the input by the Q projection, it will return the same shape 01:44:08.600 |
When you do this view, it will just split this last dimension, so this embed_dim, into smaller parts 01:44:25.340 |
This dimension into these two smaller dimensions. So numHeads by headDimension 01:44:31.180 |
So basically, what is this headDimension? headDimension is the embedding full embedding divided by the number of heads 01:44:43.240 |
Then this will be 128 because it's 1024 divided by 8 01:44:53.080 |
Because we are not reducing the number of parameters or we are not throwing away anything 01:44:57.800 |
We are just grouping differently each of these embeddings 01:45:00.940 |
With this transpose here, we are exchanging the position of two 01:45:07.560 |
dimensions: the dimension number one and the dimension number two, which are num_patches and num_heads 01:45:14.780 |
So basically instead of num_patches and num_heads we get num_heads and num_patches 01:45:20.040 |
So this will be the output of all this expression. So it will be a tensor of this 01:45:25.880 |
Of this shape batchSize numHeads numPatches headDim. Why are we doing this transposition? Let's see 01:45:38.040 |
When we multiply by this wqwk and wv which is already includes the grouping. We are grouping each of these 01:45:44.360 |
Vectors into sub groups each dedicated to one head 01:45:49.880 |
Now what we have here is a sequence of tokens 01:45:53.080 |
Each token is made up of eight group of embeddings. Each group of embedding is made up of 128 dimensions 01:46:05.560 |
Multi head attention in parallel, which means that each head should be able to visualize 01:46:11.500 |
The entire sequence but a smaller part of the embedding of each token 01:46:17.800 |
We need to transpose these two dimensions. So we exchange the sequence dimension with the head dimension 01:46:30.120 |
Let's do it. So we have this sequence of tokens each token is 01:46:35.560 |
Divided into eight groups. Each group is made up of 128 dimensions. We want to convert it 01:46:43.320 |
into multiple sequences, each made up of only the part of the embedding dedicated to one head 01:46:49.480 |
So when you do the transposition of these two dimensions here 01:46:56.620 |
How can you visualize this matrix? You can visualize it as follows. It's a big matrix that contains eight smaller matrices 01:47:05.160 |
each smaller matrices contains four tokens and each token contains 01:47:10.180 |
128 dimensions, which is exactly the dimensions 01:47:13.720 |
that are dedicated to each of these heads. So you can think of it as eight sequences of four 01:47:24.760 |
tokens, and each token contains only the part of the embedding dedicated to the head it 01:47:33.880 |
belongs to. So this sequence here will only contain the first 128 dimensions of each token 01:47:40.440 |
This sequence here will contain the next 128 dimensions of each token 01:47:45.560 |
And the last sequence here will be a sequence of four tokens, and each token will be made up of the last 128 dimensions 01:47:54.600 |
Why are we doing this? Because now we can compute 01:47:59.720 |
The multi-head attention using this stuff here 01:48:02.040 |
Independently from this one independently from this one independently from this one 01:48:07.400 |
because each head has a sequence of four tokens and each token is made up of 128 dimensions 01:48:19.720 |
So we can compute this scaled dot-product attention using the query, key and values, where the query, key and values are not the entire 01:48:27.380 |
Embedding of the token but are only the part of the token dedicated to that specific head 01:48:32.660 |
So this head here suppose the head number one will be using the first 128 dimensions 01:48:38.180 |
This second head will be using the second 128 dimension. The last head will be using the last 128 dimensions, etc 01:48:45.460 |
So that's why we did this transposition: because now we can treat each head 01:48:53.200 |
independently. Each head is working with the four tokens, 01:48:57.200 |
Which is the sequence dimension and each token is made up of the part of the embedding dedicated to that head 01:49:06.240 |
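Here is a shape-only sketch of the steps described so far: the three projections, the view that splits each embedding into heads, and the transpose that swaps the sequence and head dimensions. The numbers 4, 1024 and 8 follow the slide, not necessarily the real SigLIP configuration.

```python
import torch
import torch.nn as nn

batch_size, num_patches, embed_dim, num_heads = 1, 4, 1024, 8
head_dim = embed_dim // num_heads                     # 128

hidden_states = torch.randn(batch_size, num_patches, embed_dim)
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)

# 1) Transform each token independently: shape stays [B, num_patches, embed_dim]
query_states = q_proj(hidden_states)
key_states = k_proj(hidden_states)
value_states = v_proj(hidden_states)

# 2) Split each embedding into num_heads groups of head_dim dimensions, then swap
#    num_patches and num_heads, so each head sees the full sequence but only its
#    own slice of every token's embedding.
query_states = query_states.view(batch_size, num_patches, num_heads, head_dim).transpose(1, 2)
key_states = key_states.view(batch_size, num_patches, num_heads, head_dim).transpose(1, 2)
value_states = value_states.view(batch_size, num_patches, num_heads, head_dim).transpose(1, 2)

print(query_states.shape)  # torch.Size([1, 8, 4, 128]) -> [B, num_heads, num_patches, head_dim]
```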
The next thing that we do in multi-head attention is, well, 01:49:13.440 |
we should do query multiplied by the transpose of the keys, divided by the square root of the head dimension 01:49:22.560 |
Let's calculate the attention weights, which is the query 01:49:29.440 |
multiplied by the transpose of the keys, where we are transposing the second and the third dimension, 01:49:36.320 |
that is the num_patches with the head dimension, because the query is batch_size, num_heads, num_patches, head_dim 01:49:50.240 |
So to multiply it we need it like this, 01:50:02.020 |
such that, if you remember, in the matrix multiplication the inner dimensions cancel out and the outer dimensions remain 01:50:20.080 |
So the head dimension will cancel out with this one and we will be left with num_patches by num_patches 01:50:25.540 |
So the output of this multi head attention basically, it's a matrix that is numPatches by numPatches for each head 01:50:35.760 |
So I know it's not easy to visualize it like this. So let's visualize it on the slides 01:50:41.440 |
So what we are doing is we are multiplying the query with the transpose of the keys 01:50:45.280 |
And then we are dividing by the square root of the head dimension, but we already computed this scale here 01:50:51.120 |
And because it's already one over the square root, we just multiply by it; we don't need to divide by it 01:50:58.080 |
So let's visualize in the slides how this multiplication works 01:51:03.120 |
Okay, we already saw why we do the multi-head attention: because we want to parallelize the computation, etc. So now, what each head is working with is 01:51:17.280 |
one sequence of embeddings, where each embedding is not the full embedding of the token 01:51:23.920 |
but a part of the embedding of each token. So it's a smaller embedding, let's say 01:51:27.920 |
So each head basically will do the following matrix multiplication when you do query multiplied by the transpose of the keys 01:51:38.560 |
And each token is not the full embedding of the token, but it's the first 128 dimensions of each token 01:51:45.440 |
When we do the transpose of the keys each of these row vectors becomes a column vector as you can see 01:51:52.080 |
And when we do this matrix multiplication for each head we will be getting this 01:52:04.280 |
Sequence by sequence because as you can see when you multiply this matrix here by this matrix here 01:52:09.180 |
You get four by four matrix as output because the inner dimensions cancel out 01:52:16.860 |
Each of these numbers represents the dot product of one token with another token 01:52:23.340 |
So you can think of the rows as being the queries and the columns as being the keys 01:52:30.060 |
This one here is the dot product of the first token of the queries with the first token of the keys. Suppose that each of these tokens represents a word: 01:52:40.380 |
then this is the word I, this is the word love, this is the word pepperoni and this is the word pizza 01:52:47.740 |
Then this number here represents the dot product of the word I with itself 01:52:59.340 |
This one here represents the dot product of the first query with the second key 01:53:05.340 |
This one represents the dot product of the first query with the third key 01:53:10.780 |
And we do all the possible dot products as you can see here 01:53:17.740 |
And what does this matrix represent? It represents somehow the relationship between two tokens 01:53:24.380 |
So the bigger the dot product the more intense is the relationship between two tokens 01:53:29.340 |
Actually, it's then defined later. We will see that we apply the softmax 01:53:33.120 |
But you can think of the dot product as being how the self-attention mechanism is relating to tokens 01:53:39.420 |
How intense is the relationship of these two tokens? 01:53:42.140 |
Why do we have this square root of the head dimension as the denominator? Because 01:53:49.340 |
we want to scale this dot product, because usually when you train a model you train multiple variants of it; for example, 01:53:58.860 |
imagine you train multiple variants of the model, and 01:54:08.940 |
you don't want the magnitude of these numbers to change between one try and the next one 01:54:14.220 |
So basically, by dividing by the square root of the head dimension, you keep the magnitude roughly constant 01:54:21.440 |
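As a shape sketch (same illustrative sizes as above), the attention weights are Q times K transposed, scaled by one over the square root of the head dimension:

```python
import torch

batch_size, num_heads, num_patches, head_dim = 1, 8, 4, 128
query_states = torch.randn(batch_size, num_heads, num_patches, head_dim)
key_states = torch.randn(batch_size, num_heads, num_patches, head_dim)

scale = head_dim ** -0.5                                   # 1 / sqrt(d_k), precomputed in the constructor
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * scale
print(attn_weights.shape)  # torch.Size([1, 8, 4, 4]) -> one num_patches x num_patches matrix per head
```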
Now, what is this matrix doing? This matrix tells us how two tokens are related to each other 01:54:29.440 |
Now in language modeling we also apply what is known as the attention mask 01:54:35.760 |
So we don't want the word I to be related to future tokens 01:54:40.000 |
So usually we don't want to compute this dot product 01:54:42.560 |
We don't want to compute this dot product and we don't want to compute this dot product, because we don't want the token I 01:54:47.520 |
to be related to any other token, because there are no previous tokens 01:54:51.440 |
We also don't want the word love to be related to the word the pepperoni and the pizza 01:54:56.560 |
Because they come after it. But we do want, of course, the word pepperoni to be related to the word love, 01:55:02.080 |
so there should be a number here; we don't want to mask out this one 01:55:10.960 |
So if we don't want some interaction between tokens to happen, 01:55:18.640 |
we compute query multiplied by the transpose of the keys as usual, and then we replace all the relationships that we don't want 01:55:24.640 |
with minus infinity. So here we can replace this number here with minus infinity, 01:55:29.620 |
here we can replace this number with minus infinity, and then we can replace this number with minus infinity too, 01:55:41.200 |
so that afterwards, when we apply the softmax, the softmax will convert each of these numbers into a number between zero and one, 01:55:51.840 |
because we want the relationship of one token with the other tokens to be 01:55:56.880 |
between zero and one, and also we want each row to sum to one 01:56:01.760 |
Later, we will see why because actually the when we do the contextualization we are doing a weighted sum, but okay 01:56:11.040 |
Anyway, the point is we apply the softmax row by row. So if we don't want the relationship of two tokens to be 01:56:19.440 |
considered by the attention mechanism, we replace that particular dot product with minus infinity before we apply the softmax 01:56:26.560 |
Because the softmax we saw before is an exponential 01:56:29.840 |
It's e to the power of x, and when x is minus infinity 01:56:34.000 |
It will become zero. So the output of the softmax will become zero for all the interaction that we didn't want 01:56:40.080 |
So that's why we replace it with minus infinity 01:56:50.800 |
So as you can see if we apply the mask before we apply the softmax 01:56:54.080 |
It will replace with zero all the interactions that we don't want 01:57:02.560 |
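To make this concrete, here is a minimal PyTorch sketch of what we just described (the shapes and tensors are illustrative, not the actual PaliGemma code): we scale the dot products, mask the future positions with minus infinity, and apply the softmax row by row.

```python
import math
import torch

seq_len, head_dim = 4, 128
q = torch.randn(seq_len, head_dim)   # queries of one attention head
k = torch.randn(seq_len, head_dim)   # keys of one attention head

scores = q @ k.T / math.sqrt(head_dim)                    # (4, 4): all pairwise dot products, scaled
causal_mask = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))   # hide the interactions we don't want
attn_weights = torch.softmax(scores, dim=-1)              # masked entries become 0, each row sums to 1
```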
This matrix here is known as attention weights 01:57:05.280 |
so it tells us how intense is the relationship between two tokens and 01:57:10.000 |
This matrix here is calculated independently for each single head because here I show you only one matrix here 4 by 128 01:57:25.840 |
So you need to think that you have eight of this matrix if you have eight attention heads 01:57:31.200 |
And in this case in the code, you can see that the output is a list of it's a batch 01:57:39.440 |
Each of these images is managed by multiple heads 01:57:42.960 |
Each of these heads will learn to relate tokens differently 01:57:46.720 |
So each of these heads will give us a numPatches by numPatches matrix or sequence by sequence matrix 01:57:52.000 |
Where each of this number represents how this head is relating two patches with each other 01:57:59.440 |
So now we have seen how to calculate this attention weights 01:58:02.240 |
Which basically it's a matrix that tells you how two tokens are related with each other 01:58:06.400 |
It's kind of a score of how related the attention mechanism thinks two tokens are 01:58:15.840 |
The first thing we do. Okay, we verify the dimension of this matrix 01:58:19.520 |
And then we apply the softmax the softmax as we saw before is a way to convert these attention scores into 01:58:28.560 |
Numbers that are between 0 and 1 and also such that they sum up to 1 01:58:32.560 |
And we do it by soft with the softmax function, which is applied by rows 01:58:41.200 |
What is the meaning of this dimension parameter which tells you how you want to apply it? 01:58:47.760 |
This dimension you can think of as the row dimension, and this one as the column dimension 01:58:52.960 |
So if you apply the softmax over the last dimension, that is over all the columns, it means you are applying it row by row 01:58:57.920 |
then we have the dropout but as I said before we don't use the dropout because 01:59:04.480 |
I didn't see it in the parameters of the polygamma ever being used. So we have it, but we don't use it 01:59:12.000 |
And as you remember the dropout basically takes random 01:59:15.920 |
With the probability p it will set some activations to zero 01:59:20.240 |
So some numbers of this input matrix to zero, but we don't use it 01:59:23.680 |
And it only happens during training and it's a way to reduce overfitting 01:59:30.180 |
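As a tiny check of what the dim parameter means and of how dropout behaves at inference (illustrative values, not the real code):

```python
import torch

scores = torch.randn(4, 4)                     # attention scores for one head
weights = torch.softmax(scores, dim=-1)        # dim=-1: normalize over the columns, i.e. row by row
print(weights.sum(dim=-1))                     # tensor([1., 1., 1., 1.])

dropout = torch.nn.Dropout(p=0.1)              # would zero entries with probability p during training
dropout.eval()                                 # in eval mode it just returns its input unchanged
print(torch.equal(dropout(weights), weights))  # True
```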
The next thing that we do in the multi-head attention is we are multiplying this attention weights matrix with the v sequence 01:59:39.940 |
So we multiply this matmul means matrix multiplication 01:59:43.000 |
We are multiplying this attention weights with the value states, which is the value sequence 01:59:48.500 |
which is a transformation of the input sequence through this Wv matrix, and which is also split by heads 02:00:01.860 |
so the output of the attention mechanism of the query multiplied by the keys is this matrix here where each number represents the 02:00:09.540 |
How two tokens are related to each other by applying the softmax this number become between zero and one in each row 02:00:16.020 |
And also in such a way that they sum up to one 02:00:18.740 |
So here you can see it's 1.0 because there is only one number here. It's 0.4 and 0.6 02:00:25.140 |
So they sum up to one and here is 0.2, 0.4, 0.4. So they sum up to one etc, etc 02:00:31.140 |
Now, when I say that these numbers represent the intensity of how the attention mechanism relates two tokens, it's because when we multiply 02:00:39.720 |
This matrix here, which is in the code is written as attention weights 02:00:45.220 |
We multiply it by the v matrix. So the v sequence for the value sequence 02:00:57.780 |
We are multiplying for example a 4 by 4 matrix by a 4 by 128 matrix 02:01:03.860 |
Where each of this v matrix is one for each attention head just like each of this matrix here 02:01:10.340 |
Attention weights is one for each attention head. So each of these attention heads will be doing this 02:01:15.060 |
Product in parallel. So each attention heads does query multiplied by the transpose of the keys in parallel the softmax in parallel 02:01:28.020 |
I mean not these operations in parallel. It's the attention heads that work in parallel. The operations are sequential, of course 02:01:41.780 |
Product: it's a 4 by 4 multiplied by a 4 by 128. So the output is a 4 by 128, because the inner dimensions cancel out and 02:01:41.780 |
So it will be a matrix with four tokens each token represented by not the full dimensions 02:02:01.140 |
But because we are working with multi-head attention each head will have a smaller part of the embedding of each token 02:02:07.460 |
So it will have 128 dimensions in case we have eight heads and the embedding dimension is 1024 02:02:16.900 |
Will be the dot product of the first row of this matrix with the first column of this matrix 02:02:32.180 |
which means that only this token here will contribute to the output here 02:02:41.220 |
So this stuff here will be the dot product of the first row of this matrix with the second column of this matrix 02:02:48.500 |
But most of the values here are zero except the first one 02:02:52.420 |
Which means that only this token here will contribute to this second number here 02:02:57.140 |
So all the dimensions in this row will be contributed only by the first token 02:03:05.120 |
The dimension of the first token multiplied by the number one 02:03:09.620 |
Because all the other tokens will be multiplied by zero zero and zero 02:03:14.900 |
Let's look at the second row of this matrix here this one here the first number 02:03:20.900 |
So the first dimension of the second row of the output 02:03:24.660 |
Matrix will be the dot product of the second row of this matrix with the first column 02:03:31.860 |
The first two numbers are non-zero and the second two numbers are zero 02:03:35.700 |
Which means that only the dimensions of the first two tokens will contribute to this output embedding 02:03:43.460 |
So for all the dimensions here will only be contributed by the first two tokens because all the other tokens 02:03:52.100 |
They will be multiplied by zeros. So they will not contribute to this output embedding 02:03:56.660 |
That's why we can say that this is a contextualized embedding 02:04:00.200 |
In which the contribution to this contextualization only comes from the first two tokens 02:04:06.740 |
How are they these two tokens contributing? Well each of these numbers in the second 02:04:12.580 |
Token will be multiplied by 0.4 and each of the number in the first token will be multiplied by 0.6 02:04:20.820 |
This you can see it as the first token contributing 02:04:23.880 |
60 percent of the information to this contextualization and the second token contributing 40 percent to it 02:04:37.300 |
So this output here the first number will be the dot product of this third row 02:04:42.740 |
Multiplied by this first column and as you can see here, we have a zero because of the causal mask 02:04:50.100 |
Which means that only the first three tokens will contribute to the third embedding here 02:04:55.220 |
How much each token will contribute? Well, it depends on how are these numbers distributed? 02:05:00.440 |
The first token will contribute 20 percent. The second token will contribute 40 percent and the third token will contribute also 40 percent 02:05:08.740 |
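Here is a small illustration of this weighted sum (the weights below mirror the example on the slide, except the last row, which is made up): multiplying the attention weights by the value matrix mixes only the unmasked tokens, in the proportions given by each row.

```python
import torch

attn_weights = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                             [0.6, 0.4, 0.0, 0.0],
                             [0.2, 0.4, 0.4, 0.0],
                             [0.1, 0.2, 0.3, 0.4]])   # rows sum to 1; the zeros come from the causal mask
v = torch.randn(4, 128)                               # one 128-dim value vector per token (for this head)
out = attn_weights @ v                                # (4, 128): row i is a weighted sum of value vectors 0..i
```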
So that's why, when we talk about the attention weights matrix, we say that it 02:05:18.500 |
Is telling us how intense the relationship between two tokens is, so that the more intense the relationship, the more that token will contribute to the output 02:05:33.120 |
For example, if the word I and the fourth token are very related to each other, then the embedding of the word I will contribute most to the output embedding of this fourth position 02:05:44.660 |
So it means that the fourth contextualized position will be, say, 40 percent based on the information 02:05:50.660 |
contained in the token I and 20 percent on the information contained in the word love, and so on 02:06:00.980 |
So this is why it's known as a weighted sum because you are 02:06:05.460 |
Summing the contribution of each token if it's not masked out 02:06:16.100 |
Calculated using the attention weights matrix here and we do this for each of this head in parallel 02:06:22.200 |
So each head is watching a part of the embedding of each token and it's learning to relate them differently and then doing this weighted 02:06:36.500 |
a list of contextualized embedding but each of this contextualized embedding will not be a full token 02:06:43.060 |
It will be part of what is the full token and now we'll 02:06:46.180 |
We see how we can merge the result of this multi-head attention 02:06:50.500 |
And for that we need to look at the original paper. So if you look at the original paper 02:06:54.660 |
We calculated this multi-head attention in parallel. And how can we merge the result of this multi-head attention? 02:07:02.340 |
Well, we go here and we basically concat these heads 02:07:06.820 |
So we take the output of the first head we concat it with the next we concat with the third head with the fourth 02:07:15.300 |
All the heads, until we get the full dimension of the original token back, because each head is made up of, 02:07:22.340 |
in our case, suppose 128 dimensions. So this will be the first 128 dimensions, then the next 128, and the third 128, etc. 02:07:30.100 |
Until the last 128 dimensions, so we get the 1024 dimensions back 02:07:44.180 |
Will return a contextualized embedding for each position, but it's a contextualized 02:07:53.940 |
That does not include all the original token contextualized but a part of it because each head is working 02:07:59.940 |
In parallel with a part of the embedding of each token, then we concatenate them. So 02:08:05.140 |
What we do is we basically we want to arrive to this stuff here. So we have a contextualized embedding 02:08:15.300 |
Okay, first we need to do I believe a transposition so we need to transpose back because before 02:08:25.300 |
We put the head dimension first and then the sequence dimension 02:08:29.460 |
So now we need again the sequence dimension and then the head dimension after 02:08:37.860 |
Which is for each head. We have a contextualized list of tokens 02:08:43.220 |
We want to get a list of tokens in which each 02:08:48.500 |
Head is contributing its 128 dimensions, which are contextualized 02:08:59.700 |
I believe it's here. So I think there is another checking of the output dimension 02:09:10.820 |
So we do this transposition back. So we did the first transposition here to exchange the 02:09:16.660 |
Number of heads with the sequence dimension. Now we transpose back 02:09:19.780 |
So we go back to the num_patches and num_heads 02:09:24.100 |
So it's a sequence where each token is made up of smaller groups of dimensions, one group per head 02:09:35.000 |
We do this contiguous because we want to reshape. Okay, it doesn't matter 02:09:40.260 |
You don't have to know why we do this contiguous, but basically 02:09:47.920 |
We want the tensor to represent the information in memory in a contiguous way, so that the next operation that we are going to do 02:09:57.920 |
Does not require any computation, because when you do a reshaping or a viewing or a transpose of a tensor 02:10:04.240 |
There is no change in the memory layout of the tensor 02:10:08.880 |
Actually, the PyTorch will just change what is known as the stride of the tensor 02:10:20.960 |
There is this thing called the stride which tells you how 02:10:24.080 |
To go from one dimension to the next without changing the layout of how this tensor is allocated in the memory 02:10:34.480 |
The PyTorch will just change these numbers on the stride. Okay 02:10:41.600 |
But anyway, but this contiguous allow us to have this tensor all in the memory as a contiguous memory allocation 02:10:48.160 |
So that this reshape operation can be done without any extra computation 02:11:02.240 |
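A quick illustration of the stride idea (an illustrative tensor, not the model code): a transpose only rewrites the stride metadata, and contiguous() materializes the new layout so that a later reshape is essentially free.

```python
import torch

x = torch.randn(2, 3, 4)
y = x.transpose(1, 2)             # no data is copied: only the shape/stride metadata changes
print(x.stride(), y.stride())     # (12, 4, 1) vs (12, 1, 4)
print(y.is_contiguous())          # False: the memory layout no longer matches the logical shape
y = y.contiguous()                # copies the data into a contiguous block of memory
z = y.reshape(2, 12)              # now the reshape is just a re-interpretation of that block
```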
So after we have to do a reshape, we did the transpose operation and now we need to do a reshape operation 02:11:07.520 |
So the transpose operation basically allow us to get again at the first dimension the sequence dimension 02:11:19.260 |
And each group contains 128 dimensions. Now, we need to concatenate them. How can we concatenate them? 02:11:25.680 |
Well, we just want to merge these heads again together into one single token 02:11:33.660 |
Reshape operation. So with reshape basically, we are going from (numHeads, headDim) to embedDim, which is numHeads times headDim 02:11:44.060 |
So how does it work? The reshape basically takes these 02:11:53.580 |
Groups and will just merge them. So it will just concatenate them with each other. So instead of being a 02:12:01.020 |
matrix that contains sub-arrays where each sub-array contains multiple sub-arrays and each of these 02:12:08.140 |
sub-sub-array contains 128 dimensions, it will just become a matrix that contains one array that is made up 02:12:15.900 |
1024 dimensions, which is the concatenation of all these heads 02:12:20.940 |
So this is how we merge the information of all this multi-head attention that was done in parallel into one single 02:12:30.780 |
Token that is a contextualized version of the initial token 02:12:34.460 |
So we as you can see we got back the initial shape 02:12:38.460 |
That we started with at the beginning of the multi-head attention. The next step is the 02:12:57.960 |
Multiplication with this WO. So if you look at this concatenation that we have done 02:13:03.000 |
The concatenation basically takes the this tensor this first token here 02:13:09.160 |
Is just the concatenation of the first 128 dimensions, which are the output of the first head then the second 128 dimension 02:13:16.760 |
Then the third 128 dimension and then the last 128 dimension. In total there are 1024 dimensions 02:13:23.880 |
But there has been no mixing between the result of these heads. So it's just a concatenation of multiple 02:13:33.800 |
Each calculation done by one head independently from the others 02:13:37.720 |
But we want the token to not be a concatenation of independent calculations 02:13:43.720 |
We also want to kind of mix the result of these heads with each other 02:13:48.600 |
And the mixing happens when you do this multiplication by WO. The WO matrix is a matrix that is 02:14:00.680 |
As you can see does not change the shape of the input. So we have 02:14:03.400 |
The input of this WO will be a 4 by 1024. We multiply by 1024 by 1024. So it results the same input shape 02:14:16.520 |
Let's look at this number here. This number here is the dot product of the first row 02:14:21.320 |
So the first token with the first column of this matrix 02:14:24.840 |
And the first column of this matrix is 1024 parameters. So the outputs of all of these heads 02:14:38.520 |
Will all participate in the same dot product, giving us one single number here 02:14:43.880 |
So there has been a mixing of the results of these heads. If we don't multiply with the WO, 02:14:49.800 |
There is no mixing between the result of each head which happened independently in parallel 02:14:57.400 |
So we don't want each token to be a contextualized version of multiple subtokens each calculated independently from each other by the multi-head attention 02:15:05.400 |
We want of course it to happen because we want to parallelize 02:15:08.760 |
But then we want to mix the result of this multi-head attention and we do that by multiplying by WO 02:15:16.620 |
For now, we just merge. So this reshape is basically doing the concat that we saw before in the attention paper 02:15:23.020 |
Now we do the multiplication with the WO which is this stuff here. So out projection 02:15:28.540 |
It won't change the shape of the tensor that is input to it 02:15:32.700 |
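Putting the last steps together, this is roughly what the tail of the multi-head attention looks like (a sketch with assumed shapes and names, not the exact code of the video): transpose back, make the tensor contiguous, reshape to concatenate the heads, then mix them with W_O.

```python
import torch
import torch.nn as nn

batch, num_heads, seq_len, head_dim = 1, 8, 4, 128
embed_dim = num_heads * head_dim                                  # 1024

attn_output = torch.randn(batch, num_heads, seq_len, head_dim)    # per-head weighted sums
attn_output = attn_output.transpose(1, 2).contiguous()            # (batch, seq_len, num_heads, head_dim)
attn_output = attn_output.reshape(batch, seq_len, embed_dim)      # concat the heads back to 1024 dims

out_proj = nn.Linear(embed_dim, embed_dim)    # W_O: mixes the independent results of the heads
attn_output = out_proj(attn_output)           # shape unchanged: (batch, seq_len, embed_dim)
```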
And then we return it along with the attention weights. Actually, we will not be using the attention weights 02:15:37.020 |
And now finally we have implemented the multi-head attention 02:15:43.900 |
We forgot to implement this encoder. So we created the layer of the encoder, but we didn't create the encoder itself 02:15:51.660 |
So what we created basically in this vision transformer is this stuff here. So let me open the slides 02:16:02.700 |
But we didn't create the sequence of these layers because an encoder is a sequence of these layers. So let's do it 02:16:08.620 |
It's it's very simple. So this is a single layer 02:16:11.900 |
But we need to create a sequence of them because we apply one after another such that the output of one is 02:16:17.180 |
Used as input for the next one. It's a very simple class. So let's create it 02:16:25.340 |
Constructor so it's just very simple. It's a okay 02:16:28.620 |
We save the configuration then each we create a sequence of layers where each layer is this encoder layer to which we pass the configuration 02:16:37.260 |
How many we create based on how many layers it should have so the transformer layers 02:16:41.820 |
And the forward is very simple. I can just copy it all. It's basically says, okay 02:16:48.780 |
We have the input we give the input to the first layer and the output of this layer becomes the input to the next one 02:16:55.820 |
So we do a for loop and then we return the the output of the last layer 02:17:00.380 |
This is a very simple and as you can see between each layer, there is no change in the shape of the tensor that is fed 02:17:10.380 |
To it. So with this, we have coded all of SigLip, which is our vision transformer 02:17:14.560 |
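This is roughly what that encoder class looks like (a sketch that assumes the SiglipEncoderLayer and the config with num_hidden_layers that we coded earlier):

```python
import torch.nn as nn

class SiglipEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # a stack of identical encoder layers
        self.layers = nn.ModuleList(
            [SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)]
        )

    def forward(self, inputs_embeds):
        # the output of each layer becomes the input of the next; the shape never changes
        hidden_states = inputs_embeds
        for encoder_layer in self.layers:
            hidden_states = encoder_layer(hidden_states)
        return hidden_states
```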
You may think that I have lied to you at the beginning, when we were talking about contrastive learning 02:17:21.820 |
Okay, actually, let's look at it, otherwise we will be left with this doubt. So 02:17:27.580 |
When we were talking about contrastive learning 02:17:29.580 |
We were talking about generating one single embedding for each image 02:17:35.180 |
But here we are generating a sequence of contextualized embedding 02:17:38.880 |
So how can the image generate one single embedding? 02:17:49.740 |
So you give it a list of patches as input and it will give you a sequence of contextualized patches as output 02:17:56.540 |
When working with something like clip, for example, if you want only one single embedding for each image 02:18:02.940 |
You can just take the first output contextualized embedding from the transformer as a representative for the whole image 02:18:09.820 |
Because it will force the model to put all the information in the first contextualized embedding 02:18:17.820 |
Another way is to just take the average of all the output embeddings by the transformer to generate one single embedding 02:18:24.540 |
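As a sketch of those two options (not something PaliGemma does, just the CLIP-style pooling we are discussing):

```python
import torch

patch_embeds = torch.randn(1, 16, 768)            # (batch, num_patches, embed_dim) from the vision transformer

first_token_embedding = patch_embeds[:, 0, :]     # option 1: use the first contextualized embedding as the summary
mean_pooled_embedding = patch_embeds.mean(dim=1)  # option 2: average all the contextualized embeddings
```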
Anyway, this was just a closing note before we move to the next part, which is our language model 02:18:30.620 |
So let's go back to the architecture, which is here 02:18:33.180 |
So we have coded this part here, the vision encoder. So we feed an image to it: 02:18:40.780 |
The vision encoder extracts some patches, each of these patches becomes an embedding, and to this embedding 02:18:47.100 |
We add a positional encoding which is learned 02:18:49.740 |
We send it to this magic box called the transformer layer, which will contextualize them 02:18:54.540 |
We take the output of this contextualization and this becomes our image features 02:19:02.140 |
Now before we can feed it to the language models 02:19:04.860 |
These embeddings may not be of the same size of the embeddings used by the text layer 02:19:10.300 |
So we will need to introduce this linear projection 02:19:14.460 |
So in the next part of the video, we are going to code the language model including this linear projection here 02:19:20.540 |
And we will learn how to merge these tokens the image tokens and the text tokens 02:19:28.540 |
So the next part that we are going to code is basically how to load the image from the disk to convert it into a tensor 02:19:37.500 |
And we will see that the preparation of the text has to be done in a particular way 02:19:43.660 |
Let's see actually why it has to be done in a particular way. So let's open the slides 02:19:48.220 |
Oops, I think I closed it. So let me open it again 02:19:54.060 |
So as you can see, we need to find a way to combine the image tokens with the text tokens 02:20:02.220 |
But we need to create some placeholders for where we will put the image 02:20:09.260 |
Tokens before the text token. So I will use the term image tokens and image embeddings interchangeably 02:20:15.760 |
because you can think of the image embeddings as kind of tokens that represents the image or and the 02:20:21.900 |
Text are the embeddings that represent the text that is the prompt from the user 02:20:26.700 |
so the first thing that we need to do is we need to learn how to load this image into a 02:20:32.300 |
tensor, because as you can see from our SigLip code, the input to SigLip is 02:20:39.020 |
A tensor that is has the channel the height and the width dimension 02:20:44.780 |
transformed into patches and contextualized, etc, etc 02:20:47.420 |
Then we need to tokenize the text. We need to create this list here 02:20:56.140 |
Each corresponding to the text tokens and then we will add some placeholders for where we will put the image tokens 02:21:03.500 |
and then it will be the transformer that will 02:21:08.300 |
Take these placeholders and replace it with the image. So 02:21:11.180 |
I know it's a lot of things to remember. So don't worry. Let's code it and we will see it step by step. So let's go 02:21:18.460 |
We create a new file called, let me check here processing 02:21:35.820 |
We create these two constants and later we will see why we need them 02:21:43.340 |
Okay, let's start from the beginning. So let's create this class called the polygamma processor 02:21:57.020 |
It will take as input the tokenizer how many image tokens? 02:22:02.460 |
We need to generate for the image and what is the image size that this particular gamma will work with 02:22:08.780 |
We save it. We save these two values and then what we do 02:22:13.660 |
We need to add some special tokens to our tokenizer. So now I show you why we need to do it and how it works 02:22:21.100 |
So the tokenizer that polygamma is using is the tokenizer of the gamma model 02:22:26.940 |
But the tokenizer of the gamma model was not created 02:22:30.320 |
With the special tokens for the image. So what they did was they basically created these additional tokens 02:22:43.640 |
So what we saw here in my slide is basically PaliGemma trying to extract information from an image 02:22:50.140 |
So we have an image, we have a prompt, and PaliGemma, which is basically the Gemma model here, is answering the 02:23:01.420 |
Prompt, using the image as additional information for the prompt 02:23:07.180 |
PaliGemma actually can do much more than this. PaliGemma can also do image segmentation, so it can 02:23:14.220 |
Segment a part of the image, for example this leg here 02:23:19.980 |
So it can detect all the instances of, for example, a tree 02:23:24.220 |
If we do object detection for trees, it will probably give us this this okay 02:23:29.020 |
This is not a bounding box this box here telling that this is a tree 02:23:32.380 |
If we do it ask it to detect all the feeds it will give us two 02:23:39.580 |
So PaliGemma can do a lot of this, and the way it does it is by using special tokens 02:23:46.380 |
For the segmentation they are called segmentation tokens, and for object detection they are called location tokens 02:23:53.580 |
And but we will not be using them. So our goal here is just to inference polygamma 02:23:59.340 |
So we will not be working with the object detection or object segmentation 02:24:03.120 |
But if you want more information on how these tokens work, there is a very nice article 02:24:08.940 |
Not only this one from google. So here in google they say 02:24:12.300 |
That polygamma uses the gamma tokenizer, but they extend it with these further tokens that are used to tell 02:24:19.580 |
In the output of the model, where is the segments? 02:24:23.420 |
where is the bounding box position that it has detected or where is the 02:24:29.580 |
Of the segmentation mask that the model has detected 02:24:33.980 |
Another article that I recommend is the hugging face blog article about 02:24:37.980 |
Polygamma, let me find it. I believe it is this one here 02:24:43.180 |
In which they describe how this attention masks work 02:24:47.100 |
So as you can see, PaliGemma can detect the cat and will give us this output, which includes these loc tokens 02:24:57.100 |
Where these numbers, 0094, 0256, tell us the position of the top left, 02:25:03.820 |
Top right, bottom left and bottom right corners of this bounding box here 02:25:11.900 |
Here because we are only interested in using the polygamma as a conditional model for generating an output 02:25:24.540 |
Used by polygamma is adding these special tokens 02:25:27.740 |
We also add them here and how to add them and how many to add them is described in this article 02:25:34.700 |
1024 location tokens for object detection and then 128 tokens for object segmentation 02:25:48.860 |
We also need to create this constant called image token, because when we 02:25:57.880 |
Process our text with the Gemma tokenizer, the Gemma tokenizer will of course only generate 02:26:04.220 |
The tokens for the text, but later we need to also insert among these tokens the image tokens 02:26:11.820 |
So what we do basically is we insert some placeholder tokens that will later be replaced with the embeddings 02:26:19.760 |
Extracted by the visual encoder, and the placeholder token that we will be using is this image token here 02:26:34.300 |
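For reference, this is roughly how those extra tokens are added to the tokenizer (a sketch following the description above; `tokenizer` is assumed to be the Gemma tokenizer passed to the processor, and the exact token strings and counts follow the PaliGemma/Hugging Face write-ups):

```python
IMAGE_TOKEN = "<image>"   # placeholder that will later be replaced by the image embeddings

# register the image placeholder as a special token
tokenizer.add_special_tokens({"additional_special_tokens": [IMAGE_TOKEN]})

# 1024 location tokens (object detection) + 128 segmentation tokens (not used for plain inference)
EXTRA_TOKENS = [f"<loc{i:04d}>" for i in range(1024)]
EXTRA_TOKENS += [f"<seg{i:03d}>" for i in range(128)]
tokenizer.add_tokens(EXTRA_TOKENS)

image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)

# we will add BOS/EOS ourselves when we build the prompt
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
```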
Now how to use this polygamma processor. So the polygamma processor is a special class that given an 02:26:40.780 |
Text which is the prompt of the user and an image will load the image 02:26:46.840 |
Preprocess it, so resize it, rescale it, whatever the vision model needs, and it will also return the 02:26:54.520 |
Text tokens with the placeholders for the image tokens. So let's do it 02:27:02.120 |
Method here, the __call__ method. Why do we create the __call__ method? Well, basically this allows the instance to be called like a function 02:27:12.840 |
So when you create the processor, we will create it like this, like PaliGemmaProcessor, and then we can use it like this 02:27:19.720 |
Passing the arguments here. So this is why we implement the __call__ method 02:27:24.040 |
And the call method takes as input a list of text and the list of images 02:27:28.600 |
but we will actually only accept one text and one images because I don't want to deal with the 02:27:34.200 |
Padding otherwise, it will complicate our code. Our goal is not to make it universally 02:27:39.420 |
Perfect. Our goal is to learn by doing and how it works. Actually, this is this code will be usable 02:27:45.160 |
So we will actually run the inference later, but it will only work with one image and one prompt at a time 02:27:52.760 |
I will try to make the code for fine-tuning this model 02:27:55.400 |
And we will see that we will change this code a little bit to to accommodate for the padding 02:28:02.680 |
Anyway, we need to process these images and we will use a special method called process images 02:28:09.400 |
So if we take each of these images and we need to resize it 02:28:12.920 |
We resize it to the image size that is accepted by this polygamma version. So 02:28:20.680 |
Actually there are multiple sizes of PaliGemma, but this 224 version only resizes the images to the size 02:28:20.680 |
224 by 224 and generates 256 image tokens for each image 02:28:28.100 |
then we rescale this image and later we will see why we do it and then we 02:28:38.100 |
We normalize it using the mean and the standard deviation of ImageNet 02:28:43.540 |
It's not really the ImageNet mean and standard deviation, but later we will see how it works 02:28:47.620 |
Anyway, suppose that this method here will load the image will rescale it will normalize it etc and convert it into 02:28:57.460 |
A tensor that can be then processed by the vision model 02:29:04.980 |
We create here a tensor. So because this will 02:29:08.020 |
Return a list of tensor. We need to create a one single tensor with the batch size 02:29:13.540 |
So we stack them stack basically means that if we have a list of tensor, it will create one single big tensor 02:29:25.220 |
So instead of becoming a list of tensor it will become one big tensor 02:29:28.500 |
This is a NumPy tensor it is converted into a torch tensor 02:29:37.620 |
Create the input to the model. So later we will expand this method. So now I just create them 02:29:44.020 |
What is this method going to do? Well, this method is going to 02:29:49.780 |
Let's check here. It's going to create the tokens of the text and create the placeholder for the image tokens 02:29:59.620 |
We tokenize it using the placeholder tokens for the image 02:30:08.100 |
This stuff I know that I have copied a lot of code. Now, I will explain it one by one 02:30:13.460 |
So let's start at input. We have a list of text and the list of images. Let's process these images 02:30:26.980 |
Okay, the process image takes as input a list of images what is the size that we want of these images 02:30:35.700 |
What is the kind of resampling that we want to do when resizing this image? It can be linear, it can be bicubic, etc. 02:28:44.160 |
Rescale factor if we want to rescale this image and 02:30:50.580 |
And this has the same meaning as the normalization that we do in the neural networks. So we want the 02:30:57.280 |
The image no matter what it represents to always have the same distribution more or less 02:31:05.680 |
And the way we do it is basically we take the image 02:31:10.000 |
Values so the tensor we subtract the mean of all the images that we have in our data set 02:31:16.240 |
And usually we use the mean of the image net data set and the standard deviation of the image 02:31:24.480 |
I don't know why in the Hugging Face implementation they use 0.5, because the actual ImageNet values are not really 0.5 02:31:28.880 |
Each of these numbers is very close to 0.5, but not exactly, so maybe it works anyway 02:31:35.920 |
And we have one for each channel of the image. So one for R, one for G and one for B 02:31:44.000 |
What is this function going to do? First it resizes the image by using this resampling method 02:31:47.760 |
Then it will convert the image into a numpy array 02:31:50.400 |
Then it will rescale it so that the pixel values instead of being between 0 and 255 will be between 0 and 1 02:31:57.440 |
Then it will normalize using the mean and the standard deviation of image net 02:32:01.280 |
And then it will move the channel dimension to be the first dimension. So 02:32:06.320 |
Instead of being a height width channel, it will become channel height width 02:32:11.120 |
Let's implement this very simple method. So there is first the resize 02:32:16.980 |
The resize is just going to resize the image using the 02:32:31.520 |
So it will take the image and it will resize it using this resampling method 02:32:41.360 |
The rescale is just going to rescale the image 02:32:43.680 |
So it will convert each pixel value instead of being between 0 and 255. It will rescale it into 02:32:49.920 |
Between 0 and 1. Why? Because as you can see here, we pass a scale factor of 1 over 02:32:56.060 |
255. So that's why we are multiplying it by this scale 02:33:01.420 |
The next thing that we are doing is normalizing 02:33:04.800 |
normalizing means that we want the each of these values to be 02:33:09.340 |
distributed like it's coming from a Gaussian of mean 0 and variance of 1 and we do it by 02:33:14.380 |
Subtracting the mean and dividing by the standard deviation as you can see here 02:33:22.140 |
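Condensed into one function, the pipeline just described might look like this (a sketch with assumed names; the 0.5 mean and standard deviation follow the values mentioned above):

```python
import numpy as np
from PIL import Image

IMAGENET_STANDARD_MEAN = np.array([0.5, 0.5, 0.5])
IMAGENET_STANDARD_STD = np.array([0.5, 0.5, 0.5])

def preprocess_image(image: Image.Image, size: int = 224) -> np.ndarray:
    image = image.resize((size, size), resample=Image.Resampling.BICUBIC)  # 1. resize
    pixels = np.asarray(image, dtype=np.float32) * (1.0 / 255.0)           # 2. rescale to [0, 1]
    pixels = (pixels - IMAGENET_STANDARD_MEAN) / IMAGENET_STANDARD_STD     # 3. normalize
    return pixels.transpose(2, 0, 1)                                       # 4. HWC -> CHW
```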
I believe we have already implemented everything for the process images 02:33:25.980 |
Now, let's go further. So we have these images we are processing them. So they are still a list of images 02:33:32.700 |
We convert them into they are converted into a list of numpy arrays and we do that here 02:33:38.780 |
As you can see first we convert them into numpy arrays then we rescale, normalize, transpose 02:33:46.860 |
This list of numpy arrays is converted into a single tensor instead of being a list of tensor is becoming one big tensor 02:33:53.260 |
And then we convert it into a torch tensor. This torch tensor 02:33:57.900 |
Is the pixel values that will be fed to the model to the image encoder 02:34:06.060 |
And we need to tokenize it but we need to tokenize it by already accommodating for the position in which we will put the image 02:34:15.980 |
And we do that by processing this each of this text through this function called add image tokens to prompt which as the name implies 02:34:23.260 |
We'll add this image token placeholders to the prompt 02:34:34.300 |
Save it here. It's a long comment because I found a little bug in this one, but okay later I explain to you 02:34:40.300 |
But basically we add some image token placeholders. How many of them? Well, depending on how many image 02:34:46.540 |
Tokens this model needs: in the case of PaliGemma 224, we need 128... 02:34:55.420 |
Oh no, this is the number of text tokens; for the image tokens I think it's 256, if I remember correctly 02:35:01.120 |
Later we can check. I think it's in the config.json. Let's go here 02:35:14.460 |
Then we add the beginning of sentence token and then we add the prompt of the user. It's called the prefix prompt 02:35:20.700 |
How did I come up with this function? I didn't come up with it, I copied it from the 02:35:26.860 |
Hugging Face implementation. But how did Hugging Face come up with this, actually? 02:35:34.300 |
So if we go to the polygama paper, let's go here 02:35:40.780 |
Here they show you how to prepare the input for the gamma model 02:35:47.740 |
Then we have the prompt of the user that tells us what the language model needs to do with these images 02:35:54.380 |
So if as you saw the example before in in the introduction 02:36:00.380 |
So we want the language model to tell us where is the photographer resting by looking at this image and the model will generate this output 02:36:09.900 |
So this is the prefix and the prefix is built by first taking okay 02:36:14.460 |
We take the image tokens and we are adding them here and based on how many this model particular size of polygama needs 02:36:21.580 |
then we have the beginning of sentence token and this one then we have the tokens of the 02:36:27.260 |
Prefix, which is the task that we want the language model to perform 02:36:32.140 |
And then we have a separator the separator token is a slash n. So it's the new line 02:36:41.740 |
So we have this beginning of sentence token. So then we have the token the the task 02:36:47.100 |
The the prompt by the user based on what task we want the language model to do and then we have the separator token 02:36:54.380 |
Which is a \n. Now, in the paper they say that they tokenize the \n separately, 02:37:02.540 |
so the \n needs to be tokenized separately from the rest of the 02:37:06.940 |
Input, because we don't want the \n to be merged 02:37:13.260 |
With the prompt by the tokenizer. So, as you know, the tokenizer will convert a sequence of 02:37:19.580 |
Characters into tokens and if in the dictionary of the 02:37:27.100 |
Language model there is a token that already contains the \n. Suppose that we ask the language model: tell me, where is the photographer 02:37:37.900 |
and then we have this new line. Suppose that in the vocabulary of the 02:37:43.260 |
Language model there is a token that is made of some characters 02:37:48.940 |
Followed by \n; then it would become one single token 02:37:52.860 |
So suppose that this one becomes the token number three, and then there is another token, say " photog", 02:37:58.620 |
Which becomes the token number five, and then another token which is the token number six, etc. 02:38:05.180 |
So we don't want the \n to be merged with whatever comes before it 02:38:09.340 |
So in the paper they recommend to tokenize it separately. That's why I wrote this 02:38:14.860 |
Comment here to note that it should be tokenized separately, but I don't know why in the Hugging Face implementation they do it this way 02:38:23.580 |
It could be a bug or it could be some other indication that I am missing 02:38:27.660 |
So I just note it for now; later I will investigate and probably ping the Hugging Face team 02:38:31.900 |
But for now, we just need to think how we prepare the input 02:38:35.500 |
So the input is prepared like this a number of input image tokens 02:38:39.500 |
What is each of this image token? It's this placeholder token that we created here this image token 02:38:46.940 |
how many of them depending on the size of the model and we have this beginning of sentence token and then we have the 02:38:52.220 |
Prefix the prompt of the user and then we have the slash n. We take all of this and we tokenize it 02:39:02.380 |
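So the prompt template boils down to something like this (a sketch mirroring the helper just described; note that the \n is appended here, even though the paper suggests tokenizing it separately, as discussed above):

```python
def add_image_tokens_to_prompt(prefix_prompt: str, bos_token: str,
                               image_seq_len: int, image_token: str) -> str:
    # [<image> repeated N times] + BOS + user prompt + "\n" separator
    return f"{image_token * image_seq_len}{bos_token}{prefix_prompt}\n"
```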
And we return this stuff here. So we return this input 02:39:05.740 |
Which is the input IDs and the attention mask that will be generated by the tokenizer 02:39:11.200 |
In this case, we are not using any padding. So the attention mask will be just a list of ones 02:39:16.060 |
So what is the input IDs? As you remember tokenizer converts the text into 02:39:21.580 |
A list of numbers where each number represents the position in the vocabulary of each token 02:39:27.020 |
So these are not embeddings. These are just input IDs 02:39:30.700 |
So it's a list of numbers where each number represents the token position in the vocabulary 02:39:35.440 |
So imagine our vocabulary is made up of words 02:39:38.940 |
So the word hello the sentence hello world may be tokenized as follows 02:39:47.100 |
It may be tokenized as a list of two tokens, for example, three tokens 02:39:52.300 |
For example, the first one corresponding to the word hello 02:39:54.860 |
Then the one corresponding to the space and then one corresponding to the word world 02:39:59.980 |
Suppose it's the token number nine. So these are called input IDs. So it's not an embedding 02:40:07.740 |
Then by the embedding layer, this will be converted into embeddings, which will be one 02:40:14.540 |
Vector for each token. So with the suppose 1024 dimensions 02:40:23.180 |
1024 dimensions then for the second token another 1024 dimensions, etc, etc, etc 02:40:29.340 |
So this is how we prepare the input. So for now, we have resized the image converted into a tensor 02:40:35.740 |
Then we have taken our prompt. We have added some placeholder tokens for the image then we have 02:40:41.980 |
Added the prompt of the user and then the \n character, as indicated by the PaliGemma paper 02:40:47.440 |
And now our processor will return this stuff. Now, we need to understand what to do with this stuff 02:40:53.500 |
So we need to code our language model. All right guys, so let's continue our journey by creating another file here called 02:41:03.160 |
Which will be our language model. So the language model that will decode the answer of the 02:41:11.740 |
Using the prompt or given by the user and the image that we have provided as input 02:41:16.300 |
So we create this file. We import a little bit of stuff the usual stuff 02:41:21.740 |
So torch, some math, typing, and then we import the SigLip model that we have created before, so the vision model and the configuration that it needs 02:41:29.580 |
Let's do a top-down approach, which means that we first create the structure of the model and then we create each single component 02:41:46.720 |
Our main class will be called PaliGemmaForConditionalGeneration 02:41:52.720 |
So why it's called conditional generation? Because we are conditioning the generation of text on the image that is provided as input 02:41:59.680 |
This is why it's called conditional generation 02:42:04.240 |
Also because of how we create the attention mask, which we will see later: we are attending to all the tokens of the 02:42:09.760 |
prompt of the user and all the tokens of the image 02:42:14.400 |
Without any causality so it's used like a condition, but we will see that later. So 02:42:20.640 |
The constructor accepts a configuration file, which we are going to create now 02:42:24.960 |
It will create an instance of the vision model. So the encoder of the image it will create this multi-modal projector 02:42:32.000 |
Which is a linear layer. Let's actually visualize it all these components 02:42:35.940 |
So we go here and then we open this stuff. So basically the multi-modal projector is this 02:42:43.840 |
linear layer you can see here linear projection 02:42:50.960 |
Contrastive vision encoder, and then we have GemmaForCausalLM, which is our transformer decoder 02:42:58.000 |
So this class, PaliGemmaForConditionalGeneration, is actually the class that will 02:43:02.080 |
Connect all these components together 02:43:05.760 |
I don't know why my pen is not working my ipad pen 02:43:15.520 |
All right, so we have created this it will create an instance of the language model 02:43:20.880 |
It will save some stuff like: what is the language model, what is the vision tower (the image encoder), and what is the multi-modal projector, 02:43:28.720 |
which is the linear layer that will convert the size of the embeddings output by the 02:43:32.720 |
Vision encoder into the size of the embedding of each text token, so that they can be concatenated together 02:43:43.200 |
We need to create another method called tie weights and we will see later what is this about 02:43:51.200 |
Or actually we can check now what this is about 02:43:55.280 |
so tie weights basically means this so let's go back to our 02:43:59.440 |
Here and let's open the attention mechanism. And actually let's open the transformer model 02:44:14.080 |
And specifically in the case of language model most language models are in decoder only language model 02:44:19.600 |
Which means that they are only made up of this part of the transformer without the cross attention 02:44:27.840 |
So it's they are made up of a self-attention with the normalization then a feed forward with the normalization a lot of layers like this 02:44:35.840 |
so one after another then we have a final linear layer that projects the embedding output by these layers into 02:44:42.800 |
Logits, and then we have the softmax to understand which of these tokens has the maximum 02:44:48.540 |
Probability score given by the language model 02:44:52.540 |
the job of this linear layer is basically to convert the embedding of the 02:44:56.940 |
Contextualized embedding output by the last layer of this series of layers 02:45:02.060 |
Into the vocabulary size, which is exactly the opposite of the job that this embedding 02:45:07.500 |
Layer is doing. So the embedding layer is converting the token ids, 02:45:14.140 |
So the position of each token in the vocabulary into an embedding while this 02:45:18.300 |
Linear layer here is doing exactly the opposite converting an embedding into its position in the vocabulary 02:45:30.300 |
So we can use what is known as weight tying, which basically shares the parameters of this layer and this layer, because they are doing basically one the inverse job of the other 02:45:37.580 |
Which is also a technique actually to reduce the total parameters of the model because if you are sharing these parameters 02:45:47.900 |
And in many language models this depending on the vocabulary size 02:45:51.180 |
These parameters can be actually quite expensive on the overall total number of parameters of the model 02:45:56.220 |
So it could be like 10% of the parameters in this layer here 02:45:59.180 |
So if you are sharing them, you are actually reducing the number of parameters 02:46:03.020 |
Let's say by 10%, depending on how many tokens you have in the vocabulary 02:46:08.300 |
So we created this method here tie weight and later we will implement it also in the language model 02:46:15.740 |
That will tie the weights of these two layers 02:46:18.780 |
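In code, weight tying is just one assignment (a sketch; the attribute names self.model.embed_tokens and self.lm_head are the ones we will assume when we code Gemma):

```python
def tie_weights(self):
    # the output projection (hidden_size -> vocab_size) reuses the parameters of the
    # input embedding matrix (vocab_size -> hidden_size), since one does the inverse job of the other
    self.lm_head.weight = self.model.embed_tokens.weight
```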
Okay, now that we have seen also this one. Let's go further, which is the implementation of the forward method. So 02:46:26.060 |
So we implemented the forward method as follows so it accepts the input ids 02:46:32.940 |
What are the input ids? The input ids will be the ones produced by this processor, so they will contain 02:46:42.920 |
Some image tokens. So a lot of tokens like this: image, image, image, image 02:46:47.720 |
How many depending on the size of polygama we are using? 02:46:50.600 |
Then it will contain a beginning of sentence token. Then it will contain the prompt of the user 02:46:56.840 |
So for example, tell me where is this photographer and then a new line 02:47:01.560 |
Character the token corresponding to the new line character 02:47:07.800 |
Yeah, text, okay, so, then we have the pixel values which is the 02:47:12.200 |
Again the image loaded by this PaliGemma processor, which is the image 02:47:23.300 |
Rescaled and normalized using this ImageNet standard mean and standard deviation 02:47:31.020 |
It is converted into a tensor and then provided as is 02:47:37.480 |
Then the goal of this PaliGemmaForConditionalGeneration will be to take this image and feed it to the image encoder to get the features extracted 02:47:45.640 |
Then we have this attention mask. The attention mask is provided directly by the tokenizer 02:47:49.880 |
So whenever you tokenize text using a tokenizer, it gives you two output. One is the input ids and one is the attention mask 02:47:55.880 |
Because we will not be using any padding, the attention mask will be a series of ones 02:48:00.360 |
Later, we will see how we also need to modify the attention mask 02:48:05.640 |
But actually we will not be modifying because we will not be using any padding so 02:48:09.800 |
Yeah, then we have the KV-Cache, which we will talk about later when we actually use it 02:48:14.920 |
So for now just consider it as something that you don't know anything about and later we will discuss 02:48:27.880 |
We have first we make sure that we are not using any padding because I didn't implement the code to manage the padding 02:48:34.440 |
Then we extract the input embeddings of the text tokens and the image placeholder tokens 02:48:40.200 |
So in the language model, we have added a fictional token, this image placeholder token, 02:48:47.640 |
Which will be converted into an input id so it will be converted into a number which corresponds to its position in the vocabulary 02:48:53.980 |
What we are doing is we are converting all the input tokens 02:48:58.520 |
Which are the image tokens the beginning of sentence token the tokens of the prompt plus the new line character 02:49:06.920 |
of course the embeddings produced by the image placeholder tokens will be 02:49:10.280 |
Junk because we will not be using them because they do not correspond to the actual image features 02:49:14.920 |
But later we will replace them inside of this one with the correct one 02:49:19.400 |
so now we have this input embeddings the first thing we do is we 02:49:24.200 |
Extract the features of the image and we do it like this 02:49:27.320 |
So we feed the pixel values of the image, which is a tensor directly to the vision tower. So the vision tower is our 02:49:34.280 |
Siglip vision model. So it means that we are using the forward method here. So we are feeding the pixel values here 02:49:41.640 |
It will extract what it will extract some patches with their contextualized embeddings 02:49:52.340 |
Patches and each of these patches is a contextualized patch actually 02:49:56.180 |
The second thing we are going to do is we are going to resize these image embeddings into the same size as the text embeddings 02:50:12.100 |
So we take the image embeddings extracted by the vision encoder and then we resize them using a linear layer called the multi-modal projector 02:50:20.340 |
So later we will see this is actually just a linear layer that will convert this embedding 02:50:25.300 |
So this embed dimension extracted from the vision encoder into the hidden size 02:50:29.540 |
Which is the same embedding size used by the language model for each of this each of its tokens 02:50:34.420 |
Now we need to merge the tokens extracted from the vision 02:50:41.300 |
Model with the text token extracted from these embeddings which already contain some placeholders for where we should put the image tokens 02:50:50.420 |
And for that we will create another method called 02:50:55.700 |
Called merge input ids with image features in which we pass the image features extracted from the vision encoder the input 02:51:02.740 |
Embeddings extracted from the language model with which already contains the placeholders 02:51:07.720 |
the input ids which are the original input ids fed to the 02:51:11.620 |
The tokens fed to the language model, the attention mask given by the tokenizer, and the KV-Cache, which we will see later 02:51:20.500 |
Suppose that these input features have been merged so we will get these input embeddings these input embeddings. What are they? 02:51:41.940 |
So what we are doing is basically we are creating this stuff here. So we are taking the 02:51:46.660 |
First we are taking the image features extracted by the vision encoder and these 02:51:52.500 |
Then we are resizing them using this multimodal projector, which is this stuff here 02:51:57.300 |
Which will resize the each embedding vector to the correct size so that they can be concatenated with the 02:52:08.660 |
When we tokenize them, they already contain some placeholder tokens, which are those image tokens 02:52:14.500 |
We saw before in the processing_paligemma.py file 02:52:17.460 |
Our goal is to replace each of them with the features extracted from this vision encoder after it has been resized by the multimodal projector 02:52:29.060 |
So this method takes the image features extracted after 02:52:31.940 |
They have been resized the input embedding extracted from the language model which contains the text tokens and the placeholders 02:52:40.580 |
So suppose that now it everything has been replaced. So we treat it as a black box 02:52:44.580 |
What we are going to do we are going to feed all this sequence 02:52:47.300 |
Which is a sequence of image features and the text tokens to the language model 02:52:53.540 |
Use the prompt of the user which are these tokens and the image fed by the user to generate some text 02:52:59.380 |
So let's implement this part here, which is just calling a method 02:53:09.540 |
Because it's just calling a method and later we will implement this language model 02:53:13.620 |
So for now, I created the structure of what we are doing 02:53:16.420 |
So we extract first we tokenize the text the text already contains placeholders 02:53:21.060 |
We replace these placeholders with the features extracted from the vision encoder. We feed everything to the language model. The language model will 02:53:27.060 |
Generate some output and we return this output 02:53:30.020 |
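To keep the big picture in mind, here is a condensed sketch of the forward pass we just outlined (the method and attribute names follow this walkthrough, not a definitive implementation; it skips the assertions and the return plumbing):

```python
def forward(self, input_ids, pixel_values, attention_mask=None, kv_cache=None):
    # 1. text token ids (with <image> placeholders) -> embeddings; the placeholder embeddings are junk for now
    inputs_embeds = self.language_model.get_input_embeddings()(input_ids)

    # 2. image -> contextualized patch embeddings -> projected to the text embedding size
    image_features = self.multi_modal_projector(self.vision_tower(pixel_values))

    # 3. replace the placeholder embeddings with the projected image features
    inputs_embeds, attention_mask, position_ids = self._merge_input_ids_with_image_features(
        image_features, inputs_embeds, input_ids, attention_mask, kv_cache
    )

    # 4. let the language model generate, conditioned on image + prompt
    return self.language_model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        kv_cache=kv_cache,
    )
```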
Now our goal is of course to implement all of these blocks that we have created that we have taken for granted for now 02:53:36.820 |
The first thing that we can do is to implement this polygamma config which will give us some understanding of what are 02:53:41.700 |
What is the kind of configuration that this polygamma needs? 02:53:44.500 |
For that, we need to create this PaliGemmaConfig 02:53:51.800 |
Okay, the PaliGemmaConfig basically takes as input... so, what is PaliGemma made of? 02:54:00.340 |
So what is Gemma, what is PaliGemma, and what is SigLip? 02:54:05.300 |
I think you should already have an understanding of it by now. So PaliGemma is all of this stuff here 02:54:11.060 |
So it's a combination of a vision encoder and a text decoder language model, so a Gemma model 02:54:18.980 |
It's composed of a SigLip vision encoder along with a linear layer that will change the embedding size 02:54:24.660 |
And it's made up of a language model called the Gemma language model 02:54:29.540 |
So the polygamma needs of course the configuration for this block here 02:54:33.860 |
So the language model and the configuration for the vision encoder so that it can create an instance of 02:54:39.300 |
This SigLip class and of this Gemma language model, passing their own configurations to them 02:54:47.300 |
So you have the vision config which is the configuration of the vision encoder the text config which is the configuration of the text 02:54:56.340 |
The ignore index is not used. We will not be using it for labels 02:55:00.340 |
So if you are training, but we will only doing inference 02:55:02.820 |
The image token index is the token corresponding to the placeholder image token. So the 02:55:08.500 |
This token here. So let's this this stuff here 02:55:11.780 |
The vocabulary size. So what is the vocabulary size of the model? 02:55:19.300 |
What is the final dimension that the image features should be resized to before feeding to the language model? 02:55:25.940 |
So what is basically the output size of this linear layer? 02:55:30.580 |
Then we have the hidden size which is the embedding size of the language model 02:55:35.460 |
So the language model has some tokens. These tokens are embeddings and these embeddings have a dimensions. How many dimensions? 02:55:44.900 |
This stuff is something that HuggingFace needs we will not be using it 02:55:50.980 |
We save the padding token id, in case it's passed; then we save the vision config 02:55:55.060 |
We save the text config, and then we create the configuration of the text language model 02:55:59.220 |
Which is the Gemma model, to which we pass of course the text configuration, and to the vision encoder we pass the vision configuration 02:56:10.100 |
Then how many image tokens each image will generate, which is basically (the size of the image divided by the patch size) squared 02:56:17.140 |
So it's actually how many patches you get for each image 02:56:21.300 |
Um, which is also corresponds to how many image tokens you get here 02:56:26.500 |
Because of course if you divide the image by four you get four patches 02:56:31.700 |
If you divide it in smaller parts, you get more patches and each a polygamma size 02:56:36.420 |
So PaliGemma 224, I think, has 256 tokens; another one has more, etc., etc. 02:56:44.100 |
Um, the projection dimension is how we want to resize this image tokens, etc 02:56:49.620 |
So now let's create also the configuration for the gamma model 02:56:52.660 |
which is just the configuration of any language model because it has 02:56:57.060 |
A vocabulary size how much tokens we have in our vocabulary the hidden sizes. So what is the size of the embedding? 02:57:04.820 |
Embedding vector of each token the intermediate size of the feed-forward layer as we saw before 02:57:12.020 |
In Sigleap the number of hidden layers. So how many layers our transformer has in this gamma language model 02:57:18.740 |
How many attention heads we have? Okay here we have a difference 02:57:22.340 |
This is called the grouped query attention when you have a different number of heads for the query and for the key and values 02:57:28.340 |
the number of heads here refers to the number of heads for the 02:57:32.420 |
Queries and the number of heads for the key and values is this parameter here. We will see later how it works 02:57:40.180 |
Dimensions each head will work with as we saw before we divide this big embedding into smaller groups one dedicated to each head 02:57:47.860 |
This is how many dimensions each head will watch 02:57:53.560 |
But actually it will come from the configuration file of the polygamma model that we will load 02:58:09.860 |
We will load all this configuration from this config.json file 02:58:13.700 |
Which as you can see contains this text config this visual config which contains exactly the information that we need here 02:58:20.500 |
This max positional encodings indicates how much the maximum number of positions our model has been trained upon 02:58:28.740 |
Which is necessary for the rotary positional encodings 02:58:33.380 |
RMS norm we will see later: what is the RMS normalization. But just like the layer normalization 02:58:39.460 |
We have this parameter called rms_norm_eps. Okay, I will explain it later 02:58:43.940 |
Actually, rope_theta is another parameter of the rotary positional encoding, which is the base frequency 02:58:56.420 |
Then whether we want the bias, because as you remember we have the Wq, Wk and Wv matrices 02:59:00.900 |
These are linear layers and we can have also the bias term, but we I believe we never use the bias for this 02:59:07.300 |
And it looks like we yeah, we don't use any bias for it. So if they don't overwrite it then it remains false 02:59:14.340 |
Dropout, just like before, we are not going to use it; and the padding token id. And we save all this stuff. So nothing sophisticated here 02:59:21.920 |
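To make the list of parameters above concrete, here is a minimal sketch of what such a text configuration could look like. The attribute names and default values are assumptions for illustration only; the actual values come from the config.json we load.

    # A minimal sketch of a Gemma-style text configuration (hypothetical names/defaults;
    # the real values come from the config.json of the checkpoint we load).
    class GemmaConfig:
        def __init__(
            self,
            vocab_size,                     # how many tokens are in the vocabulary
            hidden_size,                    # size of the embedding vector of each token
            intermediate_size,              # size of the feed-forward hidden layer
            num_hidden_layers,              # how many transformer (decoder) layers
            num_attention_heads,            # number of heads for the queries
            num_key_value_heads,            # number of heads for keys/values (grouped query attention)
            head_dim=256,                   # how many dimensions each head works with
            max_position_embeddings=8192,   # maximum number of positions (for rotary positional encodings)
            rms_norm_eps=1e-6,              # epsilon of the RMS normalization
            rope_theta=10000.0,             # base frequency of the rotary positional encodings
            attention_bias=False,           # whether Wq, Wk, Wv linear layers have a bias term
            attention_dropout=0.0,          # we will not use dropout
            pad_token_id=None,
            **kwargs,
        ):
            self.vocab_size = vocab_size
            self.hidden_size = hidden_size
            self.intermediate_size = intermediate_size
            self.num_hidden_layers = num_hidden_layers
            self.num_attention_heads = num_attention_heads
            self.num_key_value_heads = num_key_value_heads
            self.head_dim = head_dim
            self.max_position_embeddings = max_position_embeddings
            self.rms_norm_eps = rms_norm_eps
            self.rope_theta = rope_theta
            self.attention_bias = attention_bias
            self.attention_dropout = attention_dropout
            self.pad_token_id = pad_token_id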
Now, the first thing that we are going to do, since we have already implemented PaliGemmaForConditionalGeneration 02:59:28.160 |
I believe that the first thing that we can do is this method here, merge_input_ids_with_image_features 02:59:33.760 |
But for that we will need to understand what the KV cache is 02:59:37.120 |
All right. So let's start coding this method. So 02:59:40.800 |
Let me go also here in the code that I have already written. So I will code it piece by piece 02:59:51.760 |
So we create this method which has this signature 02:59:58.240 |
And let's extract. Okay. The first thing we do is we extract some information from the inputs 03:00:05.460 |
Which are: the embedding dimension of the image features. Why do we already know it? 03:00:14.400 |
Because we pass them after sending them through this multimodal projector 03:00:18.320 |
So they have already been resized to the same size of the text tokens 03:00:22.000 |
Then we have these input ids, which tell us how many tokens we have. The input ids 03:00:26.480 |
If you remember correctly, are not the embedding of each token 03:00:30.080 |
It's the number indicating the position of each token in the vocabulary 03:00:33.060 |
While the input embeddings are the embedding of each token after they have been extracted from the embedding layer of the language model 03:00:48.400 |
The first thing that we do is we scale these image features 03:00:54.000 |
We scale these image features, which also helps. It's the same kind of scaling that we use 03:01:00.400 |
In the attention mechanism, where we do query multiplied by the transpose of the keys divided by the 03:01:06.080 |
Square root of d_model. Here we do the same kind of scaling 03:01:11.760 |
Because probably they have tried multiple variations of the model and we want the magnitude of the numbers to remain the same 03:01:18.480 |
That's why we divide by the square root of the hidden size. So if you want to double, for example, the embedding 03:01:25.440 |
Size of the image features, you want the magnitude of the numbers to remain more or less the same. That's why you scale them 03:01:35.360 |
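A minimal sketch of this scaling step, assuming the projected image features are in a tensor called image_features and the text hidden size is config.hidden_size (both names are assumptions):

    # Same idea as the 1/sqrt(d) scaling in attention: keep magnitudes comparable.
    scaled_image_features = image_features / (self.config.hidden_size ** 0.5)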
Now the first thing that we need to do is to create the final tensor that will hold the combined 03:01:41.380 |
Features of the image tokens and the text tokens and this is and it's this tensor here 03:01:46.560 |
It's made up of zeros and it has the size of batch size 03:01:50.000 |
Sequence length. So what is sequence length? The sequence length is the number of input ids we have 03:01:55.520 |
What are these input ids? The input ids are coming from the PaliGemma processor 03:02:02.720 |
Which are the placeholders for the image tokens, the 03:02:08.640 |
Tokens of the prompt and the newline character 03:02:11.760 |
So the token corresponding to the newline character 03:02:15.140 |
So we create this sequence of empty embeddings, each of size embed_dim 03:02:23.140 |
The embedding dimension, which is the same size as the embedding vector of the language model, because the image 03:02:29.120 |
Tokens and the text tokens will have the same size, which is embed_dim here 03:02:33.680 |
We want it to be of the same dtype 03:02:37.520 |
So if the input embeds are in float32, so will this tensor be, and we put it on the same device 03:02:43.120 |
The first thing that we do is we create some masks that will be useful for understanding which is a placeholder token 03:02:50.160 |
Which is a text token and which is a padding token, even though we will not be using any padding 03:02:54.640 |
So I just took the original implementation, which was already handling the padding, but we will actually never have padding tokens 03:03:03.600 |
Well, a text token is something that is not an image placeholder token and is not a padding token 03:03:10.560 |
An image token is something that is equal to the image placeholder token, and the padding tokens are the tokens that correspond to the padding token id 03:03:21.360 |
These masks will be useful for us to understand where to put the embeddings of the image tokens in this 03:03:25.920 |
Final embedding tensor, where to put the text tokens in this final embedding tensor, and where to put the padding tokens in this final one 03:03:37.440 |
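Here is a minimal sketch of how such masks can be built from the input ids (attribute names like image_token_index and pad_token_id are assumptions for illustration):

    # Boolean masks of shape (batch_size, seq_len); names are illustrative.
    text_mask = (input_ids != self.config.image_token_index) & (input_ids != self.pad_token_id)
    image_mask = input_ids == self.config.image_token_index
    pad_mask = input_ids == self.pad_token_id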
Here we see them, and later we will see why we need to expand them. So basically we already have the 03:03:49.120 |
Batch size dimension and the sequence dimension 03:03:52.100 |
Because they are already given by the input ids 03:03:59.600 |
What we are creating is the embedding dimension, and then we are expanding the mask along this embedding dimension 03:04:07.200 |
Later we will see why we need it. So basically this means that 03:04:10.560 |
The text mask here. So let me draw a sample of how it may look like 03:04:22.400 |
Will be something like this. Suppose that the 03:04:25.520 |
Input ids start with the tokens corresponding to the image placeholders 03:04:34.780 |
So we have many tokens corresponding to the placeholders for the image, then we have the beginning of sentence token, the tokens of the prompt, and then the 03:04:55.040 |
\n token. Suppose it is the token number two 03:04:59.760 |
The text mask here will be basically something like this: it will be zero, zero, zero, zero, zero for the image placeholders 03:05:10.000 |
And then it will be one, one, one, one, one, one, and then it will be zero 03:05:16.480 |
Uh, actually one, because the \n token is still treated as part of the text prompt 03:05:24.400 |
The image mask will be one, one, one, one, one and then a series of zeros, because all the others are text tokens 03:05:34.400 |
And the padding mask will be equal to all zeros. I don't write all of them, but you can understand: all zeros, because we don't have any padding token 03:05:45.280 |
This expand basically repeats these zeros and ones along this dimension, the embedding dimension that we are adding here with this unsqueeze 03:05:53.940 |
And we will need it later for another method, which is the torch.where method 03:05:59.040 |
So for now, just keep in mind: we are expanding these masks by repeating the series of zeros and ones along a new dimension 03:06:05.060 |
So the first thing that we do is we copy the text 03:06:09.660 |
Embeddings into this final embeddings and we do this by using this method. So we say this final embeddings 03:06:16.000 |
This torch.where method basically says that if this condition is true 03:06:20.620 |
It will take the input from the second argument. Otherwise, it will copy the third argument 03:06:26.620 |
So if wherever this condition is true, it will copy this stuff here wherever this condition is false. It will copy this stuff here 03:06:42.380 |
We copy the embeddings from the input embeds, which contain the text tokens plus the placeholders for the image, but only at the 03:06:51.740 |
Text token positions, because for the image tokens we will have zero in this mask 03:06:58.940 |
Otherwise just keep the final embedding as it is 03:07:08.860 |
Then we copy the image features, using another method called masked_scatter. We cannot use torch.where because the sequence length of the 03:07:17.980 |
Scaled image features is not equal to the sequence length of the final embedding 03:07:22.300 |
But basically this does the same job as the where 03:07:25.500 |
So what we are saying is that copy from the scaled image features where this stuff is true 03:07:33.500 |
So we are copying the image features where the image mask is true 03:07:38.620 |
Where we have the placeholder tokens for the image so we are copying in the final embedding the image tokens 03:07:50.620 |
And the padding we just zero out everything because we don't care about what is in the paddings 03:07:55.840 |
So what we are saying is that wherever the padding mask is true 03:07:59.100 |
Just copy a zero a tensor made up of zero. Otherwise keep the final embedding as it is 03:08:03.980 |
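Putting the three copies together, here is a minimal sketch with hypothetical tensor names, assuming text_mask, image_mask, pad_mask, inputs_embeds, scaled_image_features and final_embedding are the tensors described above and torch is already imported:

    # Expand the boolean masks to the embedding dimension so they match final_embedding's shape.
    text_mask_expanded = text_mask.unsqueeze(-1).expand(-1, -1, embed_dim)
    image_mask_expanded = image_mask.unsqueeze(-1).expand(-1, -1, embed_dim)
    pad_mask_expanded = pad_mask.unsqueeze(-1).expand(-1, -1, embed_dim)

    # Copy the text embeddings where the text mask is true.
    final_embedding = torch.where(text_mask_expanded, inputs_embeds, final_embedding)
    # Copy the image features where the image mask is true (sequence lengths differ, so use masked_scatter).
    final_embedding = final_embedding.masked_scatter(image_mask_expanded, scaled_image_features)
    # Zero out the padding positions.
    final_embedding = torch.where(pad_mask_expanded, torch.zeros_like(final_embedding), final_embedding)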
Now comes the interesting part so for now we have created the final embeddings 03:08:10.620 |
What is the final embeddings is this stuff here. So let me show you again from the ipad. It's this stuff here 03:08:16.620 |
So now here we have the first image token embedding 03:08:20.140 |
second image token embedding third image token embedding blah blah up to 03:08:25.960 |
256 image token embeddings in the base version of PaliGemma, if I remember correctly 03:08:30.360 |
And then we have the embeddings of the tokens corresponding to the prompt 03:08:35.080 |
Plus the padding but the padding we will never have because I excluded it from my implementation 03:08:44.440 |
The next thing we need is the creation of the attention mask, and the attention mask has to be created in a particular way 03:08:55.320 |
And for that I need to introduce the KV cache. So that's why this part is interesting. So let's go 03:08:59.880 |
So let's talk about this thing called KV cache 03:09:02.920 |
But before we talk about the KV cache, we need to understand what is the problem that the KV cache is solving 03:09:10.840 |
So as I we saw before the transformer can be thought of as a model as it's a sequence to sequence model 03:09:16.680 |
Which means that you feed it a sequence of n tokens and you get as output n tokens 03:09:22.440 |
These n tokens as output are not normal tokens anymore 03:09:25.960 |
They are contextualized tokens means that each of them is not capturing information only about itself 03:09:30.920 |
But also about other tokens which depend on the mask that you use if you use the causal mask 03:09:35.880 |
It means that only each token will only capture information about itself and all the previous tokens 03:09:41.320 |
If you are not using any causal mask, then each token will encapsulate information about all the other tokens in the sequence 03:09:47.800 |
Which is what we do with vision encoders like the image encoder we saw before the Sigleap one 03:09:52.280 |
Because the transformer is a sequence to sequence model, so let's open our ipad 03:09:59.400 |
Now because the transformer is a sequence to sequence model 03:10:05.960 |
So suppose that we want to train we train a language model on the following sentence. So it's always the same which is 03:10:19.400 |
Pardon my calligraphy I write very fast recently we feed it to this black box that we will call the transformer model 03:10:30.040 |
Each of these stuff here each of these uh tokens is actually an embedding 03:10:37.880 |
So we will get an as output a list of embeddings, but they will be contextualized 03:10:44.260 |
Contextualized one for the first token one for the second token. So this is the second embedding 03:10:49.140 |
This is the third embedding and this is the fourth embedding 03:10:52.260 |
I am again making the simplification that each word is a token and each token is a word 03:10:58.900 |
Well, we force the language model to predict the next token given the contextualized embedding 03:11:04.980 |
So this contextualized embedding here contains information only about the word I in case we are using the causal mask 03:11:14.580 |
This only contains information about the token I 03:11:17.060 |
This contains information about the token I but also the token love; this contains information about the tokens I, love, 03:11:24.820 |
Pepperoni; and this contains information about all the tokens: I love pepperoni pizza 03:11:36.820 |
What labels do we use when training a language model 03:11:39.460 |
Well, in this case, we want the language model, given the prompt, to predict what is the next token 03:11:45.460 |
So given only I, the language model should predict the word love 03:11:53.860 |
Given the prompt I love, the language model should predict the token pepperoni 03:12:04.580 |
Given the token the prompt I love pepperoni the language model should predict pizza 03:12:09.720 |
And given all the sentence it should say end of sentence so it means hey i'm done with the generation 03:12:17.080 |
Now this is how we train a language model. How do we actually inference a language model is the same way 03:12:27.220 |
so suppose that the user only gives us one token as a prompt the word I 03:12:32.340 |
And suppose that our language model has been trained on the sentence before so I love pepperoni pizza 03:12:37.220 |
How can we generate the entire sentence? Well, we feed this single token to our black box, which is our transformer 03:12:44.420 |
So now I will write it reversed because I don't have space above 03:12:50.980 |
The transformer will generate it's a sequence to sequence model, which means that it takes as input one embedding 03:12:56.960 |
Corresponding to our prompt token I and it will generate one contextualized embedding 03:13:02.420 |
So it will be one embedding. What do we do with language models? We project this single embedding into logits 03:13:13.360 |
We take the output of the transformer, which is this stuff here 03:13:18.640 |
To generate logits for this token. So let's go back here 03:13:38.400 |
These logits tell us what is the score assigned by the language model to each token in the vocabulary 03:13:45.200 |
So how likely that particular token is to be the next one. To convert it into a probability score 03:13:51.600 |
So something that sums up to one, we use the softmax. So suppose that we have already applied the softmax 03:14:04.880 |
It is still a single vector, like the logits, but the difference is that now the values all sum up to one 03:14:10.960 |
Which one do we select? The one with the highest number; usually this is called the greedy strategy 03:14:16.240 |
There is another strategy called top-p, which means that we sample from the tokens with the top scores 03:14:23.920 |
Whose cumulative probability is up to, say, 90 percent. So suppose that there are three tokens here 03:14:28.240 |
Okay, actually top-p we will see later when we implement the inference. For now 03:14:31.760 |
Just think that we are always sampling the one with the highest probability score. So we use the greedy strategy 03:14:40.560 |
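As a tiny illustrative sketch of the greedy strategy (nothing PaliGemma-specific, just toy numbers):

    import torch

    logits = torch.tensor([2.0, 0.5, -1.0])   # one score per vocabulary token
    probs = torch.softmax(logits, dim=-1)     # now they sum up to one
    next_token = torch.argmax(probs, dim=-1)  # greedy: pick the highest probability
    print(next_token.item())                  # 0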
What will happen is that probably the model if it has been trained well, it will tell us that the next token is very likely the token 03:14:48.080 |
Love so this is how we know. What is the next token? 03:14:51.840 |
How do we generate then the next next token? We take this token love 03:14:56.400 |
This token love and we put it back into the input of the language model 03:15:02.320 |
So now we feed a new input to the language model. Let's remove this stuff 03:15:10.560 |
Now we are feeding two tokens to the language model 03:15:13.280 |
Language model is our transformer model. So it's a sequence to sequence model 03:15:17.520 |
It means that it takes as input two tokens. It will output two tokens 03:15:21.200 |
So it's taking as input two embeddings. I am drawing here the text 03:15:25.920 |
But actually you need to consider that these are two embeddings of these two tokens 03:15:30.160 |
So we feed two embeddings. It will output two embeddings 03:15:45.040 |
One corresponds to the token I, so the first position; one corresponds to the second position, which is the token love 03:15:51.040 |
Because this is a contextualized embedding, it will include information about both I and love 03:15:58.000 |
Now before what we did was to project this output embedding into logits here 03:16:03.920 |
We have two embeddings which one should we project into logits? Of course. It's the second one. Why? 03:16:13.120 |
This embedding includes information about the two tokens, so it's like we are using the entire prompt. So what we do is we project it 03:16:27.520 |
It will become logits. So let's actually write logits 03:16:31.300 |
Then we apply this thing called softmax, which will convert these logits into a probability distribution 03:16:42.080 |
Using I love as the prompt. Then we sample from the softmax. Which one? The one with the highest score 03:16:48.240 |
We take the one with the highest score as the next token, so if the language model has been trained well, it will be the token pepperoni 03:17:03.680 |
Now, what do we do? How do we generate the next next next token? We take this word pepperoni 03:17:08.720 |
We feed it back into the language model and we ask again the language model. Hey generate the next token 03:17:28.160 |
We are feeding three tokens to the language model which are converted into three embeddings then are fed to the transformer 03:17:33.600 |
The transformer will output three output embeddings 03:17:39.280 |
Now, without writing it all out, the first position will correspond to a contextualized embedding that only includes information about the token I 03:17:50.560 |
The second contextualized embedding will include information about I and love, and the third contextualized embedding will include information about I, love and pepperoni 03:18:00.540 |
Which one do we project into logits? Of course the third one, because it's the one that encapsulates information about all the prompt 03:18:05.760 |
So we keep going this way and we generate one token at a time 03:18:11.360 |
Now, what is the problem here? The problem is that at every step of inference 03:18:15.280 |
We are generating a lot of embeddings. Suppose that the prompt is very large 03:18:20.320 |
A lot of embeddings that we are not using so we are creating them because the transformer is a sequence to sequence model 03:18:26.960 |
But then we are only projecting one single embedding to the logits and then to the softmax to understand what is the next token 03:18:33.760 |
And as you know, the transformer model uses this thing called attention mechanism and the attention mechanism generates this matrix 03:18:40.800 |
That is a sequence by sequence, which is the attention scores matrix that we saw before 03:18:44.560 |
which means that when you have a thousand tokens 03:18:48.960 |
It will generate a matrix that is one thousand by one thousand, so it's one million numbers in that matrix 03:18:54.240 |
So it's a huge matrix and then you only need to use a part of this matrix that will generate this embedding here 03:19:00.480 |
So is there a way to not generate the embeddings that we are not going to project into logits? 03:19:06.160 |
But only generate the one that we only need to generate the next token 03:19:10.320 |
Yes, and it's possible through what is known as the KV cache, and the trick is here. So now let's open this other slide 03:19:18.000 |
The trick is this one. So when we calculate the 03:19:21.200 |
attention matrix, so the query multiplied by the transpose of the keys divided by the square root of d 03:19:27.360 |
Model, or d_head in case we have multi-head attention 03:19:31.040 |
What we are getting is the following. Suppose that we want to generate the word pizza by using the prompt I love pepperoni: we feed the three 03:19:44.620 |
Embeddings, so I, love and pepperoni, to the transformer. The transformer will convert them into query, key and values using the projection matrices 03:19:58.700 |
It will convert them into query key and values and now then we use the query key and values to calculate this 03:20:04.940 |
Matrix here. So the query multiplied by the transpose of the keys, which is this matrix here 03:20:10.860 |
Then we multiply this matrix by the v sequence and it will give us the output of the 03:20:19.240 |
Attention, which is contextualized embedding you can see here and we saw also before that when we multiply by v 03:20:24.460 |
We are doing what is known as a weighted sum using these weights as weights in this weighted sum 03:20:34.620 |
So the input of the model is I love pepperoni and the output that we are getting is a three contextualized 03:20:39.440 |
Embeddings so the embedding corresponding to only to the word I the embedding corresponding to the word 03:20:45.020 |
I love and the embedding corresponding to the I love pepperoni 03:20:47.760 |
We know that we only need this one here because this is the only one that we need to project into logits 03:20:53.980 |
And then to generate the next token. So is there a way to not compute these two things here that we will not be using? 03:21:05.420 |
Yes. This last contextualized embedding here is the result of the multiplication of this matrix by this matrix 03:21:12.700 |
But not all of this matrix by the v sequence, only the last row of this matrix by the v sequence, because 03:21:25.500 |
This number here comes from the result of the dot product of this row here 03:21:34.220 |
So this number here comes from the dot product of 03:21:38.700 |
The last row of this matrix with the first column of this matrix; the second number in this output 03:21:44.540 |
Vector comes from the dot product of the last row of this matrix with the second column of this matrix 03:21:55.500 |
Dot product of the last row of this matrix with the third column of this matrix, etc, etc for all the 128 dimensions 03:22:02.720 |
So what we need to generate only this one is the last row of this matrix, but all the v sequence 03:22:15.500 |
Because the attention matrix as we saw before we can consider the rows 03:22:20.460 |
To be the queries and the columns to be the keys. To have only this last row here, we need only the last token as query 03:22:30.060 |
But all the previous tokens, including itself, as keys, and we need also all the tokens as values 03:22:37.740 |
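You can convince yourself of this with a tiny numeric check (purely illustrative, random tensors): because the softmax is applied row by row, using only the last query row reproduces exactly the last output embedding of the full attention.

    import torch

    d = 8
    q = torch.randn(3, d)   # queries for "I love pepperoni"
    k = torch.randn(3, d)
    v = torch.randn(3, d)

    full = torch.softmax(q @ k.T / d**0.5, dim=-1) @ v       # all three output embeddings
    last = torch.softmax(q[-1:] @ k.T / d**0.5, dim=-1) @ v  # only the last query row
    print(torch.allclose(full[-1:], last, atol=1e-6))        # True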
That's why what we do is the following when we generate text with a language model 03:22:50.300 |
Let me draw in such a way that it's not confusing. So I think we can continue here. So 03:22:55.900 |
Imagine we start again the process of generation of text, but this time we do it with the kv cache 03:23:05.420 |
Top to bottom. Otherwise, it gets confusing because before I did top to bottom. So 03:23:10.140 |
Okay, we use only the token i as input to the language model 03:23:14.620 |
The language model will convert it into an embedding blah blah blah, then we feed it to the transformer 03:23:19.120 |
Suppose that it's only made up of one layer. Actually, it's a series of layers 03:23:26.140 |
Single token will be converted into query key and values. So it will be a sequence of tokens 03:23:34.620 |
So the q sequence will be one token. The k sequence will be one token. The v sequence will be one token 03:23:46.240 |
Which will calculate that matrix, so the query multiplied by the transpose of the keys, which will be a matrix that is one by one, because we only have one token 03:23:54.080 |
And then we multiply it by v so it will result in only one contextualized embedding as output 03:23:59.920 |
So it's this stuff here what we do we project it into logits 03:24:03.700 |
Which is another vector then we convert it into softmax which is another vector 03:24:20.720 |
The difference with the kv cache is that whenever we pass a token to the input of this self attention 03:24:28.580 |
We cache the key sequence and the v sequence into a buffer called the kv cache 03:24:40.240 |
That initially is empty. But after we pass the token I 03:24:43.760 |
It will contain the embedding. So the q embedding. Sorry the k embedding corresponding to the token I 03:24:50.960 |
And also this is the kv cache. So it is made up of the key cache and the v cache 03:24:59.040 |
Then we have the v cache which is initially empty 03:25:01.440 |
But after we send in the first token, we save this v sequence. It only contains one token. So we save it here 03:25:14.080 |
Using the query key and values. It will result in only one output embedding. We project it into logits 03:25:21.120 |
We project it into softmax. We sample. What is the next token? Very probably it will be the token love 03:25:30.560 |
What we did before was that we took this word love 03:25:33.920 |
Put it back inside of the prompt and then ask the language model again. What is the next token? 03:25:38.480 |
But with the kv cache we do something different 03:25:40.640 |
With the kv cache. We always take the previously generated token. So in this case is the token love 03:25:59.440 |
And we use this single token as input to the language model 03:26:03.520 |
Now what happens is that we feed the transform this single token love into its embedding which is an 03:26:10.720 |
Uncontextualized embedding we feed it to the first layer of the transformer as a query key and values for now 03:26:16.720 |
The query key and value contains only one token the token correspond the embedding corresponding to the token love 03:26:31.760 |
For the key for the keys and values we take this single token love we append it to this buffer called 03:26:39.200 |
Kv cache. So now it contains love here for the values. Also it contains love 03:26:45.120 |
And then we use this buffer as the key and value sequence in the self attention 03:26:50.640 |
So we take this token love we convert it into query key and value the query key and values are one single token 03:26:57.600 |
But the query the key and value we append them each of them into their respective buffer here 03:27:03.520 |
And then we use the content of this buffer to calculate the self attention 03:27:08.400 |
What happens is that we have only one query, but now we have two keys and two values 03:27:13.440 |
Which will result in exactly the calculation of this last row of this matrix 03:27:21.360 |
That is the last row that we are interested in, to predict only the next token and not generate all the other contextualized embeddings 03:27:31.520 |
For now it's only two tokens, but later, with the third token, we will see it will be exactly the last row of that matrix 03:27:39.360 |
The output of this self attention because we have one query two keys and two values 03:27:43.680 |
I can guarantee mathematically it will be one single embedding you can verify by yourself 03:27:48.800 |
But basically if you have one query as you saw before the self attention mechanism 03:27:52.820 |
Will generate a matrix that is a sequence by sequence 03:27:55.760 |
But in this case, the rows of this matrix are defined by how many queries you have. So we have only one 03:28:06.400 |
So it will be a matrix that is one by two and it will result in only one output embedding token when you multiply it by v 03:28:16.240 |
And we saw that before actually when we calculated the dimensions of the output embedding 03:28:20.800 |
We saw that it's only the last row that generates the last embeddings and this is exactly what we are doing here 03:28:26.320 |
Anyway, with the self attention calculated like this 03:28:30.240 |
So using as query the single token, but as keys and values the contents of the buffers of the KV cache 03:28:36.960 |
To calculate the self attention we result in only one output embedding 03:28:41.200 |
Which is exactly the contextualized embedding that we are interested in to generate the next token 03:28:46.160 |
We project it into logits, we apply the softmax, and it will result in the next token being pepperoni 03:28:53.500 |
Naively, what we did before was take this word pepperoni and feed it back into the prompt and then feed all the prompt to 03:29:00.240 |
The language model but with the kv cache it's different. So we use the last generated token pepperoni 03:29:10.960 |
We feed it to we convert it into a single embedding 03:29:15.140 |
So the query key and value here are one single token 03:29:20.080 |
But before computing the self attention, we put this key and value inside each of their buffers 03:29:27.520 |
So now the buffer for the k contains pepperoni as well 03:29:36.080 |
Then to calculate the self attention we don't use this key and v we use the content of the kv cache because it contains three tokens 03:29:43.360 |
So as query we use only one token, which is the word pepperoni 03:29:46.660 |
But as key and v we use the content of the kv cache. So it will result in a matrix that is 03:29:51.360 |
Exactly the last row that we saw here because it's exactly this one now because we have as a query 03:29:58.480 |
Only the word pepperoni, and as keys the tokens I, love and pepperoni 03:30:03.440 |
Which will result when multiplied with the v sequence, which is three tokens because we have also the v cache 03:30:08.640 |
Will result exactly in the computation of this output embedding here, which is only one single embedding 03:30:15.780 |
Which is exactly the one that we need to predict the next token, which will be 03:30:23.120 |
Etc., etc. So this is the KV cache. This KV cache basically allows us, during inference 03:30:30.640 |
So during token generation, to avoid generating all the embeddings 03:30:34.580 |
Of all the input sequence, but only generate the last 03:30:38.400 |
Contextualized embedding, which is exactly the one that we need to predict the next token 03:30:44.960 |
There is another thing that we need to know about the KV cache, which is the pre-filling. The pre-filling is basically this: we started here with 03:30:56.720 |
So we only use the word I but usually the prompt is a little longer. So it's not only one token from the user the user 03:31:04.960 |
Suppose that the user uses multiple tokens, so it uses the word I love 03:31:09.280 |
What we do is because we have already access to all the tokens of the prompt of the user 03:31:17.040 |
We are not generating them. We can pre-fill instantly using all of the prompt 03:31:23.520 |
All the kv cache corresponding to the prompt of the user so we can do instead of doing first adding I and then adding love 03:31:30.320 |
We add both of them in the same forward pass. How to do that? 03:31:34.480 |
We take we use both of them. We convert them into embeddings 03:31:38.080 |
So it will result in two embeddings. We feed it to the language model as query key and values 03:31:44.720 |
This will result in a q sequence of two tokens, a k sequence of two tokens and a v sequence of two tokens 03:31:52.960 |
We put the k and the v inside of their respective buffer called the k buffer and the v buffer which comprise the kv cache 03:32:10.560 |
So now we have two tokens for the query, two for the keys and two for the values, because the content of the KV cache contains two tokens 03:32:17.440 |
Which will result in a two by two matrix, so it will result in two output embeddings 03:32:23.460 |
And two output softmaxes. Which of the two output embeddings do we project into the logits and the softmax? Only the last one 03:32:32.640 |
Because we are we are not interested in predicting the word love. We are only interested in knowing what comes after love. So we only take the 03:32:40.800 |
Embedding corresponding to the position of the word love we project it into logits 03:32:47.460 |
And we project it into softmax to understand what is the next token 03:32:50.740 |
So only during this pre-filling phase we actually allow the generation of multiple output embeddings 03:32:57.960 |
And then we discard the one that we don't need 03:33:00.900 |
Why do we do it because we don't want to add one single token at a time because it will be too slow 03:33:06.180 |
If you have a lot of tokens, you just add them all at once in the kv cache 03:33:10.260 |
And then you use this kv cache which is pre-filled now to generate one token at a time 03:33:16.420 |
The reason we do it is because the gpu is very fast at parallelizing stuff 03:33:20.740 |
So it's very good at parallelizing computations 03:33:22.900 |
So actually by doing all of these computations inside of the gpu 03:33:26.740 |
Will result in a much less wall clock time instead of adding one token at a time 03:33:30.820 |
And this, guys, is the KV cache. So now we can finally code it 03:33:34.340 |
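Before following along with the actual implementation, here is a minimal sketch of what such a cache could look like: one key buffer and one value buffer per layer, and an update method that appends the new keys/values along the sequence dimension and returns the full buffers to be used in the attention. Class name, method names and shapes are assumptions for illustration.

    from typing import List, Tuple
    import torch

    class KVCache:
        def __init__(self) -> None:
            # One tensor per transformer layer, shape: (batch, num_kv_heads, seq_len, head_dim)
            self.key_cache: List[torch.Tensor] = []
            self.value_cache: List[torch.Tensor] = []

        def num_items(self) -> int:
            # How many tokens are currently stored (0 if we haven't pre-filled yet)
            return 0 if len(self.key_cache) == 0 else self.key_cache[0].shape[-2]

        def update(self, key_states: torch.Tensor, value_states: torch.Tensor,
                   layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
            if len(self.key_cache) <= layer_idx:
                # First time we see this layer: just store the keys/values (pre-filling)
                self.key_cache.append(key_states)
                self.value_cache.append(value_states)
            else:
                # Token generation: append the new token's keys/values along the sequence dimension
                self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
                self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
            # Return the full buffers, which are used as the K and V sequences in the attention
            return self.key_cache[layer_idx], self.value_cache[layer_idx]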
Okay, let's code the next part. So we copy this part here and all of this 03:33:48.100 |
We know that we have two parts to do when we work with the kv cache 03:33:51.700 |
One part is called pre-filling and one is token generation. During the pre-filling, we send all the prompt of the user 03:34:00.340 |
To the model, using it as query, key and value, and this will create the initial cache that will then be used by the subsequent phase 03:34:07.320 |
During token generation. So where we generate one token at a time 03:34:11.300 |
Why do we do these two phases? Because the prompt is already available to us 03:34:15.540 |
We don't want to add it one token at a time, while during token generation 03:34:19.300 |
We want to generate one token at a time because we don't have these tokens 03:34:28.500 |
when we are working with the pre-filling phase, we will have that the 03:34:32.980 |
Number of queries, keys and values will be the number of the tokens inside of the prompt. So we generate a mask that is sequence by sequence 03:34:42.180 |
Because it will be used in the attention mask. So let's visualize it actually 03:34:46.260 |
so suppose that we are doing the following so 03:34:50.900 |
This suppose that we receive a prompt that is I love pepperoni and we want to generate the next token, which is pizza 03:34:58.180 |
The attention calculation will result in the following attention score 03:35:02.660 |
So it's a matrix that is three by three in which we want to mask out some 03:35:07.840 |
Interactions between tokens; specifically, each query cannot attend to future keys 03:35:12.400 |
And the way we do that is we create an attention mask 03:35:16.560 |
Of the same size of the attention matrix as you can see so three by three. So sequence by sequence 03:35:24.400 |
Before we apply the softmax, we add this thing called a mask to this matrix 03:35:31.280 |
And this mask is made up of minus infinities for all the positions in which we don't want any interaction to happen 03:35:38.160 |
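In code, the idea looks roughly like this (a sketch with generic tensor names; q, k, v and mask are assumed to already exist):

    import math
    import torch

    # q, k, v: (batch, heads, seq, head_dim); mask: 0 where allowed, -inf where not allowed
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
    scores = scores + mask                    # -inf positions become ~0 after the softmax
    weights = torch.softmax(scores, dim=-1)
    output = weights @ v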
And this is what we are doing here. So at the beginning we create 03:35:41.920 |
We are inserting the prompt of the user, and you would expect we should mask out future tokens; however 03:35:51.680 |
We create a mask that is sequence by sequence 03:35:54.960 |
So this is during the pre-filling, so when the KV cache does not contain any item, it means that we are 03:36:01.040 |
Doing it for the first time, so we are pre-filling the prompt of the user 03:36:04.160 |
Now, we are not adding any minus infinity value to this attention mask during the pre-filling. Why? 03:36:12.560 |
To understand that, we need to understand how PaliGemma attends to the 03:36:17.120 |
Image tokens and to the prompt of the user. So for that, let's open the page of the paper 03:36:28.540 |
So a prompt in PaliGemma is made up of the image tokens, which are 256 in the case of the smallest PaliGemma 03:36:37.760 |
Then we have the prompt of the user which is a beginning of sentence token plus the prompt of the user 03:36:43.180 |
So for example, the prompt of the user may say extract where the photographer is in this picture 03:36:48.060 |
And then we have a separator token, which is the new line token we saw before 03:36:53.420 |
As you can see the attention mask here is not masking out anything for the part that corresponds to the 03:37:00.300 |
Prompt, because the prompt of the user is made up of the 03:37:04.220 |
Textual prompt plus the image, and we don't mask out anything. Why? Because, and it's quite interesting 03:37:11.420 |
And it's different from what we usually do with language models 03:37:17.900 |
We don't mask out anything because each text token that we will generate needs to access all the image tokens 03:37:25.020 |
So it will be conditioned on all the image tokens. That's why it's called conditional generation 03:37:29.120 |
And that's fine, because we saw that each image embedding encodes information not only about itself 03:37:37.740 |
But also about all the other embeddings, and we want each text token to watch all the image tokens, and that's fine 03:37:49.740 |
So as you can see the first token of the prompt, which is this one 03:37:53.500 |
so suppose that the prompt is two tokens, for example, I love and 03:37:56.940 |
We want to generate the word pepperoni and pizza, which should be the first output token and the second output token you can see here 03:38:05.180 |
Why are we not applying any causal mask to the tokens of the textual prompt? 03:38:14.780 |
Because the textual prompt is usually very short 03:38:17.420 |
And it usually describes what is the task that we want the vision language model to perform 03:38:31.180 |
This prompt represents the task that we want the language model to perform 03:38:34.380 |
We want all the tokens that will be generated to watch all of the tokens of the prompt 03:38:41.480 |
Moreover, we want each token in the prompt to watch even future tokens of the prompt itself 03:38:55.160 |
When we will do prefilling what we will have is the following so we will have 03:39:00.360 |
The prompts let's use a different color. So we will have all the tokens of the prompt which are the 03:39:06.440 |
Textual prompt which is the textual prompt that we will send to the model 03:39:14.280 |
And we do not need to generate any mask here because each 03:39:18.840 |
Text prompt can watch even future tokens of the text prompt because you can see that this is the keys 03:39:26.200 |
This is the query number one of the text prompt and this is the key number one of the text prompt 03:39:32.360 |
This is the key number two of the text prompt and as you can see the query number one of the text prompt 03:39:36.760 |
So this beginning of sentence token can attend to the key number two of the text tokens 03:39:45.000 |
This is a choice that the PaliGemma authors made. They said: okay, for the prefix 03:39:50.040 |
Because we are not generating this prefix, which is the prompt that we send to the model telling it what it needs to do with the image 03:39:57.960 |
We do not need to add any causality because we do not 03:40:02.840 |
Need the model to be causal with respect to this prefix because we are not going to generate it 03:40:07.960 |
however, the only thing that we are going to generate is this thing called suffix which are the 03:40:13.560 |
Output tokens predicted by the model using the prompt textual prompt and the image 03:40:20.920 |
So the first token output by the model needs to attend all the previous keys, which are the image token 03:40:27.480 |
So these three image tokens plus the four tokens of the text prompt 03:40:32.760 |
Then the next token predicted by the model should be able to access again all the image tokens 03:40:39.000 |
So the first three tokens, then the four tokens of the textual prompt, plus the last token generated 03:40:46.760 |
By the model. Then, when we generate the next next token, it will need to access the 03:40:53.560 |
First three image tokens then the next four text tokens of the prompt 03:40:58.280 |
And the two tokens predicted by the model before so it is causal only in the generated text not in the prefix part 03:41:07.240 |
Which is different from normal language models. In normal language models, when we prefill, even the prompt 03:41:20.840 |
Itself is prefilled using the causal mask, because the prompt is just 03:41:25.160 |
A part of what the model would generate if it would start with only the first token 03:41:30.440 |
But this is not the case in PaliGemma. It's a choice that the PaliGemma team made 03:41:35.240 |
So it's not like the language model has to work in this way or there is any advantage or disadvantage 03:41:40.700 |
The only advantage if we want to say is that the information about the prompt 03:41:45.880 |
Is replicated in each of these tokens because each of these tokens basically 03:41:50.440 |
Includes information also about future tokens that are part of the prompt and this happened when they train the model 03:41:56.120 |
so when you train the model also you don't mask out the 03:42:03.320 |
Textual prompt you only mask out what you expect the model to generate 03:42:09.340 |
Using the image token and the textual prompt. So to rehearse 03:42:15.160 |
Let's go back to this image. What is the text prompt? So when we inference a 03:42:20.920 |
Visual language model, we provide an image as condition and then we provide some 03:42:27.080 |
Text prompt which is a description of what we want the language model to do with this image 03:42:32.280 |
For example tell us where is the photographer in this picture? 03:42:34.920 |
And then the model will generate some tokens as outputs telling us where the photographer in this case is 03:42:41.880 |
and what we do when we train this language model is that 03:42:47.800 |
We do not mask the tokens of the textual prompt 03:42:51.560 |
So when we ask the language model what to do with this image 03:42:54.360 |
We do not mask out during training and also during inference, of course because the model needs to work in the same way 03:42:59.560 |
But we mask out only what we expect the model to generate 03:43:03.800 |
So the causality is only in the generated tokens and it's a choice that you make with the language model 03:43:09.000 |
It's not necessary that it has to work this way, because in normal language models 03:43:15.720 |
There is no such not-masking-out of the prompt, because usually the prompt itself 03:43:20.120 |
You can consider it as something generated by the model, even if it's not 03:43:23.480 |
So this is more of a philosophical question than a technical one 03:43:28.200 |
But the reason is that it's a choice made by the PaliGemma authors. Also, in visual language models 03:43:32.920 |
Like PaliGemma, the task, so the textual prompt, is usually very short 03:43:37.800 |
It tells the model what to do with the image that it's being fed 03:43:40.760 |
so for example localize where is the cat in this image or 03:43:43.480 |
Extract all the numbers or tell me where is the photographer in this image, etc, etc 03:43:50.200 |
And also, usually, the generated output of the model is very short 03:43:53.960 |
So, at least, models like PaliGemma are not used for generating very long 03:43:59.320 |
Content but they can be of course fine-tuned to do it 03:44:04.520 |
So, let me delete this part. Otherwise it remains here forever 03:44:11.320 |
All right, so now we have seen how we generate the 03:44:14.200 |
The mask for the pre-filling. So, for the pre-filling 03:44:18.360 |
We do not mask out anything because we do not mask out the text prompt and we do not mask out the image prompt 03:44:24.520 |
The interesting part is that when we generate the text, we generate one token at a time with the KV cache 03:44:38.440 |
Because, let's go back to the PaliGemma picture here. So here 03:44:43.640 |
When you generate the first token, the first token needs to access all the image tokens and the text tokens 03:44:50.360 |
So we don't need to mask out anything 03:44:52.840 |
When we generate the next token as you can see it needs to access all the image tokens and all the text tokens 03:44:59.320 |
Plus the last generated token here. So we do not need to mask out anything then again for the next next token 03:45:05.320 |
We need to access all the previous tokens plus the two previously generated tokens 03:45:09.960 |
So we do not need to mask out anything because we are generating one token at a time 03:45:13.800 |
So it needs to access all the previous tokens plus the image tokens plus the textual prompt 03:45:18.920 |
So we never need to mask out anything. So you may be wondering why are we never masking out anything? 03:45:25.000 |
Because we are working with the KV cache, and with the KV cache 03:45:27.480 |
We only generate one single row of this matrix at a time 03:45:33.320 |
We always generate the last row and the last row is always the last token that needs to access all the previous tokens 03:45:38.920 |
So we never need to mask out anything. However, during training 03:45:44.920 |
When you train on some sequence, then you need to mask out, because the model will generate all the 03:45:48.920 |
Contextualized embeddings in parallel and you want each contextualized embedding to only be contextualized on the previous tokens 03:45:54.600 |
So you need to mask out. So during training we will have a causal mask, but during inference, which is our case 03:46:00.200 |
We don't have any causal mask, at least when working with the KV cache and at least 03:46:04.040 |
When working with models like PaliGemma. If you work with a normal language model, like LLaMA 03:46:09.880 |
For example, when you do the pre-filling you actually need to mask out the pre-filling part 03:46:14.200 |
But in the case of PaliGemma, because of the choices made by the PaliGemma team, we do not need to mask out anything 03:46:19.640 |
And this is why we do not need to mask out anything 03:46:22.840 |
In the future I plan to make another video on how to fine-tune this model that we have made 03:46:27.720 |
And we will see that we will need to introduce some kind of mask 03:46:31.080 |
And the mask will have to be generated exactly like shown in the PaliGemma paper. So let me check if my camera is still working 03:46:39.080 |
Sometimes I lose connection with my cam. So I need to check every once in a while. So 03:46:46.920 |
We have created this mask, which is filled with zeros, because 03:46:49.480 |
We would need to fill minus infinities into all the positions where we want to mask out something 03:46:55.000 |
But we never mask out anything. So we always make this tensor full of zeros 03:46:58.920 |
when we are pre-filling we generate a sequence by sequence mask, but when we are 03:47:04.760 |
Generating tokens, we only generate the last row of that matrix. So we have only one 03:47:11.080 |
Query; as you can see, we assert that the query length is equal to one 03:47:13.800 |
So we only have one query, and then we have as many keys as there are in the KV cache 03:47:19.720 |
We add the plus one to this KV cache count because, before using the KV cache, we add the current token 03:47:25.480 |
So the query token inside of the KVCache then we extract it before calculating the self-attention like we saw before 03:47:31.000 |
As you know, when we do the attention computation, we have one attention computation for each head 03:47:37.720 |
So we need to add the head dimension because there will be one attention matrix for each head 03:47:42.120 |
And that's why we add this head dimension here 03:47:50.200 |
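A minimal sketch of this mask creation, under the assumption that kv_cache.num_items() returns how many tokens are already cached and that batch_size, q_len, dtype and device are available (all names are illustrative):

    import torch

    if kv_cache is None or kv_cache.num_items() == 0:
        # Pre-filling: the prompt attends to itself with no masking, so the mask is all zeros
        causal_mask = torch.zeros((batch_size, q_len, q_len), dtype=dtype, device=device)
    else:
        # Token generation: one query attends to all cached tokens plus itself
        kv_len = kv_cache.num_items() + q_len
        causal_mask = torch.zeros((batch_size, q_len, kv_len), dtype=dtype, device=device)

    # Add the head dimension: there will be one attention matrix per head
    causal_mask = causal_mask.unsqueeze(1)  # (batch, 1, q_len, kv_len)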
We need to generate the positions of the tokens that will be used by the rotary positional encodings 03:47:56.380 |
So when we are working with the pre-filling part of the KVCache 03:48:01.240 |
It means that we have n tokens that are part of the prompt of the user which are the image tokens plus the text tokens 03:48:07.720 |
Then we need to generate enough positions to apply the rotary positional encodings 03:48:13.480 |
How many of them do we need? As many as there are tokens in the prompt 03:48:20.360 |
Which is indicated also by the number of ones in the attention mask generated by the PaliGemma processor code 03:48:28.840 |
It will give you the input IDs and another tensor of the same size as the input IDs with all ones 03:48:35.300 |
Indicating that we do not mask out anything and if you count the number of ones it also gives you how many tokens there are 03:48:41.140 |
In the input IDs, so that's what we are doing here 03:48:43.380 |
We generate enough positions. So when we are doing the pre-filling suppose that the pre-filling is made up of 256 image tokens 03:48:52.660 |
And then three tokens of the textual prompt. So what we will this will generate basically 0, 1, 2, blah, blah, blah 03:49:05.520 |
A sequence like this. This sequence will be then used to understand which 03:49:09.920 |
Rotary positional encoding we need to apply to each token 03:49:15.760 |
When we do token generation, we only have one single query to which we need to apply the positional encoding 03:49:27.040 |
So this will generate only one single position, which is the position corresponding to the last token 03:49:37.360 |
So when we do token generation basically we have some tokens that are already saved in the KV cache 03:49:41.840 |
And then we have one new token, which is the last predicted token, which we use as a query 03:49:46.080 |
To understand what is the position of this token 03:49:49.120 |
We also pass the attention mask. In the case of the attention mask 03:49:52.640 |
It will indicate that it's all made up of ones. How many ones? Well 03:49:57.200 |
Based on how many tokens there are in the KV cache 03:50:00.000 |
Plus one because we also have the new token that we need to add to the KV cache before doing the self attention 03:50:05.040 |
So what we are doing here is the same: we are counting how many ones there are in the attention mask 03:50:15.120 |
And this is how we generate the position IDs 03:50:21.280 |
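A minimal sketch of this logic, assuming attention_mask is the all-ones tensor described above and kv_cache.num_items() tells us whether we are pre-filling (names are assumptions):

    if kv_cache is None or kv_cache.num_items() == 0:
        # Pre-filling: one position per prompt token (image tokens + text tokens), i.e. 0, 1, 2, ...
        position_ids = attention_mask.cumsum(-1) - 1              # (batch, seq_len)
    else:
        # Token generation: a single position, the one of the newly added query token
        position_ids = attention_mask.sum(-1, keepdim=True) - 1   # (batch, 1)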
And then we return this stuff here, so let me return this stuff 03:50:29.840 |
So what does this method do? This method basically takes as input the image features 03:50:34.800 |
It takes as input the input IDs and the input embeddings 03:50:38.240 |
What are the input embeddings? They include the embeddings of the image placeholders, which we will not use 03:50:45.280 |
And then the image features. Our goal is to put all the image features in the right places in these input embeddings, based on where 03:50:52.240 |
These image placeholder positions are 03:50:59.200 |
Here then we create the attention mask, which is basically just made up of zeros. But 03:51:04.720 |
Do not confuse the zeros in the attention mask 03:51:07.520 |
We are creating here with what we are probably commonly used to see in the attention mask 03:51:15.120 |
So usually you are probably used to seeing the attention mask as a bunch of ones and zeros, where the zero indicates which position 03:51:21.440 |
Should be masked and the one indicates which position should not be masked 03:51:25.600 |
These ones and zeros are then converted into a series of minus infinities and zeros before being added to the attention scores 03:51:36.000 |
Instead of creating ones and zeros which are then converted into minus infinities and zeros 03:51:41.200 |
We are already creating the mask that can be directly added to the attention scores 03:51:45.280 |
So we are creating a bunch of zeros, which basically means that 03:51:51.280 |
It's like you are not masking out anything 03:51:53.440 |
If you want to mask out something then you need to add some minus infinities in this mask, but we never add any minus infinities 03:52:02.240 |
And this is our method that combines the image features with the text tokens 03:52:07.680 |
Our next goal is to create the structure of the PaliGemma model 03:52:11.220 |
Actually, we can create this PaliGemma multimodal projector. Yeah 03:52:15.280 |
All right. So let's create this PaliGemma multimodal projector. Let me put away this stuff here 03:52:21.840 |
We just copy it. It's very simple. I just I don't even need to copy first the constructor and then 03:52:28.400 |
So the PaliGemma multimodal projector is just the linear layer that converts the size of the image features 03:52:34.620 |
Extracted from the vision encoder into the same size as the embedding size that is used by the language model 03:52:41.900 |
So it's just a linear layer that converts the hidden size of the vision model into the projection dimension 03:52:55.420 |
And this projection dimension, as you can see here, is equal to the hidden size of the language model 03:53:04.460 |
So it's basically resizing the embeddings so that they can be concatenated with the text tokens 03:53:11.020 |
Let's go back here. So as you can see, we are just applying this linear layer 03:53:15.980 |
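A minimal sketch of such a projector module (class and attribute names are assumptions for illustration):

    import torch.nn as nn

    class PaliGemmaMultiModalProjector(nn.Module):
        def __init__(self, config):
            super().__init__()
            # From the vision hidden size to the projection dimension (= text hidden size)
            self.linear = nn.Linear(config.vision_config.hidden_size,
                                    config.vision_config.projection_dim, bias=True)

        def forward(self, image_features):
            # (batch, num_patches, vision_hidden) -> (batch, num_patches, projection_dim)
            return self.linear(image_features)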
Our next step is to code the language model itself. So the language model, the Gemma language model, is a transformer model 03:53:25.900 |
So we create this Gemma for causal language modeling 03:53:30.860 |
Which takes the configuration of the Gemma model as input, and the Gemma model itself, which we will create later 03:53:36.060 |
Basically in HuggingFace, whenever you see something called something-for-causal-language-modeling 03:53:42.860 |
It is a transformer model plus a language modeling head, which is the linear layer that projects each embedding into logits 03:53:52.300 |
So this Gemma model is basically the transformer model, and then GemmaForCausalLM is the Gemma model plus a linear layer 03:53:59.820 |
That's why we are reusing this instance plus a linear layer. So the forward method will be very simple 03:54:11.020 |
Weight tying: so we saw before that weight tying basically means that we share the weights of the embedding 03:54:17.180 |
Layer with the logits layer. So this is what we are doing 03:54:20.380 |
So when we tie weights, we just copy from the embeddings to the language modeling head 03:54:25.420 |
Which is the linear layer that converts the embedding into logits 03:54:29.920 |
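As a rough sketch (attribute names like lm_head and embed_tokens are assumptions), weight tying can be as simple as sharing the same weight tensor:

    def tie_weights(self):
        # The output projection reuses the same weight matrix as the input embedding layer
        self.lm_head.weight = self.model.embed_tokens.weight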
Then we have the forward method, which is also very simple, because it will not do anything except 03:54:36.480 |
Sending the stuff to the language model and then applying this 03:54:40.400 |
Language modeling head, which is the linear layer that converts the embeddings into logits 03:54:50.000 |
So the attention mask the position IDs the input embeddings the kvcache we send it to this language model, which we will implement later 03:54:56.960 |
The output of this language model will be a series of embeddings, but we do not want embeddings. We want logits. So 03:55:04.880 |
We take the outputs. We take the hidden states from these outputs, which are the series of embeddings 03:55:10.560 |
We apply the language modeling head, so the linear layer, and we make sure the result is in floating point 03:55:19.920 |
And that's it: we return the logits, and if the user specified the KV cache, we also return the updated KV cache. That's it 03:55:27.680 |
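A minimal sketch of such a forward method, assuming the underlying transformer body returns a dict-like output with "hidden_states" and an optional "kv_cache" (all names are assumptions):

    def forward(self, attention_mask, position_ids, inputs_embeds, kv_cache=None):
        # Run the transformer body (implemented later); it returns contextualized embeddings
        outputs = self.model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            kv_cache=kv_cache,
        )
        hidden_states = outputs["hidden_states"]
        logits = self.lm_head(hidden_states).float()  # project embeddings into logits

        return_data = {"logits": logits}
        if kv_cache is not None:
            return_data["kv_cache"] = outputs["kv_cache"]
        return return_data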
Because here there is no logic; the logic will be in the Gemma model 03:55:32.320 |
Yeah, so let's go implement the Gemma model. All right 03:55:37.120 |
So what is a language model? A language model is an embedding layer plus a series of transformer layers 03:55:44.000 |
And then we have the language modeling head. The language modeling head is already implemented here in GemmaForCausalLM 03:55:50.160 |
So we just need to create the other part which is the embedding layer and the list of transformer layers 03:55:55.440 |
Let's do that. So we create first the constructor, with the 03:56:04.000 |
Information that it needs: the vocabulary size. Why do we need the vocabulary size? Because we need to create the embeddings; how many embeddings? 03:56:12.480 |
Depending on the number of tokens in our vocabulary; each embedding vector will be of size hidden_size 03:56:19.600 |
This one indicates the position of the padding token inside of the vocabulary 03:56:23.060 |
And basically I think the embedding layer takes it as input so that it does not update the gradient for that token 03:56:37.440 |
These here are called Gemma decoder layers, so they are the transformer layers; how many of them do we have? 03:56:45.440 |
Depending on this parameter num_hidden_layers. And then we have a final normalization, which is an RMS normalization, which I will describe later 03:56:52.880 |
What is it and why it's different from a layer normalization? 03:56:56.020 |
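A minimal sketch of this constructor (class and attribute names are illustrative assumptions; GemmaDecoderLayer and GemmaRMSNorm are built later):

    import torch.nn as nn

    class GemmaModel(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.config = config
            # Token embeddings; padding_idx tells the layer not to update gradients for the pad token
            self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size,
                                             padding_idx=config.pad_token_id)
            # The stack of transformer (decoder) layers
            self.layers = nn.ModuleList(
                [GemmaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
            )
            # Final RMS normalization before the language modeling head
            self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        def get_input_embeddings(self):
            return self.embed_tokens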
We need to implement this method here get_input_embeddings, which is used by the language modeling head. So as you can see we use it 03:57:07.760 |
We use it here to extract the initial embeddings 03:57:10.960 |
From the language model, which are then combined with the image features, as we saw before here, and then sent to the language model 03:57:16.800 |
So the language model here is receiving not the input IDs, but it's receiving the embeddings already 03:57:21.840 |
So the image embeddings plus the text embeddings 03:57:24.420 |
Which are the same embeddings that we will receive here in the forward method of the Gemma model 03:57:33.280 |
Which is also very simple because we do not implement much logic here 03:57:39.600 |
So we receive the attention_mask and the position_ids, which are the positions that we will use for each token 03:57:45.200 |
How to apply the positional encoding to each token 03:57:48.800 |
We didn't talk about the positional encodings yet, because we apply the rotary positional encodings in this case, which are applied differently 03:57:57.200 |
So they are not applied at the beginning like we saw before with SigLip or with the vanilla transformer 03:58:02.320 |
But they are applied just before calculating the attention 03:58:06.160 |
We have the input embeddings which we saw before are the image features plus the text tokens 03:58:11.520 |
And in case we have the KV cache, also the instance of the KV cache, which we didn't implement yet 03:58:20.960 |
Let's do it. So the first thing that it does is 03:58:24.560 |
Taking the input and applying some kind of normalization, for the same reason we apply 03:58:31.020 |
Normalization also to the input of the image features 03:58:33.500 |
We want the kind of the magnitude of the numbers to remain the same even if the number of dimensions increases 03:58:38.560 |
then this language model is made up of a series of layers of 03:58:43.660 |
Transformer layers. So what we do is the output of one layer becomes the input of the next one 03:58:56.860 |
So we take the decoder layer we send it the first hidden state which is the input of this forward after it's been normalized 03:59:04.160 |
We send the attention mask, we send the positional encodings and the KV cache, and it will return something which is 03:59:10.380 |
Contextualized embeddings which become the input of the next layer 03:59:14.860 |
So we replace basically these hidden states with the output of the first layer so that it becomes the input of the next layer 03:59:24.380 |
The output of the last layer we send it to a normalization 03:59:28.240 |
Layer which is the rms normalization, which we didn't see yet, but we will talk shortly 03:59:35.740 |
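To make this concrete, here is a minimal sketch of what this GemmaModel looks like. It follows the structure just described; the exact config field names (vocab_size, hidden_size, num_hidden_layers, rms_norm_eps, pad_token_id) are assumptions, and GemmaDecoderLayer and GemmaRMSNorm are the classes we implement later in the video.

```python
import torch
from torch import nn

class GemmaModel(nn.Module):
    # Minimal sketch: embedding layer + stack of decoder layers + final RMS norm.
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(
            config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
        )
        self.layers = nn.ModuleList(
            [GemmaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def get_input_embeddings(self):
        return self.embed_tokens

    def forward(self, attention_mask, position_ids, inputs_embeds, kv_cache=None):
        # Scale the embeddings so their magnitude does not depend on the hidden size.
        hidden_states = inputs_embeds * (inputs_embeds.shape[-1] ** 0.5)
        for decoder_layer in self.layers:
            # The output of one layer becomes the input of the next one.
            hidden_states = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
                kv_cache=kv_cache,
            )
        return self.norm(hidden_states)
```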
So I want to actually redraw what we are doing so far. So we have arrived 03:59:53.340 |
What we are doing basically is this so we have created the 03:59:56.620 |
Embeddings before we have merged them with the image tokens and the text tokens 04:00:01.420 |
We did not apply any positional encodings because we are doing the rotary positional encodings 04:00:06.800 |
Which are applied exactly when we calculate the attention 04:00:10.560 |
So if we were to draw the Gemma architecture, it would be like this. So we have the embeddings 04:00:22.620 |
Then, if I remember, there is some kind of normalization we are 04:00:25.040 |
Doing, but it's not a normalization layer; we are just normalizing the embeddings 04:00:31.420 |
So it's not a layer, actually, so we do not have to draw it 04:00:34.300 |
Then we have a series of layers and we have n of them 04:00:37.900 |
Each of these layers is made up of a normalization 04:00:55.660 |
I think I made it too small, so let's make it a bit bigger 04:01:03.180 |
Then we take the output of this one and send it to another normalization, which is again an RMS normalization 04:01:12.640 |
The output of this one is sent again to another block 04:01:22.060 |
Then the output of the last layer will be sent to again another normalization, which is the rms normalization 04:01:28.640 |
Then we send it to a linear layer for the logits 04:01:32.480 |
A linear layer, and let me shift it down, and then we have the softmax. So far 04:01:43.820 |
So far what we have made is basically we are now creating this structure here, but without coding the single block 04:01:52.620 |
And the forward method, which will run the embeddings through each of these layers one after another and apply the final normalization 04:02:04.380 |
And then it will be sent to this linear layer 04:02:09.900 |
By GemmaForCausalLM, because as you can see, GemmaForCausalLM will take the output of this model 04:02:18.540 |
So everything except the linear layer, and then it will apply this linear layer called the language modeling head, which will convert it into logits 04:02:25.420 |
And after we will apply the softmax, but that is for sampling 04:02:29.020 |
So now we need to create this decoder layer. So what is this decoder layer? 04:02:32.940 |
This decoder layer is this stuff here. We need to code the normalization. We need to code the attention mechanism 04:02:38.940 |
We need to code the feed-forward network and of course all the skip connections. So let's do it 04:02:43.580 |
All right. The first thing that we can implement actually very easily is the rms normalization. So let's explore it 04:02:54.620 |
What we are doing is that we are normalizing each value using some statistics collected from each item itself in the batch 04:03:05.500 |
Imagine it's a batch of pictures and the first picture is that of a cat. In layer normalization 04:03:09.740 |
What we are doing is, for each dimension of this vector 04:03:12.620 |
We calculate two statistics using this vector, which are the mean and the standard deviation 04:03:18.480 |
And then we normalize each value in this vector using these two statistics. How do we normalize? Well, we recenter it around zero 04:03:27.260 |
Here it's not written, but I can show you the formula here 04:03:30.300 |
You basically subtract the mean that you calculated and you divide it by the standard deviation 04:03:35.760 |
And the layer normalization actually works fine 04:03:39.980 |
But recently in most language models, we are seeing another kind of normalization that is known as root mean square normalization 04:03:47.120 |
Basically what we do with this normalization is that each of these features in this 04:03:57.740 |
We are normalizing it in such a way that it becomes like it's coming out from a distribution 04:04:03.120 |
Gaussian distribution with a center of zero and a variance of one 04:04:08.620 |
What they claim in the root mean square normalization paper is that the success of 04:04:17.260 |
Layer normalization is not due to its recentering invariance, but to its rescaling invariance 04:04:26.860 |
In actually reducing this internal covariate shift, which is the reason we use normalization. That is, the model does not need to see values 04:04:36.700 |
Centered around zero; it just needs to see the values mostly concentrated around whatever mean they are centered upon 04:04:45.420 |
So the values of this cat, for example, they do not need to be all around zero 04:04:51.900 |
They could be all around 500 or all around minus 100 as long as they are more or less around 04:04:58.300 |
500 or more or less around minus 100 all of them 04:05:02.060 |
That's the meaning of reducing the variance to one 04:05:06.140 |
So we want most of the values to be around whatever mean it is 04:05:12.700 |
This is the claim made by this paper, and it's actually verified, because most language models right now 04:05:18.060 |
Do not suffer from internal covariate shift: they can be trained successfully, very fast, just like the layer-normalized ones 04:05:25.420 |
But by using this root mean square normalization. Why is it advantageous 04:05:31.660 |
Compared to layer normalization? Because instead of computing two statistics, the mean and the variance 04:05:39.100 |
We only need to compute one statistic, which is this root mean square statistic 04:05:44.380 |
Why do we not compute just the standard deviation like we do with layer normalization? Because to compute the standard deviation you need the mean 04:05:53.260 |
But we do not want to compute the mean, because we do not want to recenter the values 04:05:58.620 |
And because we don't compute the mean, we cannot compute the standard deviation 04:06:05.900 |
So we replace the standard deviation with another statistic that allows us to 04:06:11.100 |
Reduce the variance, which is this root mean square statistic 04:06:14.640 |
Which is calculated as follows. So we take each item in this vector 04:06:19.660 |
So this item, this item, this item, this item, this item, this item 04:06:22.540 |
We square each of these items, we sum them all up, and we calculate 04:06:29.120 |
The mean of this summation, so we divide by n basically 04:06:33.020 |
Then we take the square root, and this gives us the root mean 04:06:38.540 |
Square statistic for this vector. Then we take each item and divide it by this statistic 04:06:44.380 |
Multiplied by a learnable parameter called gamma, which is one for each feature 04:06:50.380 |
So basically with root mean square normalization, we are obtaining the same 04:06:59.580 |
I mean, it solves the same problem of the internal covariate shift as layer normalization, but by computing one less statistic 04:07:07.340 |
So we compute less statistics. So it is faster basically 04:07:12.940 |
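Written out, for a vector $x$ with $n$ features, the statistic and the normalization just described are:

$$\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}, \qquad \hat{x}_i = \frac{x_i}{\mathrm{RMS}(x)} \cdot \gamma_i$$

where $\gamma_i$ is the learnable gain for feature $i$ and $\epsilon$ is a small constant for numerical stability, which is discussed below.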
Okay. Yeah, so let's implement it. Let me put away this stuff 04:07:17.740 |
All right, so now we copy this class we put it here 04:07:35.180 |
It's very simple. Okay. So what we are doing with rms normalization is that okay 04:07:39.740 |
We are creating a weight parameter, which is a 04:07:41.820 |
Set of parameters, one for each feature in the vector to which we apply this root mean square normalization. How many 04:07:48.700 |
Dimensions will this vector have? Well, the same as the tokens, because we are going to normalize tokens 04:07:55.820 |
So this dim will be the hidden dimension of our language model 04:08:00.300 |
We compute this root mean square statistic as follows. So we calculate the square of each item 04:08:08.220 |
So what we are calculating here is basically this term here. So let me 04:08:16.700 |
Then we do one over the square root of this, which is this rsqrt 04:08:21.340 |
So we are not doing the square root; we are actually calculating 04:08:25.500 |
One over the square root of whatever is the argument of rsqrt, so this stuff here 04:08:31.260 |
And instead of dividing each item, we are multiplying by one over the square root, which is exactly like dividing by it 04:08:41.820 |
Why do we have this term here, plus self.eps, in the argument of the square root? 04:08:51.900 |
Well, because this r sqrt is one over the square root of 04:08:58.780 |
But if the computation of this statistic produces a number that is very close to zero in this division 04:09:05.500 |
We are basically dividing by zero, which will make the output of this division, this number here, very big. So 04:09:12.780 |
To avoid this division by zero, we add to the denominator of this division a very small number called eps 04:09:22.080 |
As you can see, it's a very small number to avoid this division by zero 04:09:25.200 |
And it's the same parameter that we also pass in the layer normalization as you can see here 04:09:29.600 |
We pass this parameter, which is a very small number to avoid this division by zero 04:09:33.280 |
So the forward method is basically just doing this normalization and then we multiply each of this number by this gamma parameter 04:09:41.120 |
Which is a learnable parameter as you can see 04:09:43.920 |
Here, so we have here we have this gamma parameter 04:09:53.840 |
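Here is a minimal sketch of this RMS normalization module, roughly mirroring what was just described; the (1 + weight) gain with the weight initialized to zero is how this Gemma-style implementation parameterizes the learnable gamma.

```python
import torch
from torch import nn

class GemmaRMSNorm(nn.Module):
    # Minimal sketch of RMS normalization with a learnable per-feature gain.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # the learnable "gamma", one per feature

    def _norm(self, x):
        # x * 1/sqrt(mean(x^2) + eps): same as dividing by the RMS statistic.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float())
        # Gemma-style gain: (1 + weight), with weight initialized to zero.
        output = output * (1.0 + self.weight.float())
        return output.type_as(x)
```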
Now we can move to the next part, which is the coding of this decoder layers 04:10:01.920 |
Let me check the Gemma model so we can create the decoder layer. So let's copy some code 04:10:13.440 |
All right, so the decoder layer as we saw before it's this stuff here 04:10:17.680 |
So we need to create something that manages all these blocks here 04:10:22.400 |
So something that takes as input a list of embeddings, applies a normalization, then applies the self-attention 04:10:31.200 |
Then the output is sent to another normalization, then to a feedforward layer block, then again another skip connection, then produces some output 04:10:38.240 |
So we will just create this simple block, which has the same structure as the encoder layer that 04:10:44.080 |
We have created in SigLIP. So it's the equivalent of 04:10:46.560 |
This block here, the encoder layer. It will be doing the same job 04:10:54.640 |
So what we are doing is we are saving some stuff 04:10:57.520 |
So the hidden size of the model then we are creating the attention 04:11:00.800 |
Block, which we will code later the multi-layer perceptron, which is the feedforward network block 04:11:06.240 |
The first normalization and the second normalization because in the decoder block we have two normalizations 04:11:10.900 |
So as you can see here, we have one normalization here and one here 04:11:14.640 |
So the forward method is very similar to the one we have coded for SigLIP 04:11:23.440 |
Input to this layer the attention mask, which will be sent to the attention mechanism the position 04:11:28.800 |
Ids which also will be sent to the attention mechanism because we are using the rotary positional encodings 04:11:33.920 |
And the KV cache, which also will be sent to the attention mechanism 04:11:36.660 |
So let's actually let me just copy it and then I explain it because it's the same as the encoder 04:11:42.960 |
So we take the input we apply the first normalization to this input which is 04:11:53.460 |
This hidden state; we send it to the self-attention block along with the attention mask, the positional encodings and the KV cache 04:12:00.320 |
And this will produce an output which will be then summed up with the skip connection here, which is this stuff here 04:12:06.080 |
So we take the output which is hidden states plus this residual which we saved before to create the skip connection 04:12:11.840 |
then we create another skip connection and we send the output of the 04:12:19.660 |
Self-attention to the second normalization, which is this stuff here this normalization 04:12:25.060 |
The output of the normalization is sent to the multi-layer perceptron, which is this one here 04:12:30.880 |
And then we take the output of the multi-layer perceptron 04:12:33.600 |
Which is the feed forward network plus the skip connection that we saved before which is this residual stuff here 04:12:38.960 |
And that's this plus sign here and the output is then returned and this is the decoder layer 04:12:45.280 |
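As a reference, here is a minimal sketch of this decoder layer; GemmaAttention and GemmaMLP are the blocks we code next, and the exact signatures are assumptions.

```python
from torch import nn

class GemmaDecoderLayer(nn.Module):
    # Minimal sketch: pre-norm attention block + pre-norm MLP block, each with a skip connection.
    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.self_attn = GemmaAttention(config, layer_idx)  # implemented later
        self.mlp = GemmaMLP(config)                          # implemented later
        self.input_layernorm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(self, hidden_states, attention_mask, position_ids, kv_cache=None):
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states, _ = self.self_attn(hidden_states, attention_mask, position_ids, kv_cache)
        hidden_states = residual + hidden_states  # first skip connection

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states  # second skip connection
        return hidden_states
```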
Now we need to code the multi-layer perceptron and the self-attention 04:12:49.620 |
Block. I believe the fastest thing to do is the multi-layer perceptron, so let's do that first 04:13:00.160 |
It's also very similar to the multi-layer perceptron that we have already coded for SigLIP 04:13:08.240 |
So the multi-layer perceptron here, which is also known as feed-forward network, is basically, as we saw before in SigLIP 04:13:14.560 |
Made of two linear layers: the first expands the embedding 04:13:20.000 |
Vector applies some non-linearity and then reduces it back to the original size and this is what is done here 04:13:27.520 |
But in this case, we also have another linear layer called the gate projection 04:13:32.580 |
Which is used by the activation function that this Gemma language model is using 04:13:37.600 |
We saw that different language models have different activation functions, which is based mostly on heuristics on how they work 04:13:45.520 |
So let's implement the forward method, which is very simple here and we will see why we need this gate projection 04:13:53.360 |
I changed the code to convert this very long line into a 04:14:00.000 |
Series of steps, so that you can see each single step being done independently 04:14:04.980 |
but let me describe it what we are doing here basically is 04:14:08.480 |
First we are applying the gate projection to the input to this feed forward network, which is a list of embeddings as we saw before 04:14:17.920 |
And the function that we are using is the GeLU function, which I believe is the same one that we are also using for SigLIP 04:14:33.600 |
So basically it's adding some learnable parameters before sending it to this activation function 04:14:39.600 |
We multiply the output of this activation function with the up projection 04:14:45.120 |
The up projection is basically the one that takes the embedding size from the original embedding to the intermediate size 04:14:53.120 |
And then the result of this multiplication, which is a vector 04:14:57.920 |
Which is a tensor of size batch size sequence length and the intermediate size is then reduced back to the original size by this 04:15:04.960 |
Down projection because with the up projection you are expanding and the down projection you are putting it back to the original size 04:15:11.440 |
So the down projection will take the intermediate size back into the hidden size, and this is the multi-layer perceptron of Gemma 04:15:17.200 |
It's slightly different than the other one because we have this gate projection 04:15:23.360 |
And it's the same kind of gate projection that we also have, if I remember correctly, in LLaMA, in which we have a similar gated activation 04:15:29.520 |
With its own gate projection. It's just parameters that are learnable before applying the non-linearity 04:15:35.220 |
We also said that the non-linearity is chosen based on heuristic on how they work well in particular case 04:15:41.280 |
But also on some properties that we want from them with respect to the gradient. So some 04:15:46.160 |
Activation functions allow the gradient to flow for negative values. Some others don't allow it, etc, etc 04:15:52.640 |
So it's all based on practical application: someone tried using it, saw that it works better, and then we all start using it 04:16:00.560 |
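A minimal sketch of this gated feed-forward block, assuming the config exposes hidden_size and intermediate_size and that the activation is the tanh-approximated GeLU mentioned above:

```python
from torch import nn

class GemmaMLP(nn.Module):
    # Minimal sketch of the gated feed-forward network (gate, up and down projections).
    def __init__(self, config):
        super().__init__()
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x):
        # y = down( gelu(gate(x)) * up(x) )
        gate = nn.functional.gelu(self.gate_proj(x), approximate="tanh")
        up = self.up_proj(x)                 # expand to the intermediate size
        return self.down_proj(gate * up)     # bring it back to the hidden size
```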
Okay, now we also have the multi-layer perceptron now comes the biggest part 04:16:05.600 |
And but not the hardest because we are already familiar with the attention mechanism 04:16:09.280 |
So we need to code the attention mechanism, which will comprise the self-attention, the use of the KV cache 04:16:14.960 |
The grouped query attention which is something new and the rotary positional encoding. So it will be a little bit of learning experience. So let's start 04:16:22.400 |
All right. So let's start coding the next part, which is the Gemma attention. So we start by creating the class 04:16:33.120 |
And I will do it slowly because this one has a lot of innovations 04:16:36.820 |
So let's start by creating the constructor, which is our usual constructor 04:16:41.540 |
It takes in the configuration of Gemma. We also take another parameter, which is the index of the layer in the 04:16:50.000 |
Transformer, because as you know, Gemma is a decoder- 04:16:54.080 |
Only model; it's made up of many layers, and each of these layers will have its own KV cache 04:17:02.480 |
So to know which KV cache to use because there is one cache for each layer. We need to also pass the layer index 04:17:09.600 |
To each layer so it knows where to put its key and values 04:17:18.080 |
Then the attention dropout, which we will not use; the hidden size, which is the size of the embedding vector of each token 04:17:29.040 |
The number of heads for the queries; the head dimension, which is how many dimensions each head will watch 04:17:41.680 |
Which is a part of the entire embedding of each token 04:17:45.200 |
And how many heads we have for the keys and values in the multi-head attention 04:17:52.320 |
And this is different from those for the query because we are going to talk about grouped query attention 04:17:57.280 |
So we can calculate how many groups we have in this grouped query attention, but later I will explain how it works 04:18:02.000 |
The maximum positional embeddings which are how many positions we can encode in the positional encoding using the rotary positional encoding 04:18:10.400 |
And what is the base frequency of the rotary positional encodings? 04:18:16.640 |
So first of all, we make sure that the hidden size is divisible by the number of heads because as you know 04:18:22.880 |
Each head has to watch a part of the embedding of the entire token 04:18:26.560 |
So it must be divisible by the number of heads 04:18:28.560 |
Then we create our projections which are the wq wk and wv projections that we saw in the multi-head attention 04:18:36.960 |
But in this case, we can see that we do not have the hidden size as the number of output features 04:18:47.200 |
Instead, the number of output features is calculated as the number of heads multiplied by the head dimension 04:18:52.320 |
Now, why is this different from the multi-head attention that we have implemented for SigLIP? 04:18:57.440 |
So if we go to look at SigLIP and we look at the attention 04:19:01.840 |
You can see that each of these wq, wk and wv matrices is 04:19:06.640 |
Hidden size by hidden size here. It's called the embedding dimension, but okay, it's the same thing 04:19:11.200 |
So it's the size of the entire token with the output features being also the same number of dimensions 04:19:22.160 |
If we look at what numHeads is: numHeads is the number of heads for the query, and 04:19:26.960 |
The number of heads for the query in grouped query attention is 04:19:32.320 |
Bigger than the number of heads for the keys and values; later 04:19:39.280 |
We will see why, but for now, let's concentrate on the dimensions. So in this case this wq matrix 04:19:44.720 |
Which is called q_proj, and which is the wq 04:19:48.800 |
Matrix in the multi-head attention has an output a number of output features. So suppose that the number of heads 04:19:55.440 |
So number of heads is equal to 8 and suppose that the hidden size is equal to 1024 04:20:07.040 |
Then wq is 1024 by 8 multiplied by the head dimension. But what is the head dimension? The head dimension is how many 04:20:15.820 |
Dimensions each head will watch, using the number of heads of the query as a reference, so 1024 divided by 8 04:20:28.300 |
So it's 8 multiplied by 128, so actually the wq matrix is 1024 by 1024 04:20:35.440 |
What changes in grouped query attention are the wk and wv projections. wk will have 04:20:43.480 |
1024 as the input features, because that's the hidden size, and the output features will be the number of heads for the keys and values multiplied by the head dimension 04:20:54.040 |
In the configuration we can see that the number of heads for the 04:20:58.440 |
Queries is 8 and the number of heads for the key and values is only one 04:21:04.600 |
So actually this is not the case of grouped query attention; it's multi-query attention. So 04:21:09.240 |
Let's say, okay, suppose that we have only one head here. So one multiplied by 128, which is equal to 128 04:21:18.820 |
And the same size is also for wv because as you can see the expression in wv is the same 04:21:25.480 |
it's the number of heads for the key value multiplied by the head dimension and then we have the output projection, which is a 04:21:33.640 |
Hidden size by hidden size because the number of heads multiplied by the head dimension 04:21:37.480 |
So here the number of heads is 8, which always references the number of heads of the queries 04:21:45.720 |
So as you can see, the difference with grouped query attention is that we have fewer heads for the keys and values 04:21:57.320 |
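To make the shapes concrete, here is a small sketch of the four projections with the example numbers used above (hidden size 1024, 8 query heads, 1 key/value head, head dimension 128); these are the illustration numbers from the video, not necessarily Gemma's real configuration.

```python
from torch import nn

hidden_size, num_heads, num_kv_heads = 1024, 8, 1
head_dim = hidden_size // num_heads  # 128

q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)     # 1024 -> 1024
k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)  # 1024 -> 128
v_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)  # 1024 -> 128
o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)     # 1024 -> 1024
```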
So each token is projected into a smaller embedding when it's used as keys and values. Let's see why; let me open a new 04:22:04.120 |
Page and let's switch to the ipad which is here 04:22:10.520 |
In normal multi-head attention, what we have is that each token is divided into multiple groups of dimensions 04:22:17.400 |
One dedicated to each head. Suppose that we have an initial token 04:22:21.800 |
Let me use a pen and let's use a smaller size. So imagine that we have a token with 1024 04:22:32.260 |
Dimensions in total. If we divide it into eight heads 04:22:36.340 |
We will have that each head will manage 128 dimensions of this token, so dimensions 1 to 128 04:22:52.480 |
Etc., etc., until the last one, which will manage the last 04:23:06.820 |
128 dimensions. So this is head number eight 04:23:20.900 |
When we do the product query multiplied by the transpose of the keys each of the query is 04:23:29.360 |
Multiplied so dot product with each of the keys, but only in the part 04:23:34.800 |
Dedicated to each head because each head is working independently 04:23:38.500 |
So suppose that this is our query. So this is our query. Let me write it with a different color. So 04:23:49.860 |
And this key also in the normal multi head attention. We have the same number of heads for the query and the keys 04:23:57.220 |
So suppose that we have the same number of heads also here so we can copy this stuff, I guess 04:24:12.640 |
So what will happen with normal multi-head attention is that each head will do the dot product of its 04:24:22.200 |
Part of the token. The head number one, for example, will do the dot product of the first 04:24:27.500 |
128 dimensions of the query with each of the keys because you need to think that we don't have one key. We have multiple keys 04:24:35.160 |
Because it's a matrix. The matrix is a sequence by sequence. So each head each query is attending to all the past keys 04:24:49.680 |
So key number one key number two and key number three and this is the query number one and we do it for all the 04:24:54.640 |
Queries so for each token each token will attend all the past tokens as keys 04:25:02.400 |
So what will happen is that we are doing a dot product 04:25:06.320 |
With the first head will do a dot product of the first 04:25:09.560 |
128 dimensions between the query and the key then again between this query and this key and then between 04:25:16.520 |
This query and this key in parallel the head number two will do the same stuff 04:25:22.200 |
so the head number two will take the next group of 04:25:25.560 |
128 dimensions, so the dimensions from 129 to 256, and will do the dot product with the corresponding dimensions of the keys 04:25:37.380 |
So it will do the dot product of this query with this key and then this query with this key 04:25:53.080 |
Now what happens is that and we do it for all the heads 04:25:58.560 |
The problem with multi-head attention, and this was described in the multi-query attention paper 04:26:06.720 |
So if you want, I can give you the reference to the paper 04:26:20.840 |
Basically, Noam Shazeer described what the problem with multi-head attention is, at least from a computation point of view 04:26:31.320 |
The problem is not in the number of computations that we are doing; the bottleneck of the computation is the 04:26:40.480 |
Data transfer that is happening in the GPU because of this multi-head attention, and for that we need to talk about how a GPU works 04:26:52.600 |
So: a GPU has a very big memory called the high bandwidth memory 04:26:59.400 |
Which is in the order of gigabytes or tens of gigabytes. I think the 04:27:04.880 |
A100 goes up to 80 gigabytes. Then we have some smaller memory called local memory 04:27:14.360 |
And this one is in the order of megabytes; I don't know if it's ten megabytes 04:27:19.040 |
I think it's in the tens of megabytes, so it's orders of magnitude smaller 04:27:32.080 |
The cores are many and they all work in parallel all of these cores 04:27:36.600 |
So when you do a matrix multiplication, what happens is this 04:27:40.120 |
You have the matrix that you are trying to multiply in the high bandwidth memory 04:27:44.820 |
The kernel that manages this matrix multiplication 04:27:50.120 |
Which is a CUDA kernel in case you are using an Nvidia 04:27:53.240 |
GPU will copy for example the first part of the matrix from the high bandwidth memory to the local memory and 04:28:00.920 |
Each core will work with a part of this big matrix to compute this matrix multiplication in parallel 04:28:09.040 |
So each one is will be working with a smaller part of this matrix to calculate this this part in parallel 04:28:14.680 |
it's much easier to visualize with the summation because for example if you are summing two matrices like this matrix and this matrix and 04:28:21.880 |
You get this matrix as output. What happens if you divide it into four parts is that 04:28:28.920 |
The result of this part of the matrix only depends on these numbers and these numbers 04:28:33.620 |
So the first core can work with these two parts, the second core 04:28:37.960 |
Can work with these two parts 04:28:41.480 |
sum them up to produce this one the third core can work with these two parts and 04:28:47.840 |
Resulting in this part and then the last core can work on this part which will result in this part of the matrix 04:28:54.800 |
So as you can see, the matrix summation can be done in parallel by multiple cores, each working with a part of the matrix 04:29:01.120 |
So what happens when we do multi-head attention? 04:29:08.960 |
Because the heads are working in parallel, the first head needs to copy the first 04:29:16.560 |
128 dimensions of each query to the local memory of the GPU, which will then be 04:29:23.720 |
Accessed by the cores to compute these dot products 04:29:26.880 |
Meanwhile, the second head, at the same time, needs to copy the second 04:29:33.100 |
128 dimensions of each token to the local memory, and 04:29:38.680 |
Then needs to also copy, for each query, the second 04:29:42.640 |
128 dimensions from the high bandwidth memory to the local memory, so that the cores can work with them 04:29:49.880 |
Now what happens in the multi query attention paper. So this paper here what they say is that 04:29:55.680 |
The bottleneck of the computation of the attention is not in how many dot products we are doing 04:30:02.960 |
But how much it how much time it takes to copy the memory from the high bandwidth 04:30:08.200 |
Bandwidth memory to the local memory so that the cores can work with it 04:30:12.160 |
Why because in the GPU we have a lot of cores that are very fast at computing computation 04:30:18.240 |
But the GPU is not so fast at copying stuff around so the memory copying is very slow compared to how much 04:30:25.040 |
Computations it can perform. For example, let's open the 04:30:31.680 |
It's here you can see that the A100 has okay 80 gigabyte of memory in the high bandwidth memory 04:30:46.160 |
And a certain number of teraflops, operations per second, if you are working with 32-bit precision 04:30:50.080 |
But as you can see the GPU memory bandwidth is much slower than the number of operations it can do 04:30:57.060 |
Because tera floating-point operations per second means 04:31:05.640 |
Trillions of operations per second, so thousands of giga-operations per second, while the memory bandwidth here is only around two thousand gigabytes per second 04:31:18.640 |
So basically in in a lot of computations that we do in the GPU 04:31:22.320 |
The bottleneck is not how much compute we are using but how much data transfer is happening for this compute and as a matter of fact 04:31:29.560 |
Flash attention basically exploits this difference in computation and memory transfer 04:31:36.760 |
To reduce the memory transfer and redo computations, because it's faster to redo the computations than to copy the results around 04:31:49.480 |
So basically, we are willing to sacrifice computation 04:31:55.160 |
To reduce the data transfer. This is what we do with flash attention 04:31:59.400 |
This is also one of the reason we use the gradient checkpointing 04:32:02.800 |
So gradient checkpointing basically means that during the backward pass we redo some 04:32:06.720 |
computations instead of saving them because if we save them then we need to recopy them from the high bandwidth memory to the local 04:32:12.380 |
Memory, so it's faster to redo them instead of copying them the already processed one 04:32:21.180 |
So the wall-clock time, which means the total time to compute the attention 04:32:26.080 |
Actually is bottlenecked not by the number of dot products that we are doing but how much data transfer happens 04:32:31.800 |
So how to reduce the data transfer that we do when we do the multi head attention 04:32:41.280 |
So what will happen is this: imagine we stop 04:32:50.240 |
Having multiple heads also for the keys and values, so we don't have this part anymore 04:32:54.400 |
We only have multiple heads for the 04:33:02.480 |
Queries; we only have multi-head for the queries 04:33:06.840 |
So we don't have multi-head for the keys, or we have fewer heads for the keys 04:33:11.000 |
Imagine that we are in the extreme case, in which we only have one head for the keys and values 04:33:16.080 |
But we have multi-head for the queries. What will happen is that the first core will copy the first 04:33:21.080 |
128 dimensions of the queries from the high bandwidth memory to the local memory, and also the single head of the keys 04:33:31.720 |
And it will perform the computation. Meanwhile, the second head also needs to do its computation, in parallel 04:33:39.200 |
So how can it do it? It needs to copy the next 128 dimensions of the query 04:33:50.740 |
But it does not need to copy the dimensions of the keys, because it can reuse the ones already copied for the keys 04:33:59.440 |
So the heads of the queries are sharing some heads for the keys, so that they don't need to copy 04:34:06.080 |
These dimensions again for different heads, but can share the already copied ones 04:34:12.480 |
So this is the extreme case of having only one head for the keys, but we can have a group of heads 04:34:22.560 |
Instead, suppose we have eight heads for the query and then we have four heads for the keys 04:34:27.740 |
so the head number one and two for example for the query will share this head here and 04:34:33.720 |
Then the head number three and four will share this head here 04:34:38.720 |
So the head number one and two of the query will share this head here, so that the total amount of transfer for the keys is reduced 04:34:47.080 |
Then the head number three and the head number four will share a different 04:34:53.720 |
Head of the keys. But it is shared: as you can see, every two heads of the query we are sharing one head of the keys 04:35:07.080 |
128 dimensions in total for both of these heads 04:35:10.440 |
This reduces data transfer which speeds up the computation of the attention 04:35:15.120 |
And this is the reason we have here in the computation of the attention the projection for the WK and WV 04:35:23.000 |
Have fewer parameters, because we are trying to compress these tokens; the number of output features is 04:35:32.720 |
Proportional to the number of heads that we need for this projection 04:35:37.360 |
So for the keys, for example, if we have only two heads for the keys 04:35:49.960 |
Then every four heads of the query will share one head for the keys 04:35:54.300 |
Imagine we have four heads for the keys and values; then this number here will be four 04:35:59.000 |
And what will happen is that every two heads of the query will be using one key/value head, and this output size will become 512 04:36:06.640 |
Every two head of the query will share one head of the keys. So the total data transfer is reduced 04:36:13.240 |
So we speed up the computation of the attention 04:36:15.920 |
Of course, you may be wondering but this should also reduce the quality of the model because we have less parameters 04:36:22.120 |
We have less expressive power for the keys and values and it's true 04:36:26.040 |
So if you look at the paper, they say that in the multi query attention 04:36:30.300 |
It reduces the quality of the model, but not much so it's something that we can afford to lose 04:36:39.080 |
Let's check the grouped query attention paper, which is this one 04:36:44.560 |
So in the multi query attention, you have one head for the keys and values 04:36:50.120 |
Which is shared for all the heads of the queries in the group query attention 04:36:54.480 |
We have a group of heads for the queries sharing one head of the key 04:37:01.280 |
So when you have multi-query attention, you have only one head here for the keys and values, shared by all the query heads 04:37:07.440 |
When you have grouped query attention, you have multiple heads 04:37:11.160 |
Of the queries sharing one head of the keys and values 04:37:19.260 |
They show that multi-query attention, which is only using one head for the keys and values, reduces the quality of the model quite a bit 04:37:24.720 |
a good compromise is between the full multi head attention and multi query attention is the group query attention which reduces 04:37:32.700 |
Slightly less the quality of the model, but still gives you this computational advantage of reducing the quantity of data transfer 04:37:42.440 |
Another advantage of grouped query attention is that you reduce the size of the KV cache, because as you remember 04:37:42.440 |
We have one KV cache for each layer, and in each of them we store every past token 04:37:47.480 |
So if we compress these tokens, the total amount of memory required for the KV cache is reduced, and 04:38:01.120 |
Actually, the KV cache is also one of the bottlenecks in today's language models 04:38:06.280 |
So we have these big language models that are like 70 billion parameters or whatever 04:38:12.940 |
But the problem in using them is often not even the GPU memory required just for storing the model 04:38:20.700 |
But the memory for storing this big KV cache, because you have to store each single token in each of the layers of the model 04:38:26.920 |
Which actually grows very fast if you have a lot of tokens 04:38:29.960 |
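As a rough back-of-the-envelope sketch of why fewer key/value heads shrink the KV cache: the model sizes below are made-up, illustrative numbers (not PaliGemma's actual configuration), and this counts a single batch element.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    # 2x because we store both keys and values for every layer.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

# Hypothetical 32-layer model, 8192-token context, head_dim = 128, fp16 values:
full_mha = kv_cache_bytes(32, 8192, num_kv_heads=32, head_dim=128)  # ~4.3 GB
gqa      = kv_cache_bytes(32, 8192, num_kv_heads=8,  head_dim=128)  # ~1.1 GB
```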
Okay. Now that we have seen how the group query attention works, we can proceed further 04:38:40.400 |
So the next part that we need is this beautiful thing called the rotary positional 04:38:45.560 |
Encodings that I will not explain right now. We I will explain them after 04:38:52.380 |
for now, we just consider them as a black box that adds some information encodes the information of 04:38:58.300 |
Position in the tokens and later we will see how it works 04:39:01.700 |
Let's implement the forward method. So the forward method is this one 04:39:07.140 |
So basically it takes the hidden states, which are the input to this attention layer 04:39:12.140 |
Which, inside the decoder layer, is the output of the first normalization 04:39:19.180 |
Then we have the attention mask, and the positions that we need to apply to each token, because we need to apply the positional 04:39:25.300 |
Encodings; and then the KV cache, in case we are using it. And now we will implement it 04:39:30.020 |
So the computation of the attention is the same as before 04:39:38.180 |
The first thing we do is we extract the batch size and 04:39:45.120 |
The length of the input sequence, because, as you remember, when we do token generation 04:39:50.220 |
During the prefilling the query length will be the whole input prompt 04:39:54.900 |
But then, during token generation, the query will only be one single token, because we want to 04:40:00.060 |
Generate only the last part of the attention matrix, so the last row; so we need only one query 04:40:05.580 |
But how can we have all the keys to attend to? Because we have something called the KV cache, which will store all the keys 04:40:11.580 |
So what we are computing here is the same as before 04:40:15.860 |
So we are converting the input sequence into query key and values and then we are splitting this 04:40:22.300 |
Embeddings into groups of dimensions based on how many heads we have for the query key and values 04:40:31.020 |
For the query, we will split it into numHeads number of groups 04:40:35.100 |
Each number or each group will have headDim number of dimensions and for the keys and values 04:40:41.420 |
We will have numKeyValueHeads number of groups and each group will have headDim number of dimensions to manage 04:40:48.620 |
Then we do this transposition so I can show you again. What does this transposition do? So let's do it 04:41:00.980 |
So the first part that we are doing here big up to the transposition is this one 04:41:06.740 |
So we are multiplying the input sequence with WQWK and WV and splitting these 04:41:15.780 |
So that each embedding is a group is a list of groups where each group is managing some dimensions 04:41:23.780 |
So now what we end up is basically a sequence of what? 04:41:28.100 |
Tokens, where each token is made up of groups, and each group is managing, for example, 128 dimensions 04:41:35.420 |
Then we use this transposition because we want to have at the first dimension the heads dimension 04:41:45.260 |
So instead of having a sequence of tokens where each token has groups of dimensions 04:41:51.300 |
We want a list of groups where each group is a head 04:41:55.420 |
Each head has some tokens how many equal to the sequence length and each token is a mini token 04:42:03.840 |
Which is the dimensions dedicated to that specific head. So the head number one will have 04:42:09.460 |
128 dimensions, the head number two will have the next group of 04:42:14.420 |
128 dimensions, etc., until the last one, which will have the last group of 128 dimensions 04:42:20.580 |
This allows us to compute the multi-head attention for this 04:42:26.180 |
Sequence, this sequence, this sequence and this sequence, all in parallel 04:42:31.620 |
Okay, and this is the meaning of this transposition 04:42:37.520 |
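A small sketch of what this view-and-transpose does to the shapes, reusing the example head counts from before:

```python
import torch

batch, seq_len, hidden_size = 1, 4, 1024
num_heads, num_kv_heads, head_dim = 8, 1, 128

query_states = torch.randn(batch, seq_len, num_heads * head_dim)
key_states = torch.randn(batch, seq_len, num_kv_heads * head_dim)

# [batch, seq_len, num_heads * head_dim] -> [batch, num_heads, seq_len, head_dim]
query_states = query_states.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
# [batch, seq_len, num_kv_heads * head_dim] -> [batch, num_kv_heads, seq_len, head_dim]
key_states = key_states.view(batch, seq_len, num_kv_heads, head_dim).transpose(1, 2)

print(query_states.shape)  # torch.Size([1, 8, 4, 128])
print(key_states.shape)    # torch.Size([1, 1, 4, 128])
```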
Transpose the next thing that we do is we apply the rotary positional encodings and now 04:42:44.020 |
We didn't talk about the rotary positional encodings and we will talk about later 04:42:48.540 |
But for now, you need to think that we are not changing the shape of these keys and queries and values 04:42:57.100 |
modifying them by adding some information that 04:43:00.540 |
Encodes their position and it will be done by this method called apply rotary positional embedding 04:43:10.060 |
just think that in the query and the keys we have encoded some information which will be leveraged by the attention mechanism to 04:43:18.020 |
Relate tokens to each other differently based on their position basically, but we will see that later. So 04:43:24.100 |
Suppose that we have already encoded the positional 04:43:27.200 |
Information. So now we need to as you remember when we do work with the KV cache 04:43:32.460 |
we pass only one single token as input to the layers of the 04:43:38.620 |
Transformer and this single token is added to the KV cache in the keys and the values cache of this 04:43:47.020 |
Particular layer. Then we retrieve the content of this KV cache, which includes the newly added token and all the previously saved tokens 04:43:56.940 |
And we use the output of this KV cache to calculate the attention. So let's implement this KV cache 04:44:02.980 |
so it's very simple because it's only one method to implement which basically will just take the 04:44:08.500 |
Single token that we are sending in which is this key states will add it to the key cache 04:44:13.940 |
will take this value states which is one single token add it to the value cache and then retrieve all the content of the cache as 04:44:20.860 |
Output so all the past token it has seen plus the current one 04:44:25.060 |
So let's implement it and we go to the beginning of the file 04:44:38.740 |
So we create the constructor. As you can see, it is a kind of buffer that includes one buffer for each layer of 04:44:44.940 |
The model, one for the keys and one for the values 04:44:48.980 |
We also have this helper method that tells us how many items the KV cache currently stores 04:44:56.780 |
So if this KV cache does not contain any item we say zero if it contains something then we return 04:45:03.060 |
The number of items it stores. As you remember, when we add something to the KV cache we are adding 04:45:10.100 |
This tensor here, which is the key value states and value states which are tensors of this shape 04:45:17.700 |
So batch size and number of heads sequence length and head dimension 04:45:21.540 |
Which means that the sequence length is the second last dimension. So that's why 04:45:27.700 |
We return the second last dimensions to retrieve the sequence lengths currently stored in the KV cache 04:45:33.060 |
We then implement the update method, which is also very simple, and I added some comments 04:45:41.900 |
So basically this will add the content of these key states and value states to the KV cache of this layer 04:45:49.620 |
And then it will return whatever is stored for this layer 04:45:53.820 |
So if we have never added anything to the KV cache of this layer, then we create it: we basically append these tensors 04:46:00.900 |
Because we have nothing else to concatenate them with 04:46:04.660 |
Otherwise, if we already have some tokens in the key cache and the value cache of this particular layer 04:46:11.540 |
Then we concatenate whatever is already present with the newly incoming token. Along which dimension? Along the sequence dimension, and the sequence dimension 04:46:19.620 |
We saw before is the dimension -2. That's why we concatenate them along the dimension -2 04:46:24.960 |
so after concatenating them we retrieve all the content of the 04:46:29.340 |
K and V cache and return it for the current layer and this is what is happening here 04:46:35.340 |
Here so we add this incoming key values and key states and value states to the KV cache 04:46:43.420 |
Then we retrieve them and we use them to compute the attention 04:46:46.900 |
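Putting it together, a minimal sketch of this KVCache class as just described:

```python
from typing import List, Tuple
import torch

class KVCache:
    # Minimal sketch: one key buffer and one value buffer per decoder layer.
    def __init__(self):
        self.key_cache: List[torch.Tensor] = []
        self.value_cache: List[torch.Tensor] = []

    def num_items(self) -> int:
        if len(self.key_cache) == 0:
            return 0
        # Shape is [batch, num_kv_heads, seq_len, head_dim], so seq_len is dimension -2.
        return self.key_cache[0].shape[-2]

    def update(self, key_states, value_states, layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        if len(self.key_cache) <= layer_idx:
            # First time we see this layer: just store the incoming tensors.
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:
            # Otherwise concatenate along the sequence dimension (-2).
            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```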
Now you need to remember that when we do use the KV cache 04:46:50.700 |
There are two phases when working with the model with the KV cache 04:46:54.700 |
There is one part called the prefilling, in which we have the prompt; the prompt in our case will be the image tokens plus 04:47:00.640 |
The user prompt, so what the user wants the model to do with this image 04:47:05.920 |
It will be a list of tokens. So this key states and this value states will be a list of tokens 04:47:12.220 |
So they will be all added to the cache for the first time because initially the cache will be empty and will be retrieved here 04:47:18.380 |
When we do token generation, we use the last token output by the model and 04:47:26.660 |
But we always retrieve all the content of the KV cache to compute the attention, because each query needs to attend to all the past tokens 04:47:35.620 |
It needs to attend all the past keys which are then used to compute the weighted sum using the values 04:47:42.760 |
Okay, what is the next part of the computation of the attention? Here 04:47:55.400 |
Now we need this method called repeat_kv, which basically will repeat the heads 04:48:03.560 |
Of the keys and values that are missing to match the heads of the query 04:48:11.880 |
Um, okay, let me explain it with the iPad because it's much easier to draw than to explain by words. So let's go here 04:48:23.980 |
Okay. So what happens with this repeat method is that we have the projection 04:48:30.160 |
Through WK and WV of the token that results in a smaller token 04:48:36.680 |
Which gives us some benefit from the KV cache point of view for example 04:48:40.500 |
But to compute the attention, the heads need to be shared 04:48:45.360 |
Each query head needs to share a key head with other query heads 04:48:53.760 |
The first two heads of the query need to share one head for the keys 04:48:57.920 |
Then the second two heads of the query need to share one head for the keys 04:49:05.360 |
We repeat this because we are working with the naive implementation of the attention, which does not really 04:49:12.040 |
Benefit from this optimization. So what we do is basically just repeat the missing heads, as 04:49:18.940 |
You can see here. So we take the heads that are missing and we just repeat them to match the heads of the query 04:49:31.580 |
So that it's like each query head has its own head also for the keys 04:49:37.480 |
This is because actually we are not creating a custom CUDA kernel for the computation of the attention 04:49:43.240 |
So we repeat it and we just pretend like the grouped query attention never happened 04:49:50.760 |
If you use a flash attention flash attention actually leverages the reduced number of heads of the keys and values to optimize the computation 04:50:00.560 |
So basically we are kind of reversing the effect of grouped query attention when calculating the attention because we don't have this 04:50:07.440 |
Custom CUDA kernel that can leverage this by not copying the missing heads 04:50:16.680 |
So let's implement this method as well: it will just repeat the heads that are missing for the keys and values 04:50:16.680 |
As you can see if we have a tensor and we know that this tensor has the following shape 04:50:32.920 |
So the batch the number of heads the sequence length and the head dimension 04:50:36.840 |
If we only need to repeat it once then we just return it because we don't have to repeat anything 04:50:41.720 |
Otherwise, we introduce a new dimension, which is how many times we want to repeat this number of heads, and then 04:50:49.180 |
We do this reshaping, which will basically repeat this number of heads that many times 04:50:57.040 |
Actually, the repetition is done by the expand method here. So we introduce a new dimension here 04:51:02.640 |
Which is the number of repetitions, and then we expand it. This expansion basically repeats 04:51:09.440 |
This content here for each of the n_rep heads 04:51:15.540 |
So basically we are repeating whatever comes after these two dimensions this number of times 04:51:22.680 |
and then we remove this helper dimension that we have created the nrep dimension that we only created to repeat the number of heads and 04:51:30.680 |
How do we do it? We must multiply the number of repetitions that we need with the number of key value heads 04:51:37.320 |
So at the output of this method the number of heads that you will have is the same as the number of heads of the query 04:51:45.920 |
So now these key states and value states will have the same number of heads as the query 04:51:51.400 |
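A minimal sketch of this repeat_kv helper, following the unsqueeze / expand / reshape steps just described:

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # [batch, num_kv_heads, seq_len, head_dim] -> [batch, num_kv_heads * n_rep, seq_len, head_dim]
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Add a repetition axis, expand it, then fold it into the heads dimension.
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)
```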
So now we can just compute the attention like we have always been doing so by doing the query 04:51:55.640 |
Multiplied by the transpose of the keys, divided by the square root of the head dimension, etc., etc. 04:52:05.880 |
So we compute the attention weights just with the standard formula: query multiplied by the transpose of the keys, divided by the square root of 04:52:12.520 |
The head dimension, which is the number of dimensions each head works with 04:52:25.360 |
Then we add the attention mask, which in our case will always be made of zeros, because we don't have any padding 04:52:30.440 |
so we don't need to mask anything and also during the prefilling we don't mask anything because 04:52:34.360 |
We always let the prompt, so the user text prompt, also attend to future tokens. Why? Because the PaliGemma authors 04:52:46.800 |
Decided that the user prompt, or the task prompt, does not need to be causal, because anyway 04:52:53.480 |
It will never be generated by the model; it will always be provided by the user 04:52:58.680 |
So we apply the softmax and then the dropout, but the dropout we never use, so this stuff here is very simple 04:53:08.480 |
The softmax is applied row by row; then we apply the dropout, but the dropout is always zero, and, as you know 04:53:13.380 |
The dropout is only applied during training, so just ignore it, like it's not there 04:53:17.960 |
Then the output of the multi head attention is multiplied by the value states 04:53:24.160 |
So these attention weights are multiplied by the value matrix, which will result in 04:53:35.920 |
an aggregation of previous tokens based on the 04:53:40.440 |
Score defined in the attention matrix. So if you want to visualize it again, I can show it to you again. So let's go here 04:53:47.640 |
When we do the multiplication with the V which is here 04:53:56.240 |
Let's say this one here is a contextualized token and that will include information about three tokens. I love pepperoni and 04:54:03.640 |
It will be a weighted sum of these three tokens 04:54:08.240 |
So I love pepperoni based on the following weights 04:54:11.440 |
So basically the token I will contribute to 20% of information the token love will contribute to 40% of information 04:54:18.640 |
The token pepperoni will contribute 40% of information and the last token will not contribute any information because it has been masked out 04:54:25.880 |
So this is what happens when you multiply the V that you are doing a weighted sum using the attention weights as weights 04:54:35.840 |
Then what else we need to do we need to check okay the output shape and that's fine I can do that so 04:54:51.360 |
So we transpose back to have again the sequence length as the second dimension then the num heads as the third dimension 04:55:01.000 |
We concatenate all the heads together, just like we saw before, so now each token is back to the hidden size 04:55:08.400 |
Dimension, where this hidden size is the concatenation of the output of each head 04:55:13.740 |
But if you just concatenate the output of these heads, then each embedding will just be an 04:55:21.800 |
Independent calculation of each head concatenated together 04:55:25.640 |
So we need some kind of mixing mechanism and this mixing mechanism is given by WO which will mix all these 04:55:32.880 |
Dimensions with each other, so that the result of each head is mixed with the others through this WO projection 04:55:42.400 |
This way, the resulting token from this multi-head attention is not just a concatenation of multiple independent heads 04:55:49.520 |
But something that is also mixing the results of these independent heads 04:55:54.600 |
And then we return the result of this multi-head attention 04:55:59.240 |
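A short sketch of this last step, reusing the example shapes from the earlier sketches (the tensor and projection names are assumptions):

```python
import torch
from torch import nn

batch, seq_len, num_heads, head_dim = 1, 4, 8, 128
o_proj = nn.Linear(num_heads * head_dim, num_heads * head_dim, bias=False)  # the WO projection

attn_output = torch.randn(batch, num_heads, seq_len, head_dim)
attn_output = attn_output.transpose(1, 2).contiguous()                # [batch, seq_len, num_heads, head_dim]
attn_output = attn_output.view(batch, seq_len, num_heads * head_dim)  # concatenate the heads
attn_output = o_proj(attn_output)                                     # WO mixes the independent head outputs together
```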
Now one thing that we have considered as a black box so far is the rotary positional encoding 04:56:06.800 |
we are encoding somehow the positional encodings in these queries and keys and then the 04:56:13.040 |
Multi head attention will leverage it now. It's time to expand on that and understand how it works. So let's do it 04:56:20.320 |
All right. So let's talk about positional encoding guys 04:56:23.800 |
so traditionally we are used to work with the 04:56:27.720 |
Positional encodings applied directly at the entrance of the transformer, which means that we take some embeddings 04:56:34.400 |
So we have our tokens, which indicate the position of each token in the vocabulary 04:56:40.180 |
We convert them into embeddings using the embedding layer, which is this stuff here, and then we add some 04:56:50.020 |
Vectors to these embeddings that encode the position information of each token, because otherwise the model has no 04:56:56.200 |
Notion of position. As you saw before, each head just does a dot product of two tokens, and 04:57:04.320 |
If the position information is not encoded in these two tokens that the dot product can only access the embeddings 04:57:10.840 |
So it does not have any notion of which token comes first and which comes later 04:57:16.180 |
So to encode this information, we basically traditionally we are used to add a positional encoding here to the embeddings of each 04:57:24.080 |
Token and so that the embeddings basically encode the information of the position in the original transformer paper. They proposed this 04:57:31.540 |
sinusoidal positional encodings which are also known as absolute positional encodings because they encode the absolute position in the 04:57:39.240 |
Inside each token. So the token number one will have some dimensions some vector that will encode the position number one 04:57:45.980 |
The token number five in the sentence will have the position number five added to it, etc, etc 04:57:51.060 |
What we use in most language models nowadays is the rotary positional encodings 04:57:57.580 |
Which are in the family of the relative positional encodings and they work as follows. So let's open the paper 04:58:03.420 |
They were introduced in this paper called "RoFormer: Enhanced Transformer with Rotary Position Embedding" 04:58:15.820 |
The idea of rotary positional encodings is that we do not add them directly to the embedding of each token 04:58:22.420 |
so that each token encodes the information of its position, but they 04:58:26.060 |
modify the attention mechanism in such a way that the attention mechanism takes into 04:58:32.100 |
Consideration the position of the tokens to relate them differently based on their position. Let's see how they did 04:58:42.580 |
We have this multi-head attention mechanism that uses the dot product to relate tokens to each other. Can we find an 04:58:52.780 |
Encoding of the embedding vectors of tokens such that 04:58:58.080 |
When we do the dot product, which is an inner product. So this sign here means the inner product 04:59:03.980 |
So can we find an encoding for the token called FQ for the query and FK for the keys? 04:59:11.380 |
that encodes the position information inside the embedding XM for the query and 04:59:17.940 |
XN for the keys such that when we do the dot product 04:59:24.580 |
this dot product, the output of this dot product 04:59:27.140 |
Only depends on the embedding of the first token the embedding of the second token and the relative distance between them 04:59:35.120 |
So that's why they are called relative positional encodings because they depend the dot product is modified 04:59:40.660 |
so the attention mechanism is modified such that the dot product should depend only on the 04:59:46.660 |
Embedding of the first token on the embedding of the second token and the relative distance between them 04:59:56.120 |
information inside of our embedding such that this dot product will depend only on the embedding of the first 05:00:03.740 |
embedding of the second and the relative distance 05:00:12.740 |
They proposed the following solution for the 2D case. So imagine we have an embedding vector made up of only two dimensions 05:00:20.740 |
How to encode the information of the position in this two-dimensional vector as follows 05:00:34.260 |
Rotation matrix. So if you have ever worked with the rotation matrix like when you do rotation of a vector in 2D space 05:00:41.720 |
you basically multiply the vector by this matrix here where the 05:00:45.640 |
Argument of the cosine and the sine is a multiple of an angle that defines by how much you want to rotate this vector 05:00:58.380 |
Multiply the two dimensions of this vector by this matrix here 05:01:03.180 |
Which is we will see what is it and then this matrix here, which is a rotation matrix 05:01:08.700 |
Then basically we are rotating this vector by some angle defined by this 05:01:18.100 |
This will encode the information so the output of this operation 05:01:23.660 |
So the output of this operation will be a 2D vector which will encode the information of the position 05:01:32.460 |
Such that when we do the dot product of two vectors encoded like this, this dot product is guaranteed to be 05:01:40.740 |
To be a function of the embedding of the first vector, embedding of the second vector and the relative distance that was 05:01:52.980 |
The difference of the distance that was encoded into them 05:01:56.160 |
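To make the 2D case concrete, here is a minimal sketch (values are purely illustrative) that rotates a toy query and key by their position angles and checks that the dot product stays the same whenever the relative distance is the same:

```python
import math
import torch

def rot2d(angle: float) -> torch.Tensor:
    # Standard 2D rotation matrix for the given angle
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta = 0.5                      # base angle (hypothetical value)
q = torch.tensor([1.2, -0.7])    # toy 2D query embedding
k = torch.tensor([0.3, 0.9])     # toy 2D key embedding

# Rotate the query by m*theta and the key by n*theta, then take the dot product.
# Shifting both positions by the same offset leaves the result unchanged,
# i.e. the score depends only on the relative distance m - n.
for m, n in [(2, 5), (12, 15), (102, 105)]:   # all pairs have m - n = -3
    score = torch.dot(rot2d(m * theta) @ q, rot2d(n * theta) @ k)
    print(m, n, round(score.item(), 6))        # same value for all three pairs
```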
But usually when we have an embedding we do not have a 2D vector, 05:02:04.540 |
we have a multi-dimensional vector, maybe 1000 dimensions or 2000 dimensions. 05:02:09.940 |
So they take the 2D case to the general case, and in the general case they say: okay, 05:02:21.820 |
instead of using this 2D rotation matrix, we need to have this big rotation matrix here for a 05:02:27.900 |
d-dimensional vector. So here is the d-dimensional vector. 05:02:31.980 |
If you look at this matrix here, as you can see it is a sparse matrix, 05:02:38.820 |
which means that it is mostly made up of zeros and only some elements are non-zero. 05:02:44.580 |
So if we encode the information using this transformation here, by using this matrix here, 05:02:50.860 |
we will be doing a computation that will result in the following property being verified, 05:02:56.380 |
which is that when we do the dot product, this dot product will only depend on the 05:03:01.140 |
embedding of the first token, the embedding of the second token and the relative distance of the two positions that were encoded into them. 05:03:09.980 |
But we will be doing a lot of unnecessary computations, because a lot of zeros will be 05:03:14.780 |
multiplied by other elements, which will result in zero. So we are doing a lot of useless work: 05:03:23.940 |
if most of the elements are zeros and only some of them are non-zero, 05:03:29.780 |
that means that you are doing a lot of computations uselessly, 05:03:32.620 |
because you already know in advance that they are zeros. 05:03:37.220 |
So is there a better way to compute this encoding mechanism to reduce these unnecessary 05:03:44.860 |
computations, knowing already that most of the elements are zeros and also where those zeros should be? 05:03:50.660 |
Well, yes, it is possible, and they propose another formulation, 05:03:59.540 |
which basically says that if you want to encode the position information inside your tensor, inside your embedding, 05:04:08.900 |
you take your d-dimensional vector, where d can be 1000, 2000, 05:04:14.940 |
whatever it is — suppose in our case it's 1024 — 05:04:18.180 |
and you multiply it element-wise (so this is an element-wise multiplication) by another vector constructed as follows, 05:04:26.580 |
where the first element is a cosine of m theta 1 and the second element is again cosine of m theta 1, etc., 05:04:33.460 |
where m is the position that you want to encode in this vector, and the theta 1, theta 2, ... are 05:04:40.540 |
computed using the following formula. So they show it: 05:04:47.500 |
theta i is equal to 10,000 to the power of minus 2i 05:04:52.020 |
divided by d, where i goes from 0 to d divided by 2, if I remember correctly. 05:04:57.620 |
They show it here. Yeah, in the paper i goes from 1 to d divided by 2. 05:05:07.500 |
So basically what we are doing is we are multiplying each dimension of this vector by a cosine, 05:05:13.140 |
where the argument of the cosine is a multiple of a base theta 05:05:19.380 |
multiplied by the position of the token that we want to encode into this token, plus 05:05:26.340 |
the dimensions of this vector, but rotated and with changed signs, 05:05:32.940 |
multiplied element-wise with the sine of the same arguments that we use for the cosine. 05:05:42.540 |
And when you do the dot product of two vectors encoded like this, 05:05:47.060 |
what will happen is that the dot product is guaranteed — 05:05:50.980 |
the number that comes out of this dot product — 05:05:54.460 |
to depend only on the embedding of the first vector, 05:05:58.340 |
so the information that was encoded before adding the positional encoding, the embedding of the second vector, 05:06:04.360 |
so the information that was encoded in that vector before adding the positional encoding, and the relative distance. 05:06:12.980 |
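Written out, this efficient element-wise form (reconstructed here from the spoken description; the exact notation may differ slightly from the paper) encodes a d-dimensional vector x at position m as:

```latex
R^{d}_{\Theta,m}\, x =
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix}
\odot
\begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix}
+
\begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix}
\odot
\begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix},
\qquad \theta_i = 10000^{-2(i-1)/d},\; i \in \{1, \dots, d/2\}
```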
Moreover, the rotary positional encodings also have a 05:06:17.260 |
decaying effect based on the distance between two tokens, 05:06:21.260 |
which means that the dot product — as we know, the dot product is converted into a score by the 05:06:27.980 |
softmax, so it tells us how intense the relationship between two tokens is, 05:06:33.020 |
so the bigger the dot product, the more that token will contribute to the output; 05:06:41.140 |
each of the attention scores tells us how much information that token will contribute to the output contextualized embedding — 05:06:48.900 |
so with the rotary positional encoding, this dot product is modified in such a way 05:06:56.500 |
that the dot product will be high when two tokens are close, and as they move apart, 05:07:03.740 |
so as the distance between the two tokens for which we are doing the dot product grows, 05:07:08.940 |
the dot product will decay, it will decrease in magnitude. 05:07:13.820 |
So the output number will be smaller and smaller based on the relative distance between the two tokens, 05:07:19.740 |
and in the paper they give an upper bound on this value based on the relative distance between two tokens. 05:07:26.380 |
So, to recap: to encode the positional information of a token using 05:07:32.500 |
rotary positional encodings, we need to do the following computation, where we take the vector of the token, 05:07:39.380 |
we multiply it by a special vector of cosines constructed like this, plus another 05:07:45.380 |
vector of the token itself, but with its dimensions changed in position. 05:07:51.260 |
So first we create a special vector where we put first the second dimension of the vector, but with its sign changed, then the first 05:08:00.820 |
dimension, then the fourth dimension with its sign changed, then the third dimension, etc. 05:08:06.860 |
And then we multiply it by a sine vector constructed as follows, using the theta values 05:08:13.940 |
calculated according to this formula here, this one here. And 05:08:19.820 |
each of these sines and cosines is basically 05:08:24.180 |
working with an argument that is a multiple of this base theta multiplied by the position that we want to encode into this token. 05:08:33.460 |
And if you want to visualize it, in the rotary positional encoding paper 05:08:38.940 |
they also explain what the meaning of this rotary positional encoding is. 05:08:42.860 |
Basically, as you can see from this matrix here, 05:08:46.580 |
each pair of dimensions is being rotated by the same angle. 05:08:50.300 |
So basically we have a token that is made up of many dimensions, 05:08:55.780 |
and each pair of dimensions is getting rotated like a 2D vector. 05:09:00.500 |
So each two dimensions are considered like a two-dimensional vector, 05:09:05.340 |
which is getting rotated by an angle that is a multiple of the base angle, 05:09:10.460 |
a multiple with respect to the position that you want to encode. 05:09:15.020 |
And this is the meaning of the rotary positional encoding. So the rotary positional encodings, to recap again, 05:09:21.820 |
modify the attention mechanism in such a way that the attention score that is generated is dependent on the 05:09:29.580 |
relative distance between two tokens, and they also prove in the paper that this attention score 05:09:34.940 |
decays as the distance between the tokens grows. 05:09:37.940 |
Okay, now that we have seen how it works, let's code it. 05:09:43.940 |
And actually, in the code that we are going to write you will see that 05:09:46.660 |
I am going to use the HuggingFace implementation of the rotary positional encodings. 05:09:51.240 |
We will see that the rotary positional encoding implemented in the HuggingFace library is slightly different from the paper, 05:10:04.580 |
but according to the authors it results in the same computation. 05:10:12.020 |
I will also share the blog post in which they explain why they do it this way. 05:10:17.060 |
So there is a slight difference, but the idea is the same: 05:10:20.340 |
it will result in a slightly different calculation, but the effect is the same. So let's do it. 05:10:24.980 |
All right, let's implement this rotary positional encoding 05:10:28.180 |
So the first thing we need to create is this Gemma rotary positional embedding class. 05:10:33.060 |
We can add it, I think, here; it's the same, no problem. 05:10:41.460 |
Okay, so then we are passing some parameters: dim is the head dimension, because the 05:10:48.820 |
rotary positional encodings modify the attention mechanism, 05:10:51.720 |
and the attention mechanism is performed independently for each attention head, 05:10:56.420 |
so each head will have its own positional encoding applied to the tokens. 05:11:01.540 |
So this dim is set to the head dimension, so the number of dimensions managed by each head in the multi-head attention. 05:11:09.060 |
Then we have the max positional embeddings, which tells us 05:11:11.700 |
what is the maximum number of positions we can encode; 05:11:17.540 |
it's set to 8000 in the Gemma configuration here — it's initialized to 2000, but it will actually be overwritten. 05:11:23.480 |
And then we have the base parameter theta, which is set to 10000, as in the original paper. 05:11:41.220 |
Here, as you can see, it's 10000 to the power of minus 2i divided by d. So this stuff here. 05:11:49.540 |
Then we have this inverse frequency. This inverse frequency is just the formula you can see here, 05:11:55.380 |
so 10000 to the power of minus 2i divided by d, where i goes from... it's written here. 05:12:08.580 |
And so the formula we are using to calculate it is actually, I think, this one here: 05:12:13.940 |
10000 to the power of minus 2i divided by d. 05:12:24.660 |
It's raised to the power of minus something, and when you have a negative power, it means one over the same thing with the positive power. 05:12:38.880 |
Let me write it. When you have x to the power of minus 3, 05:12:45.440 |
it means 1 over x to the power of 3. So that's why you have 1 over. 05:12:54.160 |
And what is this something that we are raising 10000 to? 05:12:57.840 |
It's a list of numbers that goes from 0 to the dimension divided by 2, which is the i 05:13:05.280 |
divided by d, where d is the number of dimensions 05:13:09.760 |
of the vector to which we will apply the rotary positional encoding, which according to this formula here goes up to 05:13:20.880 |
d divided by 2, and d is the number of dimensions of the vector to which we apply the rotary positional encodings; in our case 05:13:26.800 |
it's equal to the head dimension, because each head will have its positional encodings applied to it. 05:13:32.480 |
We use this arange to generate a list of numbers from 0 to 05:13:37.200 |
d divided by 2; basically it goes from 0 to dim, skipping every 2. 05:13:43.040 |
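As a quick sketch (variable names and values are illustrative), the inverse frequencies described here can be precomputed like this:

```python
import torch

dim, base = 256, 10000.0   # head dimension and base theta (illustrative values)
# theta_i = base^(-2i/d) for i = 0, 2, 4, ..., dim - 2, i.e. one value per pair of dimensions
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
# inv_freq has dim // 2 entries
```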
What else do we need to do here? I believe we need to... let me check. 05:13:51.120 |
Okay, so now we can implement the forward method of this class. 05:13:58.880 |
So to calculate the rotary positional encodings — let me go back to the paper and then explain the 05:14:06.720 |
forward method. To apply the rotary positional encodings we need the vector itself, 05:14:14.800 |
and then we need to multiply each dimension by some cosine, and each dimension, 05:14:21.220 |
rotated and with its sign changed, by some sines 05:14:26.300 |
computed as follows. So given some positions, for each position m we can compute the cosine and the sine that will be 05:14:34.300 |
needed to multiply these vectors, and this is what we do in the forward method here: 05:14:39.420 |
we extract the cosines and the sines that will be applied to each token, 05:14:44.140 |
depending on the positions of these tokens. So each token will have a different position, 05:14:50.700 |
and this m parameter indicates the position of the token. 05:14:54.860 |
So for each m we can compute the cosines and the sines, and this is what we do in the forward method here. 05:15:00.220 |
So we take the inverse frequency and we add another 05:15:03.820 |
dimension, which I believe is for the batch dimension. 05:15:12.060 |
Then we disable the autocast: autocast in torch is for mixed precision, 05:15:16.700 |
and I don't want to go too much into the details of this stuff, but 05:15:21.500 |
mixed precision is basically this: when you train a model, 05:15:25.500 |
you don't always have to work with 32-bit floating point numbers, because most modern GPUs 05:15:31.580 |
also support working with 16-bit numbers, 05:15:34.940 |
which makes computations faster and also reduces the memory used by these computations. Of course, you lose a little bit of precision, 05:15:42.400 |
but the precision that you need for some operations is not necessary for others. So with autocast, torch 05:16:00.280 |
will use the smaller precision for the numbers when computing certain operations, and higher precision, 05:16:06.600 |
so 32-bit, when computing other operations, such that we never lose much precision overall. 05:16:16.280 |
Here, for the rotary positional encodings, we probably want to retain the full precision, so we disable autocast. 05:16:31.960 |
Then we are basically multiplying each frequency by each position that we want to encode, because as you can see from the paper 05:16:40.760 |
we need to multiply this m by the base frequency. We already have the base frequencies in this buffer, 05:16:48.820 |
so we are multiplying them by each m. So we are computing the arguments of these cosines and sines. 05:17:00.120 |
Why do we then duplicate these arguments for the cosines and sines? Because we have them for dim divided by two, 05:17:05.640 |
so for half the vector, but we need them for the entire vector. 05:17:10.520 |
And we are concatenating here. Now, this is actually different from what is done in the paper: 05:17:18.600 |
in the paper, we need to repeat each argument twice, once for each of two successive dimensions, 05:17:25.240 |
so for each pair of dimensions we need the same argument. 05:17:28.200 |
What we are doing here with the concatenation instead is taking theta 1, then theta 2, then 05:17:35.400 |
theta 3, then theta 4, and then we are repeating theta 1, theta 2, theta 3, theta 4 again, instead of doing theta 1, theta 1, 05:17:42.200 |
theta 2, theta 2, theta 3, theta 3. So the overall number of 05:17:46.200 |
arguments that we produce is the same, 05:17:51.480 |
but instead of being, like in the paper, theta 1, theta 1, theta 2, theta 2, theta 3, theta 3, theta 4, theta 4, and so on, 05:17:59.000 |
we are actually doing theta 1, theta 2, theta 3, ... and then repeating them. 05:18:07.720 |
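A minimal sketch of this forward computation (continuing the inv_freq sketch above; seq_len and the variable names are assumptions):

```python
import torch

seq_len = 16                                                 # example sequence length
position_ids = torch.arange(seq_len, dtype=torch.float32)    # the m values, one per token
freqs = torch.outer(position_ids, inv_freq)                  # [seq_len, dim // 2], entries m * theta_i
emb = torch.cat((freqs, freqs), dim=-1)                      # concatenated layout: theta_1..theta_{d/2}, theta_1..theta_{d/2}
cos, sin = emb.cos(), emb.sin()                              # one row of cosines/sines per position
```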
Why are we doing this? It's a very long story, but basically it looks like when HuggingFace converted the 05:18:16.200 |
weights of the model, for example Llama, from the original pre-trained checkpoint into the HuggingFace format, 05:18:27.800 |
they permuted the query projection and the key projection, which act on the embedding of the token. 05:18:37.580 |
And then, to accommodate for these permuted dimensions, 05:18:42.940 |
they do a slightly different computation for the rotary positional encodings. 05:18:48.840 |
So the overall effect that results from this computation is the same as in the original paper, 05:18:54.600 |
but they are doing this double permutation because one permutation was already done when doing the 05:19:00.740 |
conversion of the weights from the original pre-trained model to the HuggingFace format. 05:19:13.300 |
There is an issue in the HuggingFace repository by a user who asked why the positional encodings are done differently than in the paper, and the 05:19:20.180 |
HuggingFace authors explained that 05:19:24.020 |
when they converted the weights from the original model to the HuggingFace model, 05:19:30.020 |
they permuted the dimensions of wq and wk, and wq and wk are the projection matrices that are used to compute 05:19:37.940 |
the queries and the keys. We apply the rotary positional encodings to the queries and the keys, so we need to 05:19:43.140 |
do another permutation to counteract the effect of the first permutation. That's why the 05:19:49.700 |
computation we are doing does not reflect exactly the paper. 05:19:53.380 |
Let's go forward. So we have created the arguments of the cosine and the sine; 05:20:02.820 |
now let's see what we are doing with these arguments. When you call the cosine function on a 05:20:07.700 |
tensor, it will calculate the cosine element-wise, using the 05:20:11.140 |
values of this tensor as arguments for the cosine, and we do the same for the sine. 05:20:19.860 |
These cosines and sines are basically the two terms in the paper that we need for 05:20:25.860 |
applying the rotary positional encoding to each vector, and we have computed the cosine and the sine for each 05:20:31.700 |
position that we have in our sequence, so for each m that we have in our sequence. 05:20:37.620 |
So let me delete this stuff. Otherwise it remains in my notes forever 05:20:41.700 |
Let's go forward now. We need to implement another method, called apply rotary positional embedding, 05:20:48.680 |
which we include here and which I also copied from HuggingFace. 05:20:54.580 |
What it will do, basically, is first add another dimension, which is the head dimension, to these cosines and sines that we pre-computed. 05:21:01.880 |
Where did we pre-compute them? Well, we computed them here: 05:21:05.540 |
as you can see, we extract the cosines and the sines using the rotary positional encoding class that we have created before. 05:21:11.060 |
The value states tensor passed here is not really used; it's just there to extract the data type of the resulting tensor. 05:21:20.340 |
The position IDs provide the m parameter for each of the arguments of the cosine and the sine. 05:21:24.420 |
So we compute the cosines and the sines, and then we use them to apply the rotary positional encoding to the queries and the keys, 05:21:29.540 |
which will result in the output queries and keys with the rotary positional encoding applied. So now we are implementing this method here, 05:21:40.100 |
multiplying the dimensions of the query vector with the cosines, which is this part of the formula — 05:21:47.380 |
so as you can see, the vector multiplied by the cosine — 05:21:51.140 |
and then the rotated vector, so with its dimensions changed and the signs changed, multiplied by the sine, 05:21:58.260 |
which is this part of the formula here. For that we need to implement this method here, rotate half, 05:22:03.700 |
which again is not equal to what is in the paper, because we need to permute the dimensions, since the 05:22:11.620 |
original vectors, so the q and k, are permuted by 05:22:15.700 |
this query projection and this key projection. 05:22:20.980 |
This rotate half method basically will take the first half of the 05:22:24.740 |
embedding, and then it will take the second half of the embedding with its sign changed, I believe, here, 05:22:32.820 |
and it will concatenate them. It's different from the paper, because in the paper we need to create 05:22:39.540 |
minus x2, then x1, minus x4, x3, and so on. But here what we are doing is, 05:22:48.900 |
let me check, imagine the token is made up of 1024 dimensions: we are doing minus x513, 05:22:54.520 |
minus x514, minus x515, and so on, followed by x1, x2, etc. 05:23:09.780 |
But because of the permutation that was done to the wq and wk projections, the overall result is equivalent. 05:23:18.840 |
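For reference, the HuggingFace-style helpers described here look roughly like this sketch (check the actual repository for the exact signatures):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Build (-x2, x1), where x1 is the first half and x2 the second half of the last dimension
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim: int = 1):
    # Add the head dimension so cos/sin broadcast over [batch, heads, seq_len, head_dim]
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```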
Okay, now we have also implemented the rotary positional encodings, which encode the position information 05:23:24.940 |
right before the attention, so that the attention mechanism will reflect this encoded information inside each token. 05:23:36.840 |
What else do we need to build here? I believe we have everything. So let me do a very quick check. 05:23:47.720 |
Guys, I think now we can proceed to the inference code. So we need to use these methods, 05:23:53.240 |
these classes that we have built, to actually run inference on something. Let's do it. 05:23:57.320 |
All right, guys, let's go to the inference code. So let's create a new file called inference 05:24:06.520 |
I have also prepared the test image that I will be using to 05:24:11.160 |
run inference with the language model. I will ask the language model 05:24:13.640 |
what this building is, and the language model should tell me the name of this building. 05:24:24.280 |
Let's start by writing some code. I will copy a large amount of code, 05:24:29.640 |
because there is not much machine learning here. 05:24:34.120 |
So basically I'm using a library called fire. Let's import stuff first. 05:24:43.720 |
I'm importing PIL for the image loading, torch, and fire. Fire is a library that allows you to 05:24:51.640 |
pass the command line arguments of a script as parameters to a function, 05:24:59.080 |
so it will automatically do the parsing of the command line parameters. 05:25:02.780 |
And what I need to pass on the command line is the model path, 05:25:07.720 |
so where the weights of the model are, the prompt that we will be using to run the model, 05:25:12.520 |
the image that we will be using as the condition for this prompt, 05:25:15.880 |
the max number of tokens to generate, the temperature that we want to apply (we will see it later), 05:25:22.520 |
the top p (we will see it later), the do sample flag if we don't want to use the greedy strategy, 05:25:26.520 |
and a flag if we don't want to use CUDA or MPS (in case you are on a MacBook), 05:25:31.180 |
which forces us to use the CPU as the device for the computation of the neural network. 05:25:37.880 |
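A minimal sketch of this entry point (parameter names and defaults here are illustrative assumptions, not the exact script):

```python
import fire
import torch

def main(
    model_path: str,
    prompt: str,
    image_file_path: str,
    max_tokens_to_generate: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.9,
    do_sample: bool = False,
    only_cpu: bool = False,
):
    # Pick the device: CUDA if available, MPS on Apple Silicon, otherwise CPU
    device = "cpu"
    if not only_cpu:
        if torch.cuda.is_available():
            device = "cuda"
        elif torch.backends.mps.is_available():
            device = "mps"
    print("Device in use:", device)
    # ... load the model and processor, then run the test inference (see the sketches below)

if __name__ == "__main__":
    fire.Fire(main)   # fire turns CLI flags like --model_path ... --prompt "..." into arguments of main()
```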
The first thing that this method will do is print which device we use. 05:25:44.600 |
Then, with the load hugging face model method that we will implement later, given the path and the device, we will load the model with the HuggingFace weights 05:25:54.440 |
by copying each tensor into the right position; this works because we kept the names the same as in the HuggingFace model. 05:26:04.120 |
Then we basically take the input and we process it using this PaliGemma processor, which takes as input the tokenizer, and 05:26:18.520 |
transforms it into the input for our Gemma model, which will then decode it. 05:26:23.480 |
And we will do all of this in the test inference method. 05:26:27.720 |
So for now, we are just creating the PaliGemma processor and the model itself using this load hugging face model function, which we will create later. 05:26:34.760 |
Actually, no, let's do it now. So let's create a new file called utils 05:26:39.100 |
And this utils file needs to have the following code 05:26:44.440 |
So it's importing some stuff and then it's loading the HuggingFace model. It's loading the tokenizer, for which, as I said, we will be using the 05:26:54.920 |
HuggingFace one, so we will not be coding the tokenizer. 05:26:57.500 |
But the weights of the model we can load ourselves, and if you look at the HuggingFace model, 05:27:03.560 |
if you go to the repository of the model, you will see that each model is a list of 05:27:09.160 |
safetensors files; each of these safetensors files is actually a dictionary that contains part of the weights of the model. 05:27:16.120 |
You can actually click on this icon here and it will show you what each of them contains. 05:27:21.800 |
As you can see this one contains the multi-modal projector weight and bias 05:27:25.640 |
This one contains the vision tower embeddings, encoder layers one, layer two, layer three, etc, etc for all the layers 05:27:34.920 |
The wq projection, wk projection, wv projection, the weights and the bias 05:27:40.200 |
The weights, the bias of the layer normalization, the weight of the layer normalization, etc, etc 05:27:46.600 |
and each file contains a dictionary that contains some part of the weights of the model. 05:27:53.240 |
So what I'm doing here is I find all the safetensors files and then I load each of them into a dictionary, 05:27:59.020 |
And then I use this dictionary to load the state dict of our neural network 05:28:03.960 |
I also create the model using the config.json file that is present in the repository of the HuggingFace 05:28:12.200 |
model; every HuggingFace model has this config.json. 05:28:17.160 |
So we create the configuration that is used to create our model using this configuration file 05:28:22.840 |
and then I call tie weights which will copy the weights of the 05:28:27.000 |
Embedding layer to the language modeling head which is the linear layer that projects the embeddings into logits 05:28:33.800 |
And then we return the model and the tokenizer. So here there is no machine learning: I'm just loading the 05:28:39.720 |
weights of the model from the safetensors files, creating the 05:28:46.920 |
model using the configuration saved in config.json, 05:28:49.900 |
and then loading this state dict, which means that I am loading the weights into our class, 05:28:56.120 |
into this model class, and then I'm tying the weights and returning the model and the tokenizer. 05:29:02.920 |
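A minimal sketch of this loading function (the model and config class names are assumptions based on the classes built earlier in the video; only the safetensors and tokenizer calls are standard library APIs):

```python
import glob
import json
import os

from safetensors import safe_open
from transformers import AutoTokenizer

def load_hf_model(model_path: str, device: str):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Collect every tensor from all *.safetensors shards into one big state dict
    tensors = {}
    for file in glob.glob(os.path.join(model_path, "*.safetensors")):
        with safe_open(file, framework="pt", device="cpu") as f:
            for key in f.keys():
                tensors[key] = f.get_tensor(key)

    # Build the model from config.json, load the weights, and tie the embeddings
    with open(os.path.join(model_path, "config.json"), "r") as f:
        config = PaliGemmaConfig(**json.load(f))                    # assumed config class from the video
    model = PaliGemmaForConditionalGeneration(config).to(device)    # assumed model class from the video
    model.load_state_dict(tensors, strict=False)
    model.tie_weights()
    return model, tokenizer
```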
So now we can launch the inference. So we have the model and the tokenizer. We have created the processor 05:29:08.200 |
So we have initialized it then we need to launch the inference 05:29:11.160 |
Let's see how the inference works. So let's go back to here 05:29:16.360 |
This test inference method is also not so hard, but we need to do some preprocessing first. 05:29:25.460 |
So what we are doing is, first of all, we take this prompt, 05:29:32.280 |
which is text, and we pass it to the processor, and the processor will give us, 05:29:37.560 |
as you can see from processing_paligemma, the pixel values and the input IDs. 05:29:48.200 |
We get these values from the processor, so we need to create this function, which is also a simple helper function: 05:29:59.180 |
we load the image and we create the prompt, because the processor expects as input the text 05:30:05.640 |
as a list and the image as a list, even if it only works with a list of size one; 05:30:11.800 |
it takes the output of the processor, which is the input IDs, the attention mask and the 05:30:16.120 |
pixel values of the image, and then it moves each of them to the right device. 05:30:20.680 |
Move to device is also a simple function that moves each tensor to the device specified as a parameter 05:30:29.960 |
and then returns it. So now we have the input IDs, we have the attention mask, we have the pixel values. 05:30:40.040 |
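A minimal sketch of these two helpers (the processor's call signature is an assumption based on the PaliGemma processor built earlier in the video):

```python
from PIL import Image

def move_inputs_to_device(model_inputs: dict, device: str) -> dict:
    # Move every tensor returned by the processor to the chosen device
    return {k: v.to(device) for k, v in model_inputs.items()}

def get_model_inputs(processor, prompt: str, image_file_path: str, device: str) -> dict:
    image = Image.open(image_file_path)
    # The processor expects lists, even if we only ever pass one prompt and one image
    model_inputs = processor(text=[prompt], images=[image])
    return move_inputs_to_device(model_inputs, device)
```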
And then, based on how many tokens we need to generate — oh, I already removed the label — 05:30:46.920 |
based on how many tokens we want to generate, we launch the inference. 05:30:51.640 |
At the beginning these input IDs only include the prompt, 05:30:56.040 |
so the image tokens and the text tokens, without of course any output tokens, because we need to generate the output. 05:31:03.240 |
So what we are doing at the first iteration of this for loop is the prefilling: 05:31:08.440 |
the KV cache is empty, the input IDs contain the image 05:31:13.880 |
placeholders and the text tokens, the pixel values contain the image 05:31:19.080 |
loaded as a numpy array, and then there is the attention mask, which is just a list of ones because we are never working with padding here. 05:31:29.240 |
So the PaliGemma model, which is this one here, will merge the image features that we are passing: 05:31:36.520 |
it will run these pixel values through the image encoder, which will return some image features, and 05:31:46.040 |
we replace the image placeholder tokens with the image features extracted from the image encoder. So now we have a list of embeddings 05:31:53.660 |
where the first embeddings are the image embeddings, followed by the text embeddings, 05:31:58.060 |
and then we send it to the language model for decoding. So let's go back to the inference. 05:32:03.480 |
So the first iteration of this for loop is the prefilling 05:32:06.380 |
which means that the query, key and values have the same sequence length and they contain the tokens of the prompt. 05:32:13.080 |
The output of the prefilling is a list of logits, one for each position, 05:32:20.780 |
but we take only the last logit to predict the next token. 05:32:25.000 |
That's why we take out the logits and keep only the last one here: 05:32:29.160 |
this is the sequence dimension, and we take the last item in this sequence dimension. 05:32:36.040 |
So now let's go to the iPad actually because I want to explain how top p works 05:32:42.040 |
So let's go. Let me check if this is working. Yeah still working 05:32:55.160 |
Let's open a new page. So when you generate logits, 05:32:59.320 |
they basically correspond to a kind of distribution after you apply the softmax. 05:33:05.260 |
The logits form a vector — let me draw it here, it's a vector 05:33:11.240 |
where the number of dimensions is equal to the vocabulary size. 05:33:23.480 |
Each number corresponds to one token in the vocabulary, and it is an indication by the model of what it thinks should be the next token. 05:33:32.120 |
What can we do to understand what the next token is? 05:33:36.680 |
We need to apply the softmax, which will convert each of these numbers into a 05:33:42.920 |
probability score, so something that sums up to one and is always non-negative, 05:33:49.000 |
and we could take, for example, the highest one to understand what the next token is. 05:33:56.520 |
That is the greedy strategy; or we can use another sampling method. So this is a list of numbers, right, one for each position in the vocabulary. 05:34:02.700 |
For example, for the token "hello" the model could give some score, for the token "pizza" 05:34:09.880 |
or, I don't know, "car", it will give another score, etc. 05:34:16.760 |
With top p we do sampling, which means that we sort all of these numbers that we get — 05:34:21.720 |
so all of these numbers that we get, we sort them in decreasing order — 05:34:37.180 |
and we keep only the tokens with the highest score. So with top p, what we are doing with a top p of 0.9 is the following. 05:34:44.060 |
Suppose that to the token "hello" the model has assigned a probability 05:34:48.940 |
of, let's say, 0.2; this one is 0.5 and this one is 0.1. 05:34:53.580 |
Then we have some other token, let's say 0.05, and then another token that is 0.1. 05:34:59.660 |
Again, I don't know if these sum up to one, but okay; and then some other token and some other token. 05:35:05.260 |
We sort them in decreasing order, which means that we sort them like this: we take the 0.5 token, then "hello" with 0.2, 05:35:26.940 |
then 0.1, then something else that is 0.1, then something else that is 0.05, etc. 05:35:33.660 |
With a top p of, let's say... 0.9 is a little bit too much 05:35:39.740 |
for this example, so let's say a top p of 0.7: 05:35:53.200 |
we keep only the tokens such that their cumulative score reaches this value. 05:35:58.780 |
So we will take basically all the tokens such that, when we 05:36:03.820 |
sum up their probability scores, they sum up to this amount, and then we sample from them. 05:36:13.740 |
For example, with 0.7 we will consider only these two tokens. 05:36:20.780 |
To sample from them, we then rescale these numbers such that, again, they sum up to one. 05:36:26.540 |
So suppose that after normalizing again, these values change: 05:36:32.700 |
this will become, let's say, 0.75 and this will become 0.25. 05:36:41.600 |
So basically what will happen is that 75% of the time we will choose this token and 25% of the time 05:36:48.700 |
we will choose this token. This is the meaning of top p: among all the tokens, we sort them and keep the ones 05:36:56.780 |
whose cumulative probability score reaches this top p, 05:37:02.060 |
and then we sample from them just as if they were a distribution by themselves; 05:37:06.960 |
before sampling we need to normalize them again, because they need to be a proper distribution. 05:37:13.580 |
So this is what we do with top p; with greedy, instead, we just take the highest one 05:37:18.860 |
and that's it. But with top p we are actually only keeping the most likely tokens 05:37:26.940 |
to sample from, because for some of them the model is basically saying: don't use this token, because the probability score assigned to it 05:37:34.060 |
is very, very low, so why should we even consider it? That's why we use top p: 05:37:37.740 |
we only consider the most likely tokens chosen by the model, 05:37:41.340 |
so we don't introduce too much noise in the generation process. 05:37:50.540 |
So what we are doing here is sampling with top p if we decided to sample; 05:37:54.540 |
otherwise, we just take the one with the highest probability score, which is the greedy strategy, if we don't want to sample. 05:37:59.740 |
There is also this thing called temperature. So what is temperature? Temperature basically means that we divide 05:38:06.160 |
— as you can see here, we divide the logits by the temperature before applying the softmax. 05:38:17.020 |
Basically, what happens is that before we apply the softmax these numbers are just raw scores: 05:38:24.620 |
for example, this may be 10, this may be 7, this may be 5, this may be 2, this may be 1. 05:38:33.980 |
When we apply the — sorry, when we apply the temperature, we are 05:38:41.820 |
making the differences between them a little smaller. 05:38:45.580 |
So basically, if the model is giving us the following distribution, 05:38:49.980 |
telling us that this token is likely, but this one is very 05:38:53.100 |
much more likely, and this is less likely, and this is less likely, etc., 05:38:58.460 |
what we are trying to do with the temperature is reduce the gap between these peaks, 05:39:08.860 |
so that we are more likely to choose more diverse tokens. 05:39:12.300 |
Because with the temperature, what will happen is that "hello", instead of being chosen 25% of the time, 05:39:18.700 |
will be chosen, let's say, 33% of the time, and this will become 05:39:22.140 |
0.66. So basically we are introducing some noise in the choice that we make. 05:39:36.380 |
I know it's a little difficult to visualize, but basically with the temperature 05:39:40.060 |
we are trying to make it more likely to choose more diverse tokens, because we are reducing the gaps between the probability scores. 05:39:49.580 |
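As a minimal sketch (variable names such as outputs and temperature are assumptions): the logits of the last position are rescaled by the temperature before the softmax. Dividing by a temperature above 1 flattens the distribution (more diverse choices), while a temperature below 1 sharpens it towards the most likely token.

```python
next_token_logits = outputs["logits"][:, -1, :]                 # keep only the last position
probs = torch.softmax(next_token_logits / temperature, dim=-1)  # temperature-scaled distribution
```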
And then we do the sampling with top p, which 05:39:55.900 |
does what we saw before: we sort in descending order and then we sample from the distribution. 05:40:01.980 |
So actually, let's do it — let's do it one step at a time. 05:40:07.260 |
Sample top p, we can put it here. As you can see, we are sorting in descending order, 05:40:14.300 |
we are calculating the cumulative sum, we are only keeping the tokens whose cumulative sum is within the p parameter, 05:40:25.340 |
then we normalize again so that they sum up to one, because we have removed some tokens from this distribution, 05:40:32.860 |
and then we sample from this distribution using multinomial, and then we take the chosen token. 05:40:40.300 |
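A sketch of what such a helper could look like (this mirrors the common Llama-style implementation of top-p sampling; it is not necessarily the exact code in the video):

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    # Sort probabilities in descending order and compute the cumulative sum
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # Zero out tokens that lie outside the nucleus of cumulative mass p
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    # Renormalize so the kept tokens sum to one again, then sample
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)
    # Map back from sorted positions to the original vocabulary indices
    return torch.gather(probs_idx, -1, next_token)
```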
So we have applied the top p, so now we know what the next token is. We 05:40:46.300 |
take this token and we add it to this generated tokens array. 05:40:51.340 |
If the next token corresponds to the stop token, which is the end-of-sentence token, then we stop the generation. 05:41:00.300 |
And then we update these input IDs, as you can see, for the next iteration, because 05:41:08.860 |
at each inference step we use as query only the last predicted token. 05:41:16.300 |
So this is what we are doing here. At the second iteration of this for loop, 05:41:19.900 |
our input IDs will become one single token. 05:41:23.660 |
So at the first iteration we are doing the prefilling, and the input IDs are all the tokens of the prompt, 05:41:30.140 |
so the image tokens and the text tokens describing what we want to do with this image; 05:41:34.300 |
at the second iteration these input IDs will only be one token. 05:41:39.260 |
How can the model work with only one token? Because the model always has access to all the previous 05:41:45.500 |
keys and values, since they have been saved in the KV cache. So when we calculate the attention, the model will add this 05:41:52.540 |
single token to the KV cache, retrieve whatever is inside the KV cache, and use it to calculate the attention. 05:42:02.060 |
We keep increasing the attention mask by adding a one, because we want to attend to all the past tokens in the KV cache. 05:42:13.580 |
Usually you are used to thinking of padding as something that is present on the right, 05:42:16.860 |
but actually padding can also be done on the left. 05:42:19.020 |
Because on the left we don't have any padding tokens, 05:42:22.540 |
the attention mask is always made up of ones; and in my implementation I am never working with padding anyway. 05:42:29.580 |
We generate these tokens and we concatenate them together, because we save them into an array, 05:42:33.980 |
so we need to build a tensor, which is then sent to the tokenizer for decoding, and then we print the result. 05:42:42.140 |
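A minimal sketch of this generation loop (the KVCache class and the model's call signature are assumptions based on the classes built earlier in the video; sample_top_p is the helper sketched above):

```python
kv_cache = KVCache()
stop_token = processor.tokenizer.eos_token_id
generated_tokens = []

for _ in range(max_tokens_to_generate):
    outputs = model(
        input_ids=input_ids,              # full prompt at step 0 (prefill), one token afterwards
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        kv_cache=kv_cache,
    )
    kv_cache = outputs["kv_cache"]
    next_token_logits = outputs["logits"][:, -1, :]

    if do_sample:
        probs = torch.softmax(next_token_logits / temperature, dim=-1)
        next_token = sample_top_p(probs, top_p)
    else:
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)   # greedy strategy

    generated_tokens.append(next_token.squeeze(0))
    if next_token.item() == stop_token:
        break

    # From now on the query is only the token we just generated;
    # the KV cache already holds the keys/values of everything before it.
    input_ids = next_token
    attention_mask = torch.cat(
        [attention_mask, torch.ones((1, 1), device=input_ids.device)], dim=-1
    )

decoded = processor.tokenizer.decode(torch.cat(generated_tokens, dim=-1), skip_special_tokens=True)
print(prompt + decoded)
```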
And now we can finally run the generation, so the inference. I will copy the script that I have already prepared. 05:42:52.140 |
I have already saved the weights of the model, 05:42:54.620 |
so if you want to run this code, you need to download the repository of this model and clone it locally; 05:43:05.900 |
in the model path you give the path to where you saved it. You give the prompt — my prompt is "this building is", and the model should tell me 05:43:11.580 |
what this building is — and the image file is this building here; it's a building in Xi'an, China. 05:43:17.260 |
And then we use this temperature and this top p, and we do not sample, because 05:43:23.020 |
I want the greedy strategy, and I also want to use CUDA. We run the script like this. So now let's run it. 05:43:31.740 |
I think yeah should be no problem. So launch inference. Let's see 05:43:38.780 |
All right guys, so after I launched the inference my computer actually went a little crazy, 05:43:48.780 |
and then it worked — I don't know why, but my CUDA sometimes doesn't work and it blocks my whole computer. 05:43:54.460 |
So if you run the inference using the code that we have made, it should give this output: 05:44:00.220 |
So this building is the oldest clock tower in the world 05:44:03.180 |
I don't know if it's actually the oldest clock tower in the world, but this building is called the Zhonglou, 05:44:08.220 |
so it's the clock tower of Xi'an, a very famous building, and it looks like the output is correct. 05:44:13.900 |
So thank you guys for watching this video. I know it has been a very very long journey 05:44:18.860 |
I had to do a lot of explanations, and I sometimes had to improvise while explaining, 05:44:24.780 |
so it is possible that there may be some 05:44:27.660 |
imprecisions in my way of explaining, because I don't have a transcript that I'm reading 05:44:32.060 |
for all of the things that I have talked about; 05:44:36.220 |
I just look at the code and try to come up with the right words to explain it, 05:44:41.420 |
and of course you cannot always find the right words immediately — 05:44:45.420 |
maybe you need to look at it for at least a minute to get the right words. 05:44:50.220 |
Hopefully at least 90% of the content is 05:44:52.780 |
correct, and the other 10% may have some noise. 05:44:56.220 |
I will try to clarify the things that I have not explained correctly in the comments or in the description of the video. 05:45:02.060 |
Thank you guys for watching this video. So please share it with your friends and 05:45:07.020 |
Like it if you like it and subscribe to my channel 05:45:11.260 |
A lot of people have asked me what is the best way to contribute economically, 05:45:15.600 |
to support me, but thankfully, thank God, I don't need any economic support for now. 05:45:22.780 |
If I ever needed it, I would be the first one to ask. 05:45:25.740 |
So if you want to help someone economically, there are many people in the world that you can help: 05:45:29.740 |
there are people in war areas, in Palestine, in Ukraine — you can help them economically. 05:45:34.880 |
But for me, I just need you guys to follow me and to share my video. This is the best way to help me out 05:45:40.620 |
Also, I work at a company called Writer, and my team is currently hiring. 05:45:44.620 |
We are looking for amazing researchers, and you can find out more about the open positions. 05:45:50.940 |
We train our own models. We have plenty of gpus 05:45:54.060 |
So if you are a researcher dealing with language models, or any other area of machine learning, feel free to 05:46:00.220 |
send your resume. So thank you guys and have a nice day.