
Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation


Chapters

0:00 Introduction
5:52 Contrastive Learning and CLIP
16:50 Numerical stability of the Softmax
23:00 SigLIP
26:30 Why a Contrastive Vision Encoder?
29:13 Vision Transformer
35:38 Coding SigLIP
54:25 Batch Normalization, Layer Normalization
65:28 Coding SigLIP (Encoder)
76:12 Coding SigLIP (FFN)
80:45 Multi-Head Attention (Coding + Explanation)
135:40 Coding SigLIP
138:30 PaliGemma Architecture review
141:19 PaliGemma input processor
160:56 Coding Gemma
163:44 Weight tying
166:20 Coding Gemma
188:54 KV-Cache (Explanation)
213:35 Coding Gemma
232:05 Image features projection
233:17 Coding Gemma
242:45 RMS Normalization
249:50 Gemma Decoder Layer
252:44 Gemma FFN (MLP)
256:02 Multi-Head Attention (Coding)
258:30 Grouped Query Attention
278:35 Multi-Head Attention (Coding)
283:26 KV-Cache (Coding)
287:44 Multi-Head Attention (Coding)
296:00 Rotary Positional Embedding
323:40 Inference code
332:50 Top-P Sampling
340:40 Inference code
343:40 Conclusion

Whisper Transcript

00:00:00.000 | Hello guys, welcome back to my channel today. We are going to code a visual language model from scratch
00:00:04.720 | Now, first of all, what do I mean by visual language model?
00:00:08.000 | And what do I mean by coding from scratch?
00:00:10.240 | The visual language model that we will be coding is called PaliGemma, and it's a visual language model that came out
00:00:16.640 | From Google around two months ago
00:00:18.960 | I'm talking about the weights; the paper came out around two weeks ago
00:00:22.960 | So we will be coding it from scratch meaning that we will be coding from scratch the vision encoder
00:00:29.360 | You can see this here. Okay the linear projection, which is just a linear
00:00:32.560 | Layer the language model itself
00:00:35.680 | So which is the transformer language model how to combine the embeddings of the image tokens with the text tokens
00:00:42.160 | And of course how to generate the output using this condition. So what is a visual language model?
00:00:48.160 | First of all, well visual language model is a language model that can extract information from an image
00:00:52.960 | So if we have an image like this, for example and a prompt like this, for example, where is the photographer resting?
00:00:59.120 | The visual language model can understand where this photographer is resting by looking at the image
00:01:04.640 | And generating a response in this case. The response is in a hammock under a tree on a tropical beach
00:01:10.080 | The topics of today basically are first of all, we will be talking about the vision transformer
00:01:15.760 | Which is the vision encoder that we'll be using to extract information from this image
00:01:19.760 | But this vision transformer has been trained in a particular way called contrastive learning
00:01:25.280 | So we will be talking about a lot about contrastive learning because I want to review not only what is contrastive learning
00:01:30.320 | But also the history of how it works
00:01:32.400 | So the first well-known model is CLIP and then it was transformed into SigLIP by Google
00:01:37.760 | So we will be seeing these two models
00:01:40.480 | Then we will be coding the language model itself
00:01:43.040 | So the Gemma language model, how to combine the embeddings of the vision model and the language model
00:01:49.600 | But this one we'll do it in code
00:01:52.640 | And we will be talking about the KV-Cache because we want to
00:01:55.600 | Use this language model for inference
00:01:58.320 | So we want to do it in an optimized way and the best way of course is to use the KV-Cache
00:02:03.200 | So we will be coding it from scratch
00:02:05.200 | Not only we will be coding it. I will explain step by step how it works
00:02:08.800 | The rotary positional encodings because we need them for the language model and the normalization layers because we have them in the vision model
00:02:15.760 | And also the language model. We will be seeing what is the batch normalization, the layer normalization and the rms normalization
00:02:21.520 | I will be explaining all the math behind them
00:02:23.520 | In this video i'm also using a slightly different approach at teaching let's say
00:02:28.640 | Which is by drawing so I will be drawing every single tensor operations that we'll be doing especially in the attention
00:02:34.800 | Mechanism because I want people to not only look at the code and hope they get something
00:02:39.520 | Like an idea of how it works
00:02:41.920 | But actually I want to show each single tensor how it's changing by drawing it from scratch
00:02:48.640 | I think this helps better visualize what happens in the transformer model, especially during the attention mechanism
00:02:54.340 | So we know what each view operation each reshape operation that we are doing to each tensor and also the matrix
00:03:01.360 | Multiplications that we are doing so we can visualize what happens to the tensors itself
00:03:05.680 | What are the prerequisites for watching this video?
00:03:09.120 | Well, you have a basic knowledge about the transformer. You don't have to be a master about it
00:03:14.880 | It's better if you have watched my previous video on it
00:03:16.960 | Which will give you the background knowledge to understand this video and you have a basic knowledge of neural networks
00:03:22.320 | So at least you know, what is a loss function, you know, what is a linear layer?
00:03:25.440 | And at least you know, what is backpropagation you don't need to know how it works or the mathematics behind it
00:03:32.560 | But at least you know that we train models using backpropagation
00:03:35.460 | Having said that guys, let's jump to work. So the first part I will be explaining is the vision transformer
00:03:44.160 | So this vision encoder, we will be seeing what is contrastive about it
00:03:47.840 | and we will be coding it and then we will move on to how to combine the
00:03:52.960 | Embeddings of the image tokens and the text tokens. The only part that we will not be coding is the tokenizer
00:03:59.700 | Because I believe it's a separate topic that deserves its own video. So hopefully I will make another video about it
00:04:05.760 | So let's start
00:04:08.160 | All right guys before we go deep into each of these topics
00:04:12.080 | Let me give you a little
00:04:14.800 | Speech actually, so we will be exploring a lot of topics like a lot of topics
00:04:20.800 | We will be reviewing for example each of the single
00:04:23.600 | Operations that we do in the attention mechanism and we will be looking at it from the code point of view
00:04:28.880 | But also from the concept point of view and from the tensor operations point of view
00:04:34.640 | There may be some topics that you are already familiar with and that's perfectly fine
00:04:39.120 | There are some others that you are not familiar with and that's also perfectly fine because I will be explaining each topic multiple times
00:04:45.760 | So for example, we will be
00:04:48.320 | Implementing the attention mechanism at least twice
00:04:50.960 | So if you don't understand it the first time along with the code, then you will have another time to
00:04:56.080 | Understand it and with a different explanation
00:04:59.520 | And the same more or less goes on with all the other topics. For example, we will be first introducing the
00:05:04.880 | Normalization in one part and then I will review again the normalization
00:05:09.140 | The positional encoding done in one way and then we will see another type of positional encoding
00:05:13.760 | So don't worry if you don't understand everything at the beginning because I will be reviewing anyway each topic multiple times
00:05:21.360 | The important thing is you don't give up
00:05:23.600 | So if there is some topic that I couldn't explain because of lack of time
00:05:27.200 | For example, I will not be explaining how convolutions work because there are plenty of videos on how convolutions work
00:05:32.480 | So you can pause the video, watch a five-minute video on how convolutions work and then come back to this video
00:05:38.400 | That's the best approach I recommend
00:05:40.560 | The second thing is: always write down all the code that I will be showing you, so write it
00:05:46.400 | Line by line character by character because that's the best way to learn. So now let's get started
00:05:52.880 | Let's start with the first part. So the first part we will be talking about is this contrastive vision encoder
00:05:58.400 | Which is something that takes as input an image and converts it into an embedding
00:06:03.700 | Actually a series of embedding. We will see one for each
00:06:07.360 | Block of pixels of this image. So basically our image will be
00:06:12.320 | Split into blocks of pixels like this into a grid and each of this grid will be converted into an embedding you can see here
00:06:22.640 | This embedding is a vector of a fixed size
00:06:25.840 | and that will be concatenated with the
00:06:29.040 | Tokens embeddings because as you know, each token is converted into what is known as an embedding
00:06:35.040 | Which is a vector of a fixed size. They will be concatenated and sent to the transformer which will basically attend to this
00:06:41.520 | Image tokens as a condition to generate the text. So this is called conditional generation
00:06:48.800 | But okay, we will explore all this stuff here
00:06:51.760 | Let's talk about this vision encoder now the vision encoder
00:06:55.200 | First we need to understand what is why it's called a contrastive vision encoder and to understand why it's contrastive
00:07:02.160 | We need to understand what is contrastive learning
00:07:04.240 | So let's go back to another slide, which is this one
00:07:08.720 | Let's go here
00:07:13.600 | Imagine for now, we will consider the image encoder as a black box and later
00:07:17.840 | We will transform this black box into something more concrete
00:07:20.740 | now imagine that you have
00:07:23.600 | You go to the internet and when you go on wikipedia
00:07:26.260 | You see an image and when you see an image there is always a description of what is inside that image
00:07:31.680 | If you use a crawler you can crawl all of these images with the corresponding descriptions
00:07:37.460 | This will produce a dataset of images along with their descriptions
00:07:42.560 | Now, for now, imagine we have a text encoder, which usually is a transformer model
00:07:50.400 | And then we have an image encoder, which in most of the cases is a vision transformer
00:07:55.300 | And for now, we consider them as black boxes
00:07:58.560 | So it's something that takes as input an image and produces
00:08:01.940 | An embedding representation of this image
00:08:07.040 | And if you feed a list of images, it produces a list of embeddings one corresponding to each image. What is this embedding?
00:08:13.920 | It's a vector that captures most of the information of this image
00:08:17.600 | And we do the same with this text encoder. So the text encoder is a transformer model that produces a series of embeddings. We will
00:08:24.240 | We'll see later
00:08:27.120 | But imagine you have this text encoder that given a text produces a single embedding of a single text
00:08:33.040 | But if you feed it a list of text it will produce a series of embeddings each corresponding to one single text
00:08:39.280 | now imagine
00:08:42.240 | The data set that we were talking about before which is the data set of images along with the corresponding descriptions
00:08:48.420 | So imagine we feed this data set of images along with the corresponding description to the image encoder and respectively to the text encoder
00:08:57.520 | It will produce a list of image embeddings and a list of text embeddings
00:09:02.580 | Now, what do we want these embeddings to be? Of course, we want the embedding
00:09:08.980 | Of the first image to be representative of that image
00:09:12.740 | So we want this embedding to capture most of the information of that image
00:09:16.500 | and of course, we want the embedding of the text number one to be
00:09:20.180 | A vector that captures most of the information about that text
00:09:26.560 | Moreover with contrastive learning we don't want only to capture information about the image or the text
00:09:33.200 | But we also want some properties and the property that we want from these embeddings is this
00:09:38.400 | We want the embedding of each image
00:09:42.000 | when its dot product with the
00:09:45.520 | Embedding of the corresponding text it should give a high value for this dot product
00:09:51.840 | And when you do the dot product of an image with a text that is not the corresponding one
00:09:56.880 | It should produce a low number for this dot product
00:09:59.520 | So basically with contrastive learning what we do we take a list of images
00:10:04.320 | We take a list of text which is the corresponding text one for each of these images
00:10:08.880 | So imagine that the image number one correspond to the text number one the image number two correspond to the text number two, etc
00:10:14.560 | etc, etc
00:10:16.400 | We encode them into a list of embeddings and then we want to train
00:10:20.800 | This model so this text encoder and this image encoder to produce embeddings in such a way
00:10:26.880 | That when the dot product of the image with its corresponding text is done
00:10:31.600 | It should produce a high value and when you do the dot product of an image with a not corresponding text
00:10:36.960 | For example i2 with text3 it should produce a low value
00:10:42.640 | What we can do is basically we take this text embeddings, which is a list of embeddings
00:10:47.520 | We take this image embeddings, which is a list of vectors
00:10:50.660 | We do all the possible combinations of dot products
00:10:53.680 | So the image number one with the text number one, image number one with the text number two, image number one with the text
00:10:58.800 | Number three, etc, etc
00:11:00.480 | Then we do the all the also for the text number one
00:11:03.520 | So the text number one with the image number one text number one with the image number two text number one with the image
00:11:08.320 | Number three, etc, etc
00:11:10.240 | And then we want to find a loss function that forces
00:11:13.520 | These dot products to be high so that each text with its corresponding image to be high
00:11:18.880 | While all the other possible combinations to be low in value
00:11:22.560 | And we do that basically by using what is known as a cross entropy loss. So
00:11:29.120 | To understand why we use cross entropy loss. We need to explore how language models are trained and we will do that very briefly
00:11:38.160 | To not get us confused. So when we train a language model, we do so using what is known as the next token prediction task
00:11:45.680 | Imagine we want to train a language model on the following sentence: "I love pepperoni pizza"
00:11:58.480 | How do we train such a language model? Well, we give a prompt to this language model for now
00:12:03.200 | Let's consider it as a black box. So the prompt is
00:12:08.240 | "I love pepperoni"
00:12:10.420 | We feed it to the language model
00:12:15.760 | The language model will produce a series of embeddings
00:12:18.580 | Which are then converted into logits. So what is the logits? The logits is a distribution. It's a vector
00:12:25.440 | that tells
00:12:27.200 | What is the score that the language model has assigned to what the next token should be?
00:12:32.560 | Among all the tokens in the vocabulary. So for example, imagine this first number here corresponds to the token "Hello"
00:12:43.120 | The second number here corresponds to the token, let's say, "pizza"
00:12:46.640 | The third corresponds to the token car the fourth
00:12:51.120 | Number to the token dog, etc, etc
00:12:54.800 | Which one we want to be the next token? Of course, we know that the next token is a pizza
00:12:59.680 | So we want the token number pizza to be high and all the other tokens to be low in value
00:13:04.480 | So we use the cross entropy loss basically to make sure that the next token is "pizza". So how do we do that? Basically the
00:13:13.040 | Language model will output a list of numbers and we force the language model
00:13:17.200 | To produce the following output. So pizza should be one and all the others should be zero
00:13:22.000 | To compare these two things
00:13:25.680 | This one should be a distribution
00:13:28.880 | So basically the cross entropy loss what it does it takes a vector it converts it into a distribution
00:13:34.900 | With the softmax function and then we compare it with a label and we force the output to be equal to the label
00:13:42.400 | This will change the language model
00:13:45.200 | To generate a distribution the next time after the training in such a way that the pizza is given a high number and all the others
00:13:52.320 | Are given a low number and this is exactly the same that we do here for contrastive learning
00:13:57.360 | So we can use the cross entropy loss
00:13:59.680 | To force for example in this column here only this number to have a high value and all the others to have a low value
00:14:06.320 | And for this row here
00:14:08.480 | Only this number to have a high value and all the other number in this
00:14:11.920 | Row to have a low value and for example for this row
00:14:14.560 | We want the second item to have a high value and all the others to have a low value, etc, etc
00:14:19.040 | And we do that with the cross entropy loss
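(As a rough illustration of this point, not taken from the video: a single next-token prediction step with PyTorch's cross-entropy loss. The tiny vocabulary and logit values are made up.)

```python
import torch
import torch.nn.functional as F

# Hypothetical 5-token vocabulary: ["Hello", "pizza", "car", "dog", "tree"]
logits = torch.tensor([[1.2, 3.5, -0.7, 0.1, 0.4]])  # scores the model assigned to each token
target = torch.tensor([1])                            # we want "pizza" (index 1) to be the next token

# cross_entropy applies the softmax to the logits internally and compares the
# resulting distribution with the label; training pushes the "pizza" logit up
loss = F.cross_entropy(logits, target)
print(loss)
```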
00:14:22.480 | Now here is the pseudo-code that they show in the
00:14:27.520 | CLIP paper on how to implement the CLIP training with contrastive loss
00:14:31.840 | So basically we have a list of images and a list of text
00:14:35.360 | We encode them and they will become a list of vectors called image vectors and text vectors here
00:14:42.080 | image embeddings and text embeddings
00:14:44.720 | We normalize them later. We will see why we normalize stuff
00:14:49.040 | But okay, it makes sure that we reduce the internal covariate shift, but for now ignore it
00:14:53.680 | Anyway, we normalize them later. We will talk about normalization
00:14:56.900 | We calculate all the possible dot products between these embeddings
00:15:01.520 | So the text embeddings and the image embeddings, so we basically generate this grid here
00:15:08.720 | We generate the labels the labels are what well for the first row
00:15:13.280 | We want the label the first item to be maximum for the second row the second item for the third row the third item
00:15:20.320 | And that's why the labels are arange(n)
00:15:22.800 | This is basically the function arange, which generates the numbers between zero and, in this case, n minus one
00:15:29.680 | So for the row number zero, we want the item number zero to be maximum for the row number one
00:15:35.600 | We want the item number one, etc, etc until the row number n minus one
00:15:38.880 | We want the n minus one item to be the maximum one
00:15:42.480 | Then we calculate the cross entropy loss between what is the output of the model
00:15:45.920 | So what are the numbers assigned by the model to each of these dot products and what we want?
00:15:50.560 | The maximum to be among these numbers. This is the labels
00:15:54.240 | And we do it by rows and by columns this one you can see here
00:16:00.720 | then we sum these
00:16:03.200 | Losses and we compute the average so we compute the average loss between all the rows and all the columns
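(A rough PyTorch sketch of the training step just described, mirroring the CLIP pseudo-code; the function name and the temperature value are placeholders, not the official implementation.)

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # image_embeds, text_embeds: [n, d]; row i of each belongs to the same image-text pair
    image_embeds = F.normalize(image_embeds, dim=-1)   # L2-normalize the embeddings
    text_embeds = F.normalize(text_embeds, dim=-1)

    # all possible dot products: logits[i, j] = image_i . text_j
    logits = image_embeds @ text_embeds.t() / temperature

    # for row i (and column i) the matching item is at index i
    labels = torch.arange(logits.size(0), device=logits.device)

    loss_images = F.cross_entropy(logits, labels)      # softmax over each row (image vs all texts)
    loss_texts = F.cross_entropy(logits.t(), labels)   # softmax over each column (text vs all images)
    return (loss_images + loss_texts) / 2              # average of the two losses
```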
00:16:10.480 | And this is how we do contrastive learning. Now, let's explore. What is the problem with CLIP?
00:16:16.320 | All right. So what is the problem with CLIP?
00:16:20.560 | Well, the problem with CLIP is very simple is that we are using the cross entropy loss
00:16:25.760 | And the cross entropy loss basically does a comparison between two distributions
00:16:32.160 | So in language model we compare the output logits which are transformed into distribution
00:16:38.080 | With the label so which item of this distribution we want to be the maximum one and we do the same here
00:16:43.440 | So we have this column
00:16:45.600 | We convert it into a distribution and we do it through a function called the softmax function
00:16:50.960 | So the softmax function basically it is a function that takes as input a vector and converts it into a distribution
00:16:57.860 | What does it mean? It means that when you have a vector like this, for example, it will be a list of numbers
00:17:04.960 | To be a distribution each of these numbers needs to be non-negative. So it needs to be
00:17:09.760 | Greater than or equal to zero and plus all of these numbers needs to sum up to one
00:17:15.600 | That's what a distribution is
00:17:17.760 | Of course
00:17:18.320 | The model will predict some numbers and it cannot force all the sum of these numbers to be one and it cannot force the numbers
00:17:24.640 | to be
00:17:26.320 | non-negative
00:17:27.440 | So we apply to the output of the model this function called the softmax
00:17:31.440 | Which transforms them into a distribution and then we can compare it with the labels
00:17:35.040 | So our label in the case for example for the first
00:17:37.840 | For the second row will be this
00:17:40.160 | So we want the first item to be zero the second item to be one and this one to be zero this one to be zero
00:17:45.200 | This one to be zero this one to be zero, but we need to apply the softmax to the output of the model
00:17:50.240 | now the softmax
00:17:52.960 | Function has a problem which is
00:17:55.040 | And we will see now
00:17:57.920 | This is the expression of the softmax: basically we take the output of the model and we
00:18:03.120 | exponentiate each item in the output vector, which could be a row or a column
00:18:08.240 | And after exponentiating we also divide them with the sum of all the other items
00:18:15.600 | So the exponential of all the other items
00:18:17.760 | So which means that we need to calculate first of all for each row the exponential of the item
00:18:23.840 | And then we need to divide by the sum of all the exponentials of all the other items including itself
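(Written out, the softmax of a vector x = (x_1, ..., x_N) described here is:)

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$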
00:18:28.800 | The problem is that we are using this exponential. The exponential is basically a function that grows very fast
00:18:36.400 | So if the argument of the exponential
00:18:38.660 | Grows the exponential will become huge
00:18:41.680 | And this is a problem for computers because in computers we store numbers using a fixed representation
00:18:48.480 | Which could be 16 bit or 32 bit which means that we cannot represent up to infinity
00:18:53.520 | But we can represent each number up to 2 to the power of n minus 1 basically if you don't have negative numbers
00:18:59.520 | So if the exponential is too big then our numbers will grow too much and it may not be represented by 32 bit
00:19:07.440 | And that's a problem. So we need to make this softmax function numerically stable
00:19:13.520 | So whenever you heard the term numerical stability in terms of computer science
00:19:17.360 | It means that we want to make sure that the number can be represented within 32 bits or 16 bits or whatever
00:19:23.040 | range we are using
00:19:25.440 | How to make this softmax numerically stable?
00:19:28.640 | Well, the trick is this. In the softmax, each item is exponentiated
00:19:34.740 | So we do the exponential of each item
00:19:39.040 | And then we divide it by this
00:19:41.680 | This denominator which is known as the normalization constant, which is the sum of all the
00:19:47.360 | Exponentials of all the other items in the vector
00:19:50.000 | Now as you know, this is a fraction
00:19:52.320 | So in a fraction you can multiply the numerator and the denominator by the same number without changing the fraction
00:19:57.200 | So we multiply by this constant called c
00:19:59.840 | Each number can be written as the exponentials of the logarithm of the number
00:20:06.160 | And this is because the exponential and the log are inverse functions
00:20:10.400 | So we can write c as follows. So the exponential of the log of c
00:20:14.480 | By using the properties of the exponential, which means that
00:20:21.280 | The product of two exponentials is equal to the exponential of the sum of the arguments
00:20:26.340 | We can write it like this
00:20:28.400 | And then we can bring this exponential inside the summation because of the distributive property of the product with respect to the sum
00:20:35.680 | After we bring it inside we can use the same
00:20:37.920 | Rule we applied above, which is that the product of two exponentials is equal to the exponential of the sum of the arguments
00:20:43.620 | Now what we notice is that if we subtract something from this exponential
00:20:49.300 | this log of c
00:20:52.480 | We can make the argument of the exponential smaller which may make it numerically stable
00:20:58.320 | So what we choose as this log of c, basically we choose the
00:21:02.700 | Negative maximum number in the array that we are normalizing using the softmax
00:21:07.440 | This way basically the argument of the exponential will decrease and it will be less likely that this exponential will
00:21:16.860 | Go to infinity
00:21:19.980 | Which makes it numerically stable
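(Putting the derivation together: choosing log c = -max(x) gives softmax(x)_i = exp(x_i - max(x)) / sum_j exp(x_j - max(x)). A minimal sketch of this numerically stable softmax, as an illustration rather than the video's code:)

```python
import torch

def stable_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtracting the maximum makes every exponent <= 0, so exp() never overflows,
    # while the value of the softmax stays exactly the same.
    x_max = x.max(dim=dim, keepdim=True).values
    exps = torch.exp(x - x_max)
    return exps / exps.sum(dim=dim, keepdim=True)
```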
00:21:22.460 | Now this basically means that to calculate the cross entropy loss for each of these
00:21:29.340 | columns and each of these rows
00:21:32.940 | First of all the model needs to output a list of
00:21:36.460 | Text embeddings and a list of image embeddings as you can see then we do all the possible dot products
00:21:42.460 | Then for each column first of all
00:21:45.260 | We need to find the maximum value in this column so that we can subtract it before calculating the softmax
00:21:51.120 | Then we need to apply the exponential to each of these items
00:21:54.780 | then we sum up all of this exponential to calculate the
00:21:59.160 | Normalization constant then we divide each of these numbers by this normalization constant
00:22:03.800 | so as you can see to apply the cross entropy loss involves a lot of computations and
00:22:09.960 | Also, it forces you to always have imagine you want to parallelize this operation
00:22:15.400 | Imagine that you want to distribute each row
00:22:19.080 | between different devices
00:22:21.640 | So this device here needs to have all the row in its memory because it needs to calculate this normalization constant
00:22:27.960 | So it has needs to have access to all of this row and if you want to do parallelize by column
00:22:33.800 | Then you need to have all the column
00:22:35.800 | In your memory because you need to calculate the first of all the maximum item then you need to calculate this normalization constant
00:22:41.960 | Then you need to normalize them so dividing by this normalization constant
00:22:45.240 | So it is involves a lot of computation
00:22:47.400 | But also it makes it difficult to parallelize because at any moment each device needs to have at least one full row or one full
00:22:53.960 | Column, which does not allow us to go to very big batch size
00:22:57.880 | And this is a problem. So if you look at the SigLIP paper, they note that
00:23:04.360 | Due to the asymmetry of the softmax loss, the normalization is also independently performed two times
00:23:10.600 | So first of all to make the softmax numerically stable, we need to go through each single vector calculate the maximum
00:23:17.020 | Then we need to calculate the softmax
00:23:19.160 | but then we also need to calculate the softmax by rows and then by columns why because this
00:23:25.800 | Matrix here is not symmetric. So as you can see
00:23:28.920 | This is image number one with all the text and this is
00:23:32.840 | Text number one with all the images and this item here is not equal to this item here
00:23:37.480 | Because this is image number one with the text number two, and this is image number two with the text number one
00:23:43.640 | Because it's not symmetric means that you need to calculate the softmax for each single rows
00:23:48.040 | And then you need to calculate it for each single column and then you can calculate the loss
00:23:52.840 | So the problem with CLIP is that it's very computationally expensive to calculate this contrastive loss
00:24:00.680 | That's why in the SigLIP paper they propose to replace the
00:24:05.080 | Cross entropy loss with the sigmoid loss
00:24:10.440 | So with SigLIP what we do is as follows
00:24:13.160 | Again, we have an image encoder that converts a list of images into a list of embeddings, one for each image
00:24:19.880 | Then we have list of text which convert each text into a list of embedding one for each text
00:24:25.160 | Then what we do
00:24:29.320 | We calculate this all the possible dot products
00:24:31.880 | So the image number one with the text number one image number two with text number two and also image number one with text
00:24:37.160 | Number two text number three text four text five blah blah. So all the possible dot products between all these embeddings
00:24:43.100 | then instead of treating the loss as a distribution over a row or a
00:24:49.560 | Column or a row
00:24:52.200 | So we don't say in this row in this column
00:24:55.160 | I want this item to be maximum or in this row. I want this item to be maximum
00:25:00.440 | We treat it as a binary
00:25:06.040 | Classification task using the sigmoid loss
00:25:09.720 | In which each of these dot products is treated independently from each other
00:25:15.400 | So this is considered a single binary classification task in which we say okay this item here should be one
00:25:21.880 | This item here should be zero. This item here should be zero. This item here should be zero independently of what are the other items
00:25:29.400 | This one here should be zero. This one should be here zero, etc, etc, and we can do that with the sigmoid function
00:25:35.480 | So as you can see, this is the expression of the sigmoid function
00:25:38.920 | It takes as input this value called z which will be the dot product of our vectors
00:25:45.000 | And the output of the sigmoid is this stuff here, which is a number between zero and one
00:25:51.160 | So what we can do is we take each of these dot products. We run it through a sigmoid
00:25:55.900 | And then we force the label to be one for corresponding
00:26:01.240 | Text and images and zero for not corresponding ones. So each of these dot products now becomes a
00:26:07.560 | independent binary classification task
00:26:10.120 | basically this allow us to
00:26:13.240 | Grow the batch size to millions of items and also to parallelize because we can put this block here into one device
00:26:20.600 | And it can calculate it independently from this other device because they do not need to calculate any normalization
00:26:27.400 | Constant for each item or the maximum item in each row or column because each of them is independent from the others
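(A rough sketch of the pairwise sigmoid loss being described; for simplicity it averages over all pairs and omits the learnable temperature and bias that the SigLIP paper uses, so treat it as an illustration rather than the paper's exact loss.)

```python
import torch
import torch.nn.functional as F

def sigmoid_pairwise_loss(image_embeds, text_embeds):
    # image_embeds, text_embeds: [n, d], assumed already L2-normalized
    logits = image_embeds @ text_embeds.t()              # [n, n] pairwise dot products

    # +1 on the diagonal (matching pairs), -1 everywhere else (non-matching pairs)
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1

    # each dot product is an independent binary classification: -log sigmoid(label * logit)
    return -F.logsigmoid(labels * logits).mean()
```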
00:26:34.360 | Now you may be wondering why are we even using a contrastive
00:26:40.060 | vision encoder
00:26:41.640 | I mean
00:26:41.960 | Why can't we just use an ordinary vision encoder that just takes an image and extracts some kind of embeddings that capture the information?
00:26:49.480 | Of this image why we want it to be
00:26:51.480 | contrastive
00:26:53.160 | because
00:26:54.280 | We want these embeddings to not only capture information about the image, but we want these embeddings to be
00:27:01.720 | Good representation that can be then contrasted or can be used along with text embeddings
00:27:09.100 | And this is exactly what we do in a vision language model. We extract some
00:27:13.800 | image
00:27:16.020 | Embeddings which are vectors representing we will see later a patch of the image
00:27:21.480 | So this you need to think of this image as being divided into a grid and this first
00:27:26.440 | second third four five six
00:27:28.920 | So we produce in this case, for example, nine embeddings which are nine vectors
00:27:33.800 | Each of them represents information about a patch of the image
00:27:37.080 | So we want these embeddings to not only be
00:27:42.200 | Representing the information of these patches, but also to be able to be contrasted with the text
00:27:48.520 | Which is what we do in a visual language model
00:27:50.360 | So we have some prompt and we kind of contrast it with the image embeddings to produce an output
00:27:57.560 | It is not really a contrastive learning in this case because we are using it as a condition
00:28:02.600 | We will see later how these things are merged
00:28:04.920 | But we want a vision encoder that is already trained to be used with the text, because it has a better
00:28:11.880 | Representation for the image for being used along with the text. That's why we use the contrastive vision encoder
00:28:18.360 | Also, we use them because they are cheaper to train, so
00:28:21.880 | Basically, to train a contrastive vision encoder
00:28:26.600 | You just need to crawl billions of images from the internet
00:28:30.360 | Each of them already has a kind of a description because you can for example in wikipedia
00:28:35.480 | You always have the description of each image, but also the internet when you have an image you always have the html alt text
00:28:42.520 | It's called
00:28:44.040 | Which is the alternative text that is displayed when the image is not shown
00:28:47.320 | So you always have access to some kind of description
00:28:49.980 | Now, of course this vision encoder may be noisy because we crawl stuff from the internet
00:28:55.400 | Which means that this stuff may not always be correct
00:28:58.280 | So sometimes you see a picture but the description displayed is not correct or maybe the crawler didn't get the correct information
00:29:04.920 | But because we train it on billions and billions and billions of images eventually it learns a good representation of this image
00:29:13.880 | So this vision encoder that we will be using is basically a vision transformer. So now let's talk about the vision transformer
00:29:21.020 | Let's talk about it here
00:29:24.600 | So the vision transformer is a transformer basically that was introduced in the paper "An Image is Worth 16x16 Words"
00:29:32.680 | In which basically they train a transformer as follows. So first of all,
00:29:39.960 | How does a transformer work?
00:29:43.320 | we will see later in detail what is the
00:29:45.640 | Attention mechanism, but for now, I just need you to remember that the transformer model is a sequence to sequence model
00:29:52.520 | which means that you feed it a sequence of embeddings and it outputs a sequence of
00:29:57.480 | contextualized embeddings
00:30:00.180 | What we do to encode an image with the vision transformer we take an image and we
00:30:07.240 | Split it into patches and in this case, for example, we can split into 16 patches
00:30:13.000 | So this is the first group of pixels. This is the second group of pixels
00:30:17.160 | This is the group of pixels on the bottom right of the image. This one is on the top right, etc, etc
00:30:23.400 | we extract
00:30:26.280 | Information about this patch using a convolution
00:30:29.020 | So when you run a convolution you can extract information about a group of pixels from the image
00:30:36.120 | And then for example, this one will produce this output
00:30:39.640 | This one the convolution of this patch will produce this output. The convolution of this patch will produce this output, etc, etc
00:30:46.520 | And then we flatten them. So we lose the positional information
00:30:50.300 | We just take we don't care if this four is the top right or the bottom left
00:30:55.800 | We just concatenate them one with each other
00:31:00.200 | We lose the two-dimensionality in this case basically, so we transform it into a sequence of
00:31:05.640 | patches instead of being a grid of patches
00:31:08.760 | Then we add this position information so we say that okay, this is the patch number one
00:31:15.320 | So, how do we do that?
00:31:16.680 | This patch basically the embedding of this patch that will be the result of this convolution will be a vector
00:31:22.600 | We add to this vector another vector that tells the model
00:31:27.800 | Hey, this is the patch number one and this is the patch number two, and this is the patch number three, etc, etc
00:31:32.920 | So we do that by adding so this plus operation you can see here
00:31:36.600 | and unlike the
00:31:38.920 | Vanilla transformer or the transformer model that we see for language models
00:31:42.600 | These positional encodings are not calculated using sinusoidal functions, but they are learned
00:31:48.040 | So they are vectors that get added always so the positional encoding number one always gets added to the top left
00:31:55.720 | Patch the positional number two always gets added to the second patch from the top left, etc, etc
00:32:02.040 | The positional encoding number 16 gets added always to the bottom right patch
00:32:06.680 | So the model
00:32:09.560 | Has kind of access to this to the 2d representation of the image
00:32:13.800 | So the model will learn basically that the patch number 16 is always on the bottom right and this is always on the top left
00:32:19.960 | We feed it to the transformer
00:32:22.200 | So this is a series of embeddings, because the sum of two embeddings is still an embedding
00:32:28.120 | We feed it to the transformer model for now
00:32:30.760 | Let's consider it as a black box and later when we code it, we will explore each layer of this transformer
00:32:35.500 | The transformer what it does it does the contextualization of these embeddings
00:32:40.680 | So at input we have this each series of embeddings each of them representing one single patch
00:32:47.640 | The output of the transformer through the attention mechanism will be a series of embeddings again
00:32:52.920 | But each of these embeddings is not only capturing information about itself, but also about other patches
00:32:58.680 | In language models, we do what is known as
00:33:02.440 | We use in the attention mechanism. We use what is known as the causal mask. So this first
00:33:08.280 | Embedding should be only capturing information only about itself the second one only
00:33:14.360 | About itself and the previous one the third
00:33:17.240 | About itself and the two previous one the fourth one about itself and the three previous one, etc
00:33:23.000 | This is what we do with the language models
00:33:27.880 | But with the vision transformers
00:33:30.940 | We don't care about
00:33:35.720 | The model being autoregressive, so we don't want these patches to only encode information about the previous patches, because in an image
00:33:43.240 | There is no autoregressiveness. So it's not like the patch number 16 of an image
00:33:48.920 | It depends only on the previous patches and the patch number one does not depend on any others
00:33:53.960 | Because imagine you have an image in which the sun is here or the light source is here
00:34:00.360 | Then this part here will be illuminated, but
00:34:05.320 | So the illumination here depends on what is coming after in the image
00:34:10.680 | So in the image, we don't have this autoregressive
00:34:13.180 | relationship
00:34:15.400 | While in text we do, because we write the text from left to right or from right to left
00:34:21.080 | But anyway, each word that we write depends on what we have written previously
00:34:25.400 | But this doesn't happen with image. So basically this contextualized embeddings
00:34:30.460 | They capture information about themselves, but also all the other embeddings
00:34:37.800 | We use this contextualized embedding to capture information about each patch
00:34:43.080 | But also how it is present in the image. That's why we want them to be contextualized
00:34:47.740 | So we want each patch to include information about its position, which is given by the positional encoding
00:34:53.480 | But also about what is surrounding this
00:34:55.880 | patch in the image
00:34:58.600 | By contextualizing them. So when we code it, this will be more clear for now. I just want you to get a
00:35:05.400 | Idea of what we are going to code. So we are going to code a model that will take an image will apply a convolution
00:35:13.020 | To extract a series of embeddings. You can see here. We will add a positional encoding to these ones
00:35:19.560 | Which are learned we will apply the attention mechanism
00:35:23.480 | Which will be a series of layers, actually, of the transformer model that will contextualize these embeddings
00:35:29.080 | And then we will use this contextualized embedding as input to the language model for decoding the output of the language model
00:35:35.240 | So let's finally start coding
00:35:37.240 | Now in this video I will be
00:35:40.920 | Using a slightly different approach, which is I will not be
00:35:43.960 | writing each line
00:35:45.560 | I will be copying each line and explaining it step by step because I want this video to be more about explanation than just
00:35:52.040 | Coding, because I want to use the code for explaining what happens under the hood
00:35:58.280 | So let's create our first file, which is the modeling
00:36:03.240 | Oops, I'm using Chinese
00:36:05.240 | Siglip.py
00:36:07.560 | And let's start by importing stuff which we need I don't need copilot
00:36:14.060 | And then we create our first class, which is the SigLIP config
00:36:19.100 | So, what is this basically we will be using this visual encoder and this visual encoder will have some
00:36:27.700 | Configurations. Why do we need a configuration class? Because PaliGemma comes in different sizes
00:36:33.620 | Let me put this one. Okay
00:36:36.740 | PaliGemma comes in different sizes
00:36:39.540 | Which means that each of these PaliGemma models has a different configuration for its vision encoder
00:36:46.660 | So let's see each of them
00:36:48.420 | The hidden size basically it's the size of the embedding vector of this vision transformer that we are going to encode
00:36:54.900 | the intermediate size is the
00:36:57.700 | Linear layer that we use the size of the linear layer that we use in the feed-forward network
00:37:02.340 | The number of hidden layers is the number of layers of this vision transformer
00:37:06.820 | The number of attention heads is the number of attention heads in the multi-head attention
00:37:10.500 | The number of channels is how many channels each image has, which is RGB
00:37:15.080 | The image size is because PaliGemma comes in, I remember, three sizes. So 224, 448 and
00:37:22.580 | 896, something like this
00:37:26.180 | The default configuration that we put here is for PaliGemma 224
00:37:29.960 | Which supports of course images of size 224. So if you provide any image, it first gets resized into
00:37:36.980 | 224 by 224
00:37:39.840 | The size of each patch. So what is the number?
00:37:42.980 | Each image will be divided into patches. Each patch will be 16 by 16
00:37:48.980 | And this layer norm eps is a
00:37:52.260 | Parameter for the layer normalization. We will see it later
00:37:54.420 | The attention dropout is another parameter that we will not be using in the attention calculation
00:37:58.900 | Basically, it's a dropout that we use in the attention, but we will not be using it
00:38:02.660 | And the number of image tokens indicates how many output embeddings this vision transformer will output
00:38:11.060 | Which is how many
00:38:13.140 | Image embeddings we will have for each image
00:38:17.460 | Now before we saw that each an image encoder is something that converts an image into one single embedding
00:38:24.340 | So that represents all the information about that image
00:38:27.140 | But in the case of the vision transformer we can use all the outputs of the vision transformer, because as we saw before
00:38:33.940 | Vision transformer is a transformer model. So which takes as input
00:38:38.180 | A list of embeddings and it outputs a contextualized embedding
00:38:42.820 | So each of these contextualized embedding will be the tokens of our image
00:38:46.740 | so it will not be one single embedding that represents the whole image, but
00:38:49.940 | Lists of embeddings that represent a patch of each image, but also information about other patches through the attention mechanism
00:38:57.460 | But we will see this later. So now this class is very very basic. It's just a configuration of our SigLIP
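(A sketch of what such a configuration class might look like; the class name and default values follow what is described above for the 224-pixel PaliGemma variant, but treat the exact names and numbers as assumptions.)

```python
class SiglipVisionConfig:
    def __init__(
        self,
        hidden_size=768,          # size of the embedding vector of each patch
        intermediate_size=3072,   # size of the linear layer in the feed-forward network
        num_hidden_layers=12,     # number of layers of the vision transformer
        num_attention_heads=12,   # heads in the multi-head attention
        num_channels=3,           # RGB
        image_size=224,           # images are resized to image_size x image_size
        patch_size=16,            # each patch is 16x16 pixels
        layer_norm_eps=1e-6,      # epsilon for the layer normalization
        attention_dropout=0.0,    # dropout in the attention (not used here)
        num_image_tokens: int = None,  # how many image embeddings the encoder outputs
        **kwargs,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.image_size = image_size
        self.patch_size = patch_size
        self.layer_norm_eps = layer_norm_eps
        self.attention_dropout = attention_dropout
        self.num_image_tokens = num_image_tokens
```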
00:39:03.380 | Now let's start by coding the structure of this vision transformer. So let me copy this stuff here
00:39:13.700 | How to follow this video now:
00:39:16.260 | I am copying the code because I have already written it before and I want to explain it instead of
00:39:21.780 | Coding it, because it also allows me to copy the comments and also allows me to avoid any mistakes while coding it
00:39:29.220 | But I recommend that you code it from scratch. So you take this video and you just type whatever I am pasting here
00:39:37.460 | This is the best way to learn because it's like when you study a mathematical proof
00:39:42.500 | You should not just watch the proof on the piece of paper
00:39:45.860 | Because even if you think it makes sense to you
00:39:49.460 | It doesn't actually because when you write it by hand, so when you code each of these lines by hand
00:39:55.300 | Your mind will think why am I typing this? Why am I writing this? Why am I multiplying this number by this number? Why am I?
00:40:03.380 | Calling this function so you question yourself when typing
00:40:08.180 | That's why I recommend that you type this code while I am pasting it
00:40:12.420 | I do it by pasting otherwise this video will be 20 hours
00:40:17.140 | The first thing that we do is we create this vision
00:40:19.140 | Model, this vision model is made up of a transformer and it has a configuration
00:40:23.380 | So basically what we are doing is we take the pixel values of our image, which will be loaded with NumPy
00:40:29.300 | So when you load an image with NumPy it gets converted into an array that is channels by height by width
00:40:35.540 | But we can have a batch of images. That's why we have a batch size here. So the batch dimension
00:40:41.940 | And our vision transformer will convert this into batch size by num patches
00:40:47.140 | Where num patches is how many image tokens we have here, and each
00:40:51.300 | Vector will be of a fixed dimension called embeddim here
00:40:56.340 | So basically our vision model will take an image as you can see a batch of images and it will give us a batch of
00:41:04.100 | List of embeddings one list of embeddings for each image where each embedding is a vector of size embeddim
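(A minimal sketch of this wrapper; the class names are assumed to match the ones used in the video, and SiglipVisionTransformer is sketched a bit further below.)

```python
import torch
from torch import nn

class SiglipVisionModel(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.vision_model = SiglipVisionTransformer(config)

    def forward(self, pixel_values) -> torch.Tensor:
        # [Batch_Size, Channels, Height, Width] -> [Batch_Size, Num_Patches, Embed_Dim]
        return self.vision_model(pixel_values=pixel_values)
```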
00:41:11.480 | Okay. Now let's code the vision transformer, which is very simple also
00:41:16.760 | So let's do it also step by step actually
00:41:19.960 | so this vision transformer is basically a
00:41:23.400 | Torch layer
00:41:27.400 | Where we pass the configuration we save this embeddim, which is the hidden size
00:41:31.560 | We saw before which is the size of this embedding vector
00:41:34.360 | We first need to extract the embeddings from this
00:41:40.180 | We need to extract the patches from this image, which will be done with this layer. We will call SigLip vision embeddings
00:41:46.680 | Then we will run it through a list of layers of the transformer
00:41:51.060 | Which is this SigLIP encoder, because it resembles the encoder of the transformer
00:41:55.380 | Which is a series of layers of transformer and then we will have a layer normalization and we will see later how layer normalization works
00:42:02.100 | The forward method is very simple
00:42:07.060 | So the forward method is basically we take these
00:42:09.700 | Pixel values, which is a batch of images, and we convert them into embeddings, which is
00:42:16.100 | Which basically means that we are extracting the patches from these images. So let's visualize it here
00:42:21.860 | So what we are doing with this
00:42:25.540 | Image embeddings we are taking these images. We will run a convolution here to extract patches
00:42:32.260 | Then we will flatten these patches and add the positional encodings
00:42:35.960 | And this stuff here will be done by this SigLIP vision embedding
00:42:40.520 | then we take these embeddings which are
00:42:44.420 | Patches plus the positional encoding and we run it through this encoder, which is a list of layers of the transformer
00:42:51.300 | So this stuff here is our encoder. What is the encoder?
00:42:54.340 | Well, the encoder is a list of layers of the transformer
00:42:57.860 | So you can think of it as being a list of these layers here. Actually these layers here
00:43:02.820 | one after another which includes a multi-head attention, a
00:43:07.300 | normalization, a feed-forward network and the normalization
00:43:10.440 | In the case of the vision transformer the normalization is done before the feed-forward and before the multi-head attention, but that's the only difference
00:43:17.940 | So this part here, so a series of layers, is what
00:43:24.100 | We call the encoder, because it resembles the encoder side of the transformer
00:43:28.200 | And then we have a layer normalization. So now let's go to code this vision embeddings
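(For reference, the wrapper just described could look roughly like this; SiglipVisionEmbeddings is coded next and SiglipEncoder later in the video, so this is only a sketch under those assumptions.)

```python
import torch
from torch import nn

class SiglipVisionTransformer(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        embed_dim = config.hidden_size
        self.embeddings = SiglipVisionEmbeddings(config)  # patches + positional encodings
        self.encoder = SiglipEncoder(config)              # stack of transformer layers (coded later)
        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # [Batch_Size, Channels, Height, Width] -> [Batch_Size, Num_Patches, Embed_Dim]
        hidden_states = self.embeddings(pixel_values)
        last_hidden_state = self.encoder(inputs_embeds=hidden_states)
        return self.post_layernorm(last_hidden_state)
```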
00:43:34.500 | So we want to extract information about these patches
00:43:37.880 | Let's do it. Where are the vision embeddings? Here. Okay
00:43:46.900 | All right, so
00:43:53.860 | The vision embeddings is basically, okay
00:43:56.100 | Taking again the configuration because each of these models needs to have access to the configuration because they need to extract different
00:44:01.860 | Information from this configuration. So we have the embedding size, which is the size of the embedding vector, which is the hidden size
00:44:08.020 | The image size is how big is the image?
00:44:10.980 | And the patch size is how big is the patch that we want to get from this image. So basically we are talking about
00:44:20.900 | In this case the patch size, I remember, is 16
00:44:23.940 | Which means that this patch here is going to be
00:44:29.140 | 16 by 16 pixels
00:44:32.000 | How do we extract these patches? We do that through a convolution that is a 2d convolution, which it takes as input
00:44:38.740 | The number of channels of the image, so three channels, RGB, and it produces output channels equal to the embedding size
00:44:46.100 | So the hidden size
00:44:49.620 | The kernel size so as you remember the convolution works like this, so let's use the ipad actually to draw so
00:44:56.020 | The convolution works like this. So we have an image
00:44:58.900 | Which is made up of let's say pixels. So suppose this is the grid of pixels
00:45:05.400 | And we have a lot of them
00:45:09.780 | Basically the convolution works like this imagine the kernel size is three by three
00:45:16.020 | So we take a three by three group of pixels. We apply this convolution kernel
00:45:21.220 | So if you are not familiar with how convolutions work, I will not be reviewing that here
00:45:26.100 | But basically it means that we have a matrix here
00:45:28.260 | You multiply each number of this matrix by the value of the pixel on which it is applied to it will produce
00:45:35.780 | features
00:45:38.020 | one feature
00:45:39.700 | And then you slide this kernel to the next group of pixel then you slide it again
00:45:44.900 | Slide it again, etc, etc, and it will produce many features in the output features
00:45:49.700 | However, as input we have three channels, which you can think of as three
00:45:55.700 | Parallel images one that is only red one that is only green and one that is only blue
00:46:01.460 | We run this kernel on all of these channels and it will produce
00:46:05.220 | Features how many kernels do we have?
00:46:09.920 | Depending on how many output channels we want. So for each output channel, we have one kernel
00:46:15.440 | We have three kernels actually, one for each of the input channels
00:46:22.960 | The stride tells us how we should slide this
00:46:27.440 | Kernel from one group of pixels to the next, and we are using a stride that is equal to the patch size
00:46:34.240 | Which is equal to the kernel size. So which means that we take the first, oops
00:46:40.400 | We take the first group of, let's say, three by three pixels
00:46:43.440 | Then we skip three pixels - we slide it to the next group of three by three - so there is no overlap
00:46:49.600 | So we take this kernel here
00:46:51.680 | Then we slide it to this group of pixel here
00:46:54.400 | Then we slide it to this group of pixel here so that there is no overlap. So basically what we are taking is
00:46:59.280 | A list of features, each extracted from an independent patch of this image that we run the kernel on
00:47:07.840 | And the padding, if "valid", means that there is no padding added
00:47:11.200 | So basically this patch embedding is extracting information from our image patch by patch
00:47:18.000 | Where there is no overlap between these patches. How many patches do we have?
00:47:21.920 | Well, it's the size of the image which is 224 in the base version of
00:47:27.360 | PaliGemma divided by the patch size
00:47:31.200 | So image size is the number of pixels divided by how big is each patch and then to the power of two because we have
00:47:38.000 | Along two dimensions this image. So we run the patch. The patch is
00:47:41.840 | It's a square. So it's a 16 by 16 or 3 by 3 or whatever the number patch size is
00:47:49.600 | How many positions we have? So how many?
00:47:52.880 | Positional encodings we need well
00:47:55.360 | It's equal to the number of patches that we have because we need to encode information about where this patch came from
00:48:01.280 | So how many positional encodings we need equal to the number of patches that we have
00:48:06.080 | And what is each of these positional encodings? It's a vector. It's a vector of the same size as the patch embedding
00:48:11.920 | So it's equal to embed dim, you can see here
00:48:14.480 | And it's a learned embedding. So it's a positional encoding that is a learned
00:48:20.160 | Embedding. How many do we have? We have num positions of them, each of them with this size here
00:48:26.320 | And we will see later that each of them is added to the information extracted from the convolution
00:48:32.160 | So that each convolution output encodes information about where it came from in the image
00:48:37.360 | we register these positional IDs in the
00:48:40.800 | In the module which is just a list of numbers and we will use it later
00:48:47.440 | So this is just a range of numbers, so between zero and num positions minus one
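(A sketch of the embeddings module described so far: the convolution that extracts the patches, the learned positional embeddings, and the registered position_ids buffer; the names follow the description above but are assumptions.)

```python
import torch
from torch import nn

class SiglipVisionEmbeddings(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        # one embed_dim-sized feature vector per non-overlapping patch
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,   # stride == kernel size -> patches do not overlap
            padding="valid",          # no padding added
        )

        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches
        # one learned positional vector per patch, same size as the patch embedding
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer(
            "position_ids",
            torch.arange(self.num_positions).expand((1, -1)),
            persistent=False,
        )
```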
00:48:52.720 | Now let's implement the forward method
00:48:58.240 | This is the reason I like to
00:49:00.320 | Copy and paste the code because I can copy all the comments without typing them one by one. Otherwise, it will take me forever
00:49:06.000 | So what we do now is okay. We had our image which is a pixel values here
00:49:10.640 | The pixel values came from NumPy, so we will see later how we load the image
00:49:15.760 | But basically you have to think that you load the image with NumPy, and NumPy loads a
00:49:20.880 | Batch of images, which is a channel height and width. It's a tensor with three channels and with the height of the image and the width of the image
00:49:28.880 | We will see that this
00:49:31.840 | Height and width is equal to the same because we resize each image to the input size of the image expected by the model
00:49:38.320 | So, in our case, since we are using the smallest PaliGemma, we will resize each image to
00:49:42.960 | 224 by 224
00:49:47.040 | We extract this patch embeddings to this convolution so you can see here
00:49:51.520 | So this will basically take our image which is a batch of images and convert it
00:49:57.200 | Into a list of embeddings of this size
00:50:00.400 | So each image will be a list of embeddings of size embed dimensions
00:50:06.420 | How many patches do we have? Well, the number of patches
00:50:10.400 | For the height and the number of patches for the width
00:50:14.720 | In this case, it will always be the same so you can think of it as a number of patches a total number of patches
00:50:20.720 | Each of patches with the dimension embedding dimension
00:50:26.900 | And as we saw before we flatten these ones, so we extract them here. Let me delete it
00:50:34.480 | So we extract these patches
00:50:38.960 | So we run the convolution and then we flatten them here
00:50:43.440 | So basically the convolution will give us 1 2 3 4 5 6 up to 16 or whatever the number of patches is
00:50:49.920 | and then we convert it into a tensor where the
00:50:52.800 | The patches are flattened
00:50:55.120 | So the first patch is here and the last patch is the last element of this tensor and this is what we do here
00:51:00.880 | Here, because the output of the convolution is a two-dimensional grid of patches, but we don't want a two-dimensional grid
00:51:07.520 | We only want a one-dimensional long list of patches, and this is done by this flatten method here
00:51:13.520 | Then we transpose because we want the number of patches to come before the embedding dimension
00:51:19.300 | Because as input to the transformer we need to give a sequence of embeddings
00:51:24.480 | So that's why we want this num_patches dimension to come before so that it becomes a batch
00:51:29.600 | of sequence of embeddings and each embedding is a
00:51:33.360 | vector of size embedding dimension
00:51:37.360 | Each of these embeddings we add the positional encodings which positional encodings? Well the position
00:51:42.400 | Extracted from this embedding layer
00:51:46.140 | But which embedding do we want to extract? All the embeddings. So from 0 to
00:51:50.160 | Suppose we have 16 patches from 0 to 15
00:51:53.440 | Where is this information from 0 to 15? It's in this self.position_ids, which is a range
00:52:00.080 | So, as you remember, a range just generates a list of numbers between 0 and the argument minus 1
00:52:06.960 | So we extract all the positional encodings from this position embedding
00:52:12.240 | Layer, which is this embedding layer here. We add it to the embeddings
00:52:16.880 | So what we are doing basically is we flatten this embedding
00:52:20.320 | We did that before then we add a positional encoding vector extracted from the positional encoding layer
00:52:25.600 | And these positional encodings are learned. So learned why because this embedding layer here is a list of
00:52:32.320 | embeddings
00:52:34.800 | That, when the model is trained, will change according to the needs of the model, and that is how the position gets encoded
00:52:42.640 | So it's not like we are telling the model: this is position number one, this is position number two
00:52:48.000 | We add another embedding that is added to this
00:52:51.280 | patch
00:52:52.960 | each of these patches
00:52:54.480 | And then the model will learn to modify this positional embedding vector in such a way that they should encode the position
00:53:01.820 | Information because each of this position embedding is always added to the same patch
00:53:07.020 | So the first patch always receives the position number zero the second patch always the position number one
00:53:11.580 | We hope that the model actually tries to change this position embedding in such a way that they encode the positional information
00:53:17.580 | And actually it does, because the model actually learns
00:53:20.700 | To relate
00:53:23.580 | Patches with each other by using their positional information
00:53:27.660 | And the only way for the model to do that is to change this position embedding in such a way that they encode the position information
00:53:33.840 | If you remember from the vanilla transformer, we use the sinusoidal functions
00:53:38.300 | So if you want to look at the original transformer if you remember
00:53:43.740 | We have this position information
00:53:45.740 | Where is it here? So we create this position encoding using sinusoidal functions
00:53:52.780 | So instead of learning them we actually pre-compute them and then we force the model to learn the pattern
00:53:58.780 | Encoded by these sinusoidal functions in this case. We are not forcing the model to learn any pattern
00:54:04.060 | We want the model to create the pattern that is most useful for the model itself
00:54:08.220 | so we hope that the model will try to create this embedding layer in such a way that it creates some
00:54:15.260 | embeddings that are helpful for the model to
00:54:17.800 | to understand the position information
00:54:20.780 | and this is the meaning of
00:54:22.780 | position embedding
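Putting together the forward pass just described, a minimal sketch of it might look like the following (this continues the constructor sketched earlier, so the helper name and the exact shapes are assumptions):

```python
import torch

def embed_patches(pixel_values, patch_embedding, position_embedding, position_ids):
    """Sketch of the forward pass: image batch -> sequence of patch embeddings + positions."""
    # pixel_values: [batch_size, channels, height, width], e.g. [B, 3, 224, 224]
    patch_embeds = patch_embedding(pixel_values)  # [B, embed_dim, num_patches_h, num_patches_w]
    embeddings = patch_embeds.flatten(2)          # [B, embed_dim, num_patches]
    embeddings = embeddings.transpose(1, 2)       # [B, num_patches, embed_dim]
    # Add one learned positional vector to each patch embedding.
    return embeddings + position_embedding(position_ids)
```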
00:54:24.540 | Now we skipped before the normalization layer. So let's go actually to
00:54:29.020 | Understand what is normalization and how it works so that we always don't leave anything behind that is not explained
00:54:36.620 | All right. Let's talk about normalization. So imagine we have a list of linear layers
00:54:42.460 | Now a linear layer is defined by two parameters
00:54:46.700 | One is called the input features and one is called the output features
00:54:50.220 | Imagine we have input feature is equal to four and output feature is equal to four
00:54:54.300 | Actually, there is another parameter called bias
00:54:56.860 | So it indicates if the linear layer also has a bias term and suppose that it's true
00:55:02.540 | At the input of the linear layer we usually have a batch of items, and each item is made up of features
00:55:11.260 | Suppose that for now as input there is only one item and it's made up of four features
00:55:15.820 | And as you can see the input features are four
00:55:18.380 | What will happen with four output features is this the linear layer you can think of it
00:55:24.220 | As a number of neurons where the number of neurons equal to the number of output feature of this linear layer
00:55:31.180 | what each neuron does is basically it has a
00:55:34.780 | weight vector
00:55:37.900 | As you can see here made up of four weights
00:55:41.100 | How many weights does it have? Well equal to the number of input features that this layer accepts
00:55:47.900 | So which is a four
00:55:49.980 | What each neuron will do it will do the dot product of the incoming vector
00:55:55.100 | So the input vector x multiply dot product with the weight vector of this neuron plus the bias term
00:56:02.940 | Which is one number for each neuron
00:56:05.740 | And this basically dot product plus this bias will produce one output feature
00:56:10.540 | Because we have four neurons. We will have four output features
00:56:14.380 | So each neuron will do the same job, but each neuron will have its own weight vector and its own bias number
00:56:20.540 | So this one here will have its own weight vector different from the other ones and its own bias term here
00:56:25.900 | Then suppose that we have another
00:56:28.860 | Linear layer that takes as input four features and produces two output features
00:56:34.140 | So you can think of it as a linear layer with the two neurons
00:56:38.140 | where the first neuron has a weight vector made up of four numbers because
00:56:43.740 | The incoming vector has four features and then one bias term here
00:56:47.740 | It will produce an output vector of two items
00:56:51.420 | The first item will be this number here and the second item
00:56:54.860 | The second dimension will be the dot product of the weight vector of this second neuron with the input vector
00:57:01.260 | plus the bias term of the second neuron
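As a tiny illustration of what these two linear layers compute (the numbers are just the ones from the example above):

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(in_features=4, out_features=4, bias=True)  # 4 neurons, 4 weights each
layer2 = nn.Linear(in_features=4, out_features=2, bias=True)  # 2 neurons, 4 weights each

x = torch.tensor([[1.0, 2.0, 1.5, 0.5]])    # one item with 4 input features

out1 = layer1(x)                            # shape [1, 4]
# Each output feature is the dot product of x with one neuron's weight vector, plus its bias:
explicit = x @ layer1.weight.T + layer1.bias
print(torch.allclose(out1, explicit))       # True

out2 = layer2(out1)                         # shape [1, 2]
```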
00:57:04.460 | Now, what is the problem with
00:57:06.460 | With the linear layers, but actually with all layers in general
00:57:12.140 | The problem is this it's called the covariate shift. The problem is that
00:57:16.220 | When you have an input vector
00:57:18.860 | That changes from one batch to another in magnitude
00:57:24.240 | Then the output of the layer will also change in magnitude a lot depending on what is the incoming vector
00:57:32.860 | So for example, imagine this the first input vector is all the numbers are more or less around one and two
00:57:40.460 | And the output is also more or less around
00:57:43.580 | suppose around two
00:57:45.980 | Then if the next vector that is coming to this layer is
00:57:49.660 | Much different in magnitude from the first one then the output will also be much different in magnitude
00:57:55.360 | And this is a problem for the model
00:57:58.220 | So the problem is that if the input of a layer changes, then the output of this layer will also change a lot
00:58:04.140 | So if the input changes drastically the output will also change a lot drastically
00:58:08.160 | then because the loss of the
00:58:10.940 | Of a model during training depends on the output then the loss will also change a lot because the loss
00:58:17.820 | Then determines the gradient during backpropagation
00:58:21.200 | It means that if the loss changes a lot then also the gradient will change a lot and if the gradient changes a lot
00:58:27.020 | Then because the gradient determines how we update the weights of the model during training then also the update of these weights will also change a lot
00:58:36.300 | Basically, what happens is that if the distribution of the
00:58:41.340 | Dimensions of this vector that is coming to the input of a layer
00:58:45.660 | Changes drastically from one batch to the next
00:58:49.260 | Then the output of the model will also change and then the loss will change then the gradient will change then the update of the weights
00:58:55.500 | Will change so what we will see that the loss will oscillate a lot
00:58:59.020 | And also the weights will try to keep up with this changing input distribution
00:59:03.840 | Which basically will result in a model that trains slowly. So here I have made a simple
00:59:09.900 | How to say
00:59:13.580 | Summary of what is happening
00:59:14.700 | So a big change in the input of a layer will result in a big change in the output of a layer, which will result
00:59:20.540 | In a big change in the loss of the model, which will result in a big change in the gradient
00:59:25.840 | During backpropagation, which will result in a big change in the weights of the network
00:59:31.580 | And the result of this is that the network will learn very slowly, because the network will spend most of its
00:59:37.020 | Time, or most of its effort, trying to keep up with this distribution change in the input
00:59:43.580 | Instead of actually learning the features
00:59:46.140 | Of how to map the input to the output
00:59:50.300 | So the the first solution to this problem was batch normalization, which was introduced in this paper
00:59:55.660 | And with batch normalization what we do basically is that we have usually not a single item as input
01:00:01.740 | We have a batch of items suppose that we are training a classification image classification model
01:00:07.260 | So we have as input a list of images
01:00:10.460 | For example the image of a cat the image of a dog of a zebra of a tree of a stone etc, etc
01:00:16.220 | So you can think these are the dimensions of the vector that represent the cat
01:00:20.220 | These are the dimensions of the vector that represent the dog. These are the dimensions of the vector that represent the zebra etc, etc
01:00:25.820 | So what we do with batch normalization is that we calculate statistics
01:00:30.240 | For each dimension, computed across all the items in the batch
01:00:35.100 | Which statistics do we calculate? The mean and the variance, and then we
01:00:42.680 | Normalize each item by subtracting the mean and dividing it by the standard deviation
01:00:48.620 | this will basically make each
01:00:51.020 | Dimension of each item be distributed
01:00:54.380 | According to a Gaussian with mean zero and the variance of one
01:00:58.780 | so basically what will happen is that
01:01:01.580 | each if we normalize each number if
01:01:05.420 | Because the image of a cat is much different from the image of the zebra
01:01:10.380 | Because the color distribution is different. The rgb distribution is different. So the pixel intensity is much different from each other
01:01:16.780 | What will happen is that the model will not see this change in magnitude
01:01:21.580 | but it will see
01:01:23.100 | And also will not see a change in distribution because all of these items will be distributed according to a mean of zero and the variance
01:01:30.140 | of one
01:01:31.420 | So what will happen is that the model will oscillate less in the output. So it will oscillate less in the loss
01:01:36.860 | So it will oscillate less
01:01:39.260 | In the gradient, so it will make the
01:01:41.500 | Weights of the model oscillate less
01:01:44.300 | So the model the training will be more stable. It will be it will converge faster basically this way. So
01:01:50.940 | To summarize
01:01:54.860 | Why do we need normalization is because the input of the model which depends on imagine you are training
01:02:00.860 | Classification or the image classification model then the input depends on the image and the image can be much different from each other
01:02:07.580 | If the image changes a lot, we don't want the model to feel this change in magnitude of the input
01:02:13.500 | We want the distribution of the inputs to be remain constant. Let's say
01:02:17.340 | So that the model doesn't oscillate so that this doesn't force the model to kind of just to keep up with the distribution
01:02:24.560 | This change in distribution. How do we do that? We we try to keep the distributions
01:02:29.520 | Constant so always try to have the input features to be distributed according to a fixed distribution
01:02:35.100 | Which is mean of 0 and variance of 1, and we do that with this formula here, which comes from probability and statistics. Basically, for each
01:02:42.060 | Distribution, if you subtract its mean and divide by the standard deviation, it will result in a distribution with mean 0 and variance of 1
01:02:49.980 | Of course, the result is Gaussian only if the input distribution was itself Gaussian
01:02:58.220 | And this will basically result in a more stable training
01:03:02.060 | Now, batch normalization actually works fine. However, it has a problem, and the problem is that
01:03:07.580 | With batch normalization each of these statistics, so the mu and the sigma, are calculated
01:03:13.840 | Along the batch dimension. So we calculate the mu and the sigma for the dimension number one of each of these vectors
01:03:21.820 | Along the batch dimension. So basically to calculate this mean we are summing up the first dimension of each of these vectors
01:03:29.420 | And divided by the number of items that we have
01:03:31.740 | So we are mixing the features of different items
01:03:35.820 | So we are mixing the dimension number one of the cat with the dimension number one of the dog
01:03:42.940 | so basically to to have good results, we need to use a big batch because
01:03:47.660 | If we use for example a cat and the dog it will result in one mean
01:03:52.780 | But imagine in the next batch, we have the cat and the zebra it will result in a completely different mean
01:03:58.620 | And then the next supposing the next batch we have a cat and the tree maybe it results in another different mean
01:04:04.700 | So we will still have this problem of covariate shift, because the mean is changing a lot between each iteration
01:04:11.120 | So the only solution to this actually is to use a very big batch size
01:04:15.340 | So we are forced to use a big batch size in order to alleviate this problem
01:04:19.660 | Of kind of mixing the dimensions along the batch dimension
01:04:25.980 | So we introduce layer normalization. With layer normalization
01:04:28.860 | What we do is instead of calculating the statistics along the batch dimension
01:04:33.900 | We calculate them along the item dimension
01:04:36.220 | So the mu and the sigma that will be used to standardize the cat will only be
01:04:41.900 | Dependent on the dimensions of the cat, not on whatever other items the cat comes with in the batch
01:04:48.300 | So we are still doing each item minus its mean divided by the standard deviation
01:04:55.580 | But instead of this standard deviation and this mean coming from the first dimension of each item across the batch
01:05:00.620 | They come from the average over
01:05:03.180 | All the dimensions of each item, independently from the others
01:05:07.420 | So it doesn't matter which other item the cat comes with it will always result in more or less the same mu and
01:05:14.140 | Same sigma
01:05:17.660 | And this makes the training even more stable because we are not forced to use a big batch size
01:05:24.620 | And this is why we use normalization
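To make the difference concrete, here is a small sketch of which axis the statistics are computed over in the two cases (plain tensor operations, leaving out the learnable scale and shift that the real BatchNorm/LayerNorm layers also have):

```python
import torch

x = torch.randn(3, 4)  # a batch of 3 items (cat, dog, zebra), each with 4 features

# Batch normalization: mean/variance per feature, computed ACROSS the batch,
# so the cat's statistics depend on which other items happen to be in the batch.
bn_mean = x.mean(dim=0, keepdim=True)                 # shape [1, 4]
bn_var = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-6)

# Layer normalization: mean/variance per item, computed across its OWN features,
# so each item is normalized independently of the rest of the batch.
ln_mean = x.mean(dim=-1, keepdim=True)                # shape [3, 1]
ln_var = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-6)
```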
01:05:27.120 | Okay, we have seen what normalization is. Now we should implement this thing called the encoder, so this is the SigLIP encoder
01:05:36.700 | Now the encoder is made up of multiple layers of the transformer model
01:05:41.980 | And the architecture more or less if you look at the vision transformer paper, it is like this
01:05:47.580 | So I changed it a little bit because I wanted to use the exact names that we will be using
01:05:53.660 | So, first of all, what we have so far is this thing called the SigLIP vision embeddings
01:05:58.460 | Which is basically taking the image, it is
01:06:00.540 | Taking some patches of this image using a convolution, and each
01:06:05.740 | Output of this convolution is used as an embedding. It's a vector
01:06:10.380 | And this embedding vector is added to another
01:06:14.300 | Vector called the positional encoding which is learned and then we feed this stuff to this thing called the encoder
01:06:21.260 | So we convert it into embeddings at the positional encoding then we feed it to the encoder
01:06:25.340 | And at the input of the encoder you need to think that we have
01:06:28.620 | These layers repeated n times here. It's written l times
01:06:33.340 | One after another such that the output of one becomes the input of the next layer
01:06:38.780 | the thing that you need to understand about the transformer is
01:06:42.460 | I repeat it is that the transformer is a sequence-to-sequence model that converts a sequence of embeddings into contextualized embeddings
01:06:51.280 | What does it mean? It means that at the input you have a list of
01:06:54.560 | Here embeddings each representing a patch of the image as an independent patch
01:07:01.520 | So this embedding here only captures information about the first group of pixels
01:07:06.000 | This embedding here captures all information about the second group of pixels, etc, etc, etc
01:07:10.560 | But then some through some magic called
01:07:13.760 | Attention mechanism this contextualized these embeddings become contextualized at the output of the transformer and we will see in detail this
01:07:21.520 | attention mechanism
01:07:23.600 | Such that this embedding here at the output of the transformer, the first embedding
01:07:28.240 | Represents information not only about the first patch but also about the other patches
01:07:36.080 | And so is the second the third the fourth and the last one
01:07:40.320 | So they become contextualized in the sense that they capture information about the context in which they appear
01:07:46.400 | Which is different from language models in which each token captures information about the previous tokens in the case of the vision transformer
01:07:54.560 | Each patch includes information about all the other patches
01:07:57.600 | Now each of these layers
01:08:01.440 | Is made up of the following. So we have, let's say, the input of the encoder
01:08:07.360 | And we will have the first layer of this encoder
01:08:10.480 | The first thing that we do is we apply a layer normalization and we saw how it works and why we use it
01:08:15.840 | First, the input of this layer normalization
01:08:18.800 | Is saved for a skip connection that we use later
01:08:23.680 | Then the output of this layer normalization is sent to the self-attention mechanism
01:08:28.260 | It's this one here and this self-attention mechanism takes the output of the layer normalization as a query key and values
01:08:37.520 | It calculates the attention just like the usual formula
01:08:40.000 | So softmax of the query multiplied by the transpose of the key, divided by the square root of d_k, multiplied by V, etc etc
01:08:46.000 | The output of this self-attention is then summed up with this skip connection here
01:08:51.920 | Then the output of this summation is sent to this layer normalization along with the skip connection that is used later
01:08:58.480 | Then the output of the normalization is sent to this multi-layer perceptron, which is a list of linear layers
01:09:03.840 | We will see later and then we do another summation here with the skip connection plus the output of the multi-layer perceptron
01:09:10.180 | And then we do another layer like this and another another another and the output of the last layer is the output of our vision
01:09:18.320 | transformer. So as you can see
01:09:20.380 | the vision transformer takes as an input an image converted into patches. Patches are then fed to this
01:09:28.160 | Encoder which is a list of layers and the output is a contextualized
01:09:31.140 | patches or embeddings of these patches
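Conceptually, the encoder is just a stack of identical layers, where the output of one layer becomes the input of the next. A minimal sketch (the layer itself is the block we code next, so it is passed in as a factory here):

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Sketch of the encoder: a stack of identical layers applied one after another."""

    def __init__(self, make_layer, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList([make_layer() for _ in range(num_layers)])

    def forward(self, hidden_states):
        # [batch, num_patches, embed_dim] in, same shape out, but contextualized
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states
```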
01:09:33.860 | So let's code this encoder, which is basically this structure here
01:09:39.120 | And we will code each part of this structure and while coding each part we will go inside on how it works
01:09:46.880 | So the normalization we already know how it works, but we still have to explore what is this stuff here called the self-attention
01:09:52.580 | What is this stuff here called multi-layer perceptron?
01:09:56.240 | I believe it's convenient for us to go first through multi-layer perceptron and then we go to the self-attention
01:10:02.080 | I think because the self-attention is a little longer to do. So let me do the simple part first
01:10:06.480 | Okay, let's code this encoder
01:10:09.920 | Now I will copy the first part
01:10:13.440 | This one here, so let's copy it here
01:10:17.520 | So the encoder is made up of again, the constructor is made up of the configuration
01:10:22.240 | We save some stuff, which is the hidden size, and then we have a block called the self-attention block; in this code
01:10:28.000 | It's called SiglipAttention. Now
01:10:31.200 | Note about the naming I'm using. So I am using the same names as
01:10:35.600 | the HuggingFace implementation
01:10:38.560 | For only simple reason which is I want to be able to load the pre-trained weights from HuggingFace
01:10:44.240 | So the pre-trained weights for PaliGemma are available on the HuggingFace Hub
01:10:49.600 | So we want to be able to load them
01:10:51.680 | But each of these pre-trained models has this dictionary of weights
01:10:57.040 | So where the dictionary tells you where to load each of these weights
01:11:01.520 | And if the names do not match you need to create some conversion script
01:11:04.720 | So I didn't want to do that and also it would just complicate the code uselessly
01:11:08.980 | So I just use the same names so that we can
01:11:12.240 | Load basically the pre-trained weights from HuggingFace
01:11:17.440 | Also because my code is based on the HuggingFace implementation
01:11:20.480 | So to create my code I use the HuggingFace implementation, but simplified a lot a lot a lot
01:11:25.680 | For example, I remade my own KVCache. I did a lot of
01:11:29.040 | Modifications to simplify it but it's based on the HuggingFace implementation
01:11:34.100 | anyway
01:11:36.080 | So we have this thing called the self-attention then we have a layer normalization. So we saw it's
01:11:40.400 | Where is it? And we have this layer normalization here
01:11:43.360 | Then we have this multi-layer perceptron, which is this stuff here. And then we have another layer normalization, which is this stuff here
01:11:49.920 | So we have two layer normalization. So now let's implement the forward method
01:11:54.480 | And the forward method I will copy it line by line so we can understand
01:11:58.960 | Okay this forward method. Now. The first thing we do is we save a residual connection, which is
01:12:05.680 | We basically save the input that we feed to this
01:12:09.260 | Encoder because we need to reuse it later. So we are saving this skip connection because we will need to use it here later
01:12:14.860 | Then we run it through the layer normalization the input
01:12:19.500 | And it's done here. So the layer normalization does not change the shape of the input
01:12:25.020 | It's just normalizing each of these dimensions such that they all look
01:12:30.700 | Like they came out from a Gaussian of mean zero and variance of one
01:12:36.860 | Then we apply this magic thing that we will explore later called the self-attention and the self-attention system
01:12:42.380 | Also does not change the shape of the input
01:12:44.700 | Tensor, but as we saw before the attention mechanism is something that takes as input
01:12:50.140 | Embeddings and gives you contextualized embeddings. So it does not change the shape of these embeddings
01:12:55.600 | But we will implement it later. So for now just think of it as a black box that you feed in
01:13:00.700 | Embeddings and it gives you contextualized embeddings
01:13:03.980 | Then we have a residual connection and we can see that here. So this residual connection
01:13:09.500 | Skip connection was called
01:13:12.220 | Which is this first plus here
01:13:14.060 | So we are taking what we saved before with the output of the self-attention
01:13:18.300 | So what we saved before is this residual stuff here plus the output of the self-attention, which is this hidden states here
01:13:23.740 | This the result of the summation is saved again because there is another skip connection
01:13:29.100 | after
01:13:31.580 | I don't know why my alt tab is not working. So
01:13:33.580 | We save again another
01:13:36.380 | This stuff here. So we save it because later we need to use it here for the skip connection
01:13:40.860 | Then we apply another layer normalization, which also does not change the shape of the input
01:13:49.340 | tensor
01:13:52.060 | And then we have this thing called the multilayer perceptron. Now the multilayer perceptron is something that
01:13:57.820 | It's not easy to explain what is used for but basically
01:14:01.100 | The multilayer perceptron we will see later is a series of
01:14:05.100 | Linear layers that takes each
01:14:09.740 | input embedding and
01:14:13.500 | Transforms it independently from each other from the others
01:14:17.820 | So while in the self-attention there is kind of a mixing of the patches incoming so that you get contextualized
01:14:24.380 | In the multilayer perceptron, there is no mixing between these let's call them tokens or patches
01:14:29.180 | Each of them is transformed independently
01:14:32.560 | And the multilayer perceptron allow us to increase basically first of all it adds parameters to the model. So the model has more
01:14:40.060 | Degrees of freedom to learn whatever it's trying to learn
01:14:43.980 | and the second
01:14:46.380 | Objective of the multilayer perceptron is that it allow to prepare
01:14:50.220 | Let's say prepare the the sequence of patches for the next layer. So if the next layer expect these patches to be somehow
01:14:57.980 | Different the multilayer perceptron allow to transform them
01:15:02.300 | Also, it adds a non-linearity. So the multilayer perceptron also includes a non-linearity
01:15:08.060 | Which, as you know, non-linearities allow you to model more complex transformations
01:15:15.900 | So if you just create a list of linear layers without any non-linearities, you cannot model complex functions, so that, for example
01:15:22.220 | In classification you cannot
01:15:24.300 | Separate non-linearly separable data, but by adding
01:15:29.900 | Non-linear transformations you add complexity to the model, so the model is able to learn complex mappings
01:15:38.400 | So the multilayer perceptron just adds parameters and this non-linearity which is helpful to
01:15:45.420 | To to allow the model to learn whatever complexity it needs
01:15:49.180 | To to map the input to the output
01:15:52.620 | After the multilayer perceptron, I guess we have a
01:15:57.740 | Yeah, we have another skip connection and then we return the output of this skip connection here
01:16:04.140 | and also the skip connection does not change the shape of the
01:16:07.260 | Of the tensors of the embeddings
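So the forward pass of one encoder layer, as just described, is roughly the following. Attention and MLP are treated as black boxes here, and the exact epsilon is an assumption:

```python
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Sketch of one encoder layer: norm -> attention -> residual, norm -> MLP -> residual."""

    def __init__(self, embed_dim, self_attn, mlp, eps=1e-6):
        super().__init__()
        self.self_attn = self_attn               # placeholder for the attention block
        self.mlp = mlp                           # placeholder for the MLP block
        self.layer_norm1 = nn.LayerNorm(embed_dim, eps=eps)
        self.layer_norm2 = nn.LayerNorm(embed_dim, eps=eps)

    def forward(self, hidden_states):
        # hidden_states: [batch, num_patches, embed_dim]; the shape never changes
        residual = hidden_states                        # saved for the first skip connection
        hidden_states = self.layer_norm1(hidden_states)
        hidden_states = self.self_attn(hidden_states)   # contextualizes the patches
        hidden_states = residual + hidden_states        # first skip connection
        residual = hidden_states                        # saved for the second skip connection
        hidden_states = self.layer_norm2(hidden_states)
        hidden_states = self.mlp(hidden_states)         # per-patch transformation
        return residual + hidden_states                 # second skip connection
```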
01:16:10.880 | Now, let's code first this multilayer perceptron. It's the easiest stuff to do
01:16:15.100 | So let's do it
01:16:18.300 | Let's go here. I I will also always copy first the
01:16:21.980 | Constructor and then the forward method so we can explore a little bit the structure and then we explore the logic
01:16:27.660 | So this multilayer perceptron just like in the vanilla transformer is made up of two layers
01:16:33.660 | plus a non-linear transformation
01:16:36.780 | So the first layer takes each of the embeddings which are we we can also call them tokens or patches
01:16:43.820 | Because most of the time we are dealing with language models and expands them
01:16:47.980 | So each of these vectors which is of size hidden size is expanded into this thing called intermediate size
01:16:55.180 | Usually it's chosen as three times the hidden size or four times the hidden size
01:17:00.380 | I remember in the vanilla transformer it was four times the hidden size
01:17:03.260 | Then we apply a non-linearity to this expanded tensor and then we compress it back to the hidden size dimension
01:17:12.780 | So let's do the forward method now
01:17:14.780 | Which is this one here
01:17:17.420 | So the first thing we do is we convert each of these embedded dimensions into intermediate sizes
01:17:23.340 | So again, we have a batch of images
01:17:26.060 | Each image is made up of num_patches number of patches each of this patch is represented by a vector of size embedding dimension
01:17:33.420 | With the first fully connected layer, we are expanding each of these patches into the intermediate size and then we apply
01:17:42.460 | A non-linear transformation in this case. It's the gelu function now
01:17:46.380 | You may be wondering why we are using the GELU function or the SwiGLU function or whatever non-linearity there is
01:17:52.620 | The reason is always practical. So
01:17:55.660 | Basically
01:17:58.540 | There is a there is no like a rule of thumb for choosing the non-linearities to use for a specific case
01:18:05.020 | There are just some heuristics
01:18:07.820 | And the heuristic is that, when the transformer was originally introduced, it used the ReLU function as the non-linearity
01:18:13.840 | Between these two fully connected layers
01:18:16.540 | But then people explored other non-linearities, like GELU, and they saw that they work better
01:18:21.500 | Now, there is actually also some logic behind the choice of a non-linearity
01:18:25.980 | Because the non-linearity also defines the flow of the gradient
01:18:29.820 | So for example, if you use the gelu function, if you look at the graph of the gelu function, let me draw it actually
01:18:36.940 | The graph of the gelu function is something like this. So
01:18:39.580 | Why I cannot draw it, okay
01:18:43.020 | So, roughly speaking, anything that is very negative becomes almost zero. Let me use another color
01:18:49.100 | Anything that is very negative becomes close to zero, and everything positive is forwarded almost unchanged; GELU is a smooth version of ReLU
01:18:56.880 | So this means that if the input of the GELU function is very negative, the output will be close to zero, and for
01:19:06.220 | Very negative inputs there will be almost no gradient, because the local derivative is close to zero, so the gradient will barely flow
01:19:10.860 | That's why for example, we introduced the leaky relu and other like
01:19:14.940 | In the relu family, there are other
01:19:18.060 | Functions that allow also a little bit of gradient flow from the negative side
01:19:23.660 | So the non-linearity basically tells you
01:19:27.020 | How the gradient will flow during back propagation. So having a non-linearity
01:19:35.980 | that allows
01:19:37.980 | That allows the gradient to flow back even when it's negative
01:19:40.940 | It means that the signal the model is not forced to always have the activation to be positive to have some
01:19:46.860 | Feedback from the loss function to optimize its weights
01:19:49.900 | And why we are using the gelu because people have tried it and probably it works better
01:19:56.780 | compared to the relu function for the same class of
01:20:00.140 | applications so in the vision transformer you see the gelu function, but
01:20:08.300 | In Llama, for example, they use the SwiGLU function; in other scenarios
01:20:08.300 | They use other functions and it's mostly based on heuristics on how they work in practice
01:20:13.980 | also, because a model is usually made up of billions and billions and billions of
01:20:24.860 | Of parameters, and it's not easy to find a regularity to understand why a
01:20:24.860 | Specific non-linearity is working better than the other one
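For intuition, here is a quick comparison of ReLU and GELU on a few values (the GELU outputs below are approximate):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(F.relu(x))   # tensor([0., 0., 0., 1., 3.])
print(F.gelu(x))   # roughly [-0.004, -0.159, 0.000, 0.841, 2.996]: smooth, slightly negative
```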
01:20:30.380 | Now, okay, then we apply the second linear layer
01:20:33.980 | Which is basically recompressing back this intermediate state into the embedding size and then we return it
01:20:39.980 | and this is our
01:20:42.860 | multilayer perceptron
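A sketch of this block (the hidden and intermediate sizes are example numbers, and the tanh approximation of GELU is an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class MLPSketch(nn.Module):
    """Sketch of the MLP: expand each token, apply a non-linearity, compress it back."""

    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)  # expand
        self.fc2 = nn.Linear(intermediate_size, hidden_size)  # compress back

    def forward(self, hidden_states):
        # [batch, num_patches, hidden_size] -> [batch, num_patches, intermediate_size]
        hidden_states = self.fc1(hidden_states)
        hidden_states = F.gelu(hidden_states, approximate="tanh")
        # -> back to [batch, num_patches, hidden_size]
        return self.fc2(hidden_states)
```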
01:20:44.860 | our next part is going to be
01:20:47.340 | we are going to code this attention mechanism for the vision transformer and we will see that it's
01:20:59.980 | Different from the one in language models, because we don't have any causal mask or attention mask
01:20:59.980 | All right guys, so we have seen the multilayer perceptron now
01:21:04.460 | Let's go to the multi-head attention and for that
01:21:07.180 | I want to use the slides because I believe it's a little faster to explain on the slides and then we proceed with the code
01:21:13.420 | So what is the multi-head attention? The multi-head attention is a way of contextualizing stuff
01:21:19.420 | Which means that you start with a sequence of for example patches and you can think we have for example
01:21:26.140 | Four patches each of this patch is represented by a single vector of 1024 dimensions
01:21:32.620 | So you need to think of this as a vector of 1024 dimensions. So you need to think there are
01:21:37.340 | 1024 numbers in this row vector
01:21:40.700 | Then we have the patch number two the patch number three and the patch number four
01:21:44.700 | Each of this patch was extracted from a group of pixels from the initial image and it's only representing information about the patch
01:21:51.980 | It was extracted from so the part of the image it came from
01:21:56.300 | With the multi-head attention system. We uh, what we mechanism what we are doing is we are contextualizing these patches
01:22:03.820 | Which means that the output of the multi-head attention is a tensor of the same size
01:22:08.300 | As the input so this is a tensor of size 4 by 1024
01:22:12.480 | the output will be a tensor of size 4 by 1024, but where each of these
01:22:19.260 | Embeddings now does not capture information only about itself, but also about the other patches
01:22:25.820 | in the in the sequence
01:22:27.820 | This is for vision transformer for the language models we want something slightly different
01:22:34.220 | So for language models, we do have an input sequence, which is a sequence of tokens each token representing one single
01:22:41.020 | I don't want to use the term word because it's wrong but
01:22:44.780 | In my videos, I always make the simplification that each token is a word and each word is a token
01:22:49.740 | But this is not the case actually in tokenizer. So usually a token can be just any sequence of characters
01:22:56.320 | Does not does not necessarily be um, it does not need to be necessarily a word
01:23:01.660 | But for us let's treat them as word. It's just simplifies the explanation
01:23:07.340 | We have a list of tokens. Each token is represented as an embedding. Let's say of 1024 dimensions
01:23:14.140 | So it's a vector of 1024 dimensions. So
01:23:17.400 | 1024 numbers for this one 1024 numbers for this one, etc, etc
01:23:21.720 | The multi-head attention in the case of language models
01:23:25.480 | What we want is we want to contextualize each token with the all the tokens that come before it
01:23:31.640 | So the output of the multi-head attention in the case of language models
01:23:35.560 | And this is this would be known as the self-attention mechanism with causal mask
01:23:43.160 | Is a sequence with the same shape as the input sequence
01:23:47.320 | So this vector this matrix here is a 4 by 1024. So the output will be 4 by 1024
01:23:53.180 | And each of these tokens is not capturing information only about itself
01:24:00.120 | But also about all the past tokens now the word I does not have any past token
01:24:04.920 | So it will only capture information about itself
01:24:07.720 | But the word love will capture information also about the token I because it comes before it and the word
01:24:13.160 | Pepperoni will capture information about I and love because they come before it etc, etc until the last token which capture information about all the sentence
01:24:21.080 | Why do we want to do this in language models?
01:24:25.160 | Let me give you a little understanding of why we do it in this way with language models and why the transformer is
01:24:32.280 | revolutionary for language models
01:24:35.480 | This is going a little off topic with respect to the vision transformer
01:24:38.600 | But I think if you understand this then you will understand the big part of the transformer and why it even exists
01:24:43.640 | So let's copy this stuff here
01:24:46.120 | Let's open a new page
01:24:48.600 | Now what we do with the language models is you need to think that a language model is
01:24:53.640 | Something that we need to we retrain on what is known as the next token prediction task
01:24:59.480 | Which means that given a prompt the language model try to understand what is the next token that completes this prompt
01:25:05.560 | How do we generate text with the language model? We start with some tokens, which are the prompt we generate the next token
01:25:11.480 | We put it back into the prompt and we ask again the language model
01:25:14.120 | What is the next token the language model gives us the next token?
01:25:16.680 | Then we put it back into the prompt and then we ask again. What is the next token etc, etc
01:25:20.280 | So we need to train a language model to train a language model
01:25:24.600 | We need to train a model to predict the next token given the past tokens
01:25:29.320 | And the transformer allow us to do that in parallel when training
01:25:35.000 | Which means that we start with an input that is a series of embeddings
01:25:39.340 | Which are uncontextualized so we start with this one and each of these actually is one single token. So this is only I this is only love
01:25:47.960 | This is a pepperoni
01:25:50.760 | And this is a pizza
01:25:54.600 | The output of the transformer of the self-attention mechanism will be a series of
01:26:01.400 | embeddings that are
01:26:11.240 | Contextualized, in such a way that each token captures information not only about itself, but also about all the past tokens
01:26:11.240 | How do we train and the transformer can do it in parallel?
01:26:14.840 | So the self-attention mechanism will take this as input and generate this output in parallel
01:26:19.800 | So it's not will generate one token at a time, but it will generate all of them in the in parallel using this multi-head attention
01:26:27.240 | How do we train a language model basically?
01:26:31.340 | As we saw before the language model is something that given a prompt needs to predict the output. So what we want is that
01:26:43.020 | We take the input, which is
01:26:49.340 | This sentence here. We feed it to the transformer, and the transformer will transform it into a sequence of
01:26:54.060 | Contextualized embeddings, and then we need some labels to train this language model
01:27:00.300 | So what will the labels be? Well, we want that whenever the language model
01:27:03.420 | Is given the word I, it predicts the word love
01:27:03.420 | So big, oh, I think i'm using not the pen
01:27:09.420 | the word love
01:27:11.180 | whenever the
01:27:16.640 | Language model sees the sequence I love, it should predict the word pepperoni
01:27:26.620 | Whenever it sees the sequence I love pepperoni, it should predict pizza
01:27:26.620 | Whenever it sees the sequence I love pepperoni pizza
01:27:29.820 | It should predict the token end of sentence, which is a special token telling hey, I'm done with the generation
01:27:36.000 | Because the transformer can generate all of these contextualized embeddings in parallel
01:27:41.820 | we can also calculate the loss for each of these predictions in parallel and
01:27:46.300 | Calculate, with backpropagation, the updates to the weights of the model, telling it in parallel
01:27:53.020 | How the model should predict each of these tokens given
01:27:56.780 | The previous tokens. So when we are given a sentence and we train the language model, the language model
01:28:02.540 | Can be trained
01:28:05.820 | With only one forward pass on how to predict the next token inside of this sentence given the previous tokens as context
01:28:13.180 | In only one single pass of the transformer. That's why the transformer is so powerful because this contextualization happens in parallel
01:28:19.900 | So we can calculate the output in parallel for each position
01:28:22.540 | And because we know already know what is the label because the label is just the next token given the previous tokens
01:28:28.220 | We can calculate the loss in parallel for each position, and the model will learn in parallel how to
01:28:33.180 | Generate exactly this sentence in one pass only
01:28:37.820 | So the model will not learn to generate one token at a time given the previous ones, but
01:28:43.100 | The whole sentence in one pass, and that's why it's so powerful
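As a toy illustration of this parallel training (not the video's code; the token ids and logits here are made up):

```python
import torch
import torch.nn.functional as F

vocab_size = 10
tokens = torch.tensor([[1, 4, 7, 2, 9]])   # e.g. "I love pepperoni pizza <eos>"
logits = torch.randn(1, 5, vocab_size)     # what a language model would output, one row per position

# Each position must predict the NEXT token, so predictions and labels are shifted by one.
shift_logits = logits[:, :-1, :]           # predictions made at positions 0..3
shift_labels = tokens[:, 1:]               # the token that actually follows each position

# A single cross-entropy over all positions: the loss for every position is
# computed in parallel, in one forward pass.
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
```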
01:28:47.740 | Now let's go back to our vision transformer
01:28:49.740 | Okay, so we have seen what is the difference between the vision transformer and the language model
01:28:54.220 | So in the vision transformer, we want to contextualize tokens or patches
01:28:57.980 | In such a way that they capture information about all the other patches
01:29:02.220 | But in the language model, we want each token to only capture information about itself and the previous tokens
01:29:06.940 | How does this self-attention mechanism work?
01:29:10.300 | We start with of course an input sequence. Our goal is to create an output sequence that is contextualized
01:29:16.380 | And there are many intermediate steps. So now we will see what are these intermediate steps one at a time
01:29:23.340 | Let's start by creating the class of this this attention mechanism and we will create it. Let's create it here
01:29:29.900 | Okay, so in the input we have the configuration of the model we save some stuff that we will need later
01:29:37.580 | So the hidden size the number of attention heads because we are dealing with multi-head attention
01:29:43.660 | Head dimension we will see later what is it and why it's used
01:29:47.020 | The scale is basically the if you remember the formula for the attention is
01:29:51.260 | The queries multiplied by the transpose of the keys, divided by the square root of d_k
01:29:57.340 | And this scale is one over the square root of the head dimension
01:29:59.740 | So the stuff that we need to divide the query multiplied by the keys with
01:30:05.100 | Then we have this dropout which is zero. I never saw it used in
01:30:10.780 | In PaliGemma, but I believe there are other SigLIP models that use it, so they put it here
01:30:15.580 | But it you can think of it like non-existent for now
01:30:19.180 | And then we have these three linear layers called wq, wk and wv, which are
01:30:25.580 | Parameter matrices that are also present in the vanilla transformer
01:30:29.180 | We will see later what they are used for
01:30:31.260 | And then we have this output projection which in the paper of the transformer is called the wo matrix and we will see later
01:30:36.940 | What is it is used for?
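A sketch of this constructor (the default sizes are placeholders; the projection names follow the ones used in the video):

```python
import torch.nn as nn

class AttentionSketch(nn.Module):
    """Sketch of the multi-head attention block's constructor described above."""

    def __init__(self, hidden_size=768, num_heads=12, dropout=0.0):
        super().__init__()
        self.embed_dim = hidden_size
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.scale = self.head_dim ** -0.5    # 1 / sqrt(head_dim), used to scale Q @ K^T
        self.dropout = dropout
        # Three projections that turn the input sequence into query, key and value,
        # plus the output projection (the Wo matrix of the original transformer paper).
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)
```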
01:30:39.580 | Let's start by implementing the forward. So the forward method is this one
01:30:43.900 | What is the input of the forward method?
01:30:46.060 | Well, the input of the forward method of this attention mechanism is basically what
01:30:50.540 | Is the output of the layer normalization in this encoder layer class
01:30:55.580 | So the output of the layer normalization is fed to this self-attention mechanism
01:31:00.000 | So it is something of this shape. So it's a batch size by non-patches by embedding dimension
01:31:08.380 | So what is does it mean? It means that we have a batch of images
01:31:12.220 | Each of these images is made up of some
01:31:14.460 | patches how many
01:31:16.780 | defined by this number non-patches
01:31:18.780 | And each of this patch is represented by a vector with the size embed dimension
01:31:24.700 | You can think of it as a vector of 1024 dimensions. I don't remember the exact number of dimensions right now
01:31:30.940 | You can also think as this non-patches as a sequence length
01:31:35.740 | So before we saw that a language model is made up of a sequence of tokens here. You can think of it as a sequence of
01:31:40.940 | Patches where the sequence length is this non-patches here
01:31:45.020 | The first thing that we do in the self-attention mechanism is we take the input and we run it through three
01:31:52.060 | Transformations one is called wq one is called wk and one is called wv and after we run it through these
01:31:58.140 | Transformations the output will become query key and values
01:32:02.300 | So let's do it
01:32:05.900 | And it's this stuff here
01:32:07.900 | So we take the input sequence, which is this hidden states and we run it through wq here. It's called the qproj
01:32:14.620 | Wk here is called the kproj w here is called vproj
01:32:19.020 | The shape of the tensor does not change. Basically. These are parameter matrices
01:32:24.960 | So they just add parameters to our self-attention that transform the input sequence so that they become query key and value
01:32:33.100 | So it's the query key and value is just a transformation of the input sequence. However
01:32:37.740 | In this case each token still is independent from the other
01:32:42.140 | So there has been no contextualization happening with the linear layers. So linear layers always treat each token
01:32:47.500 | Independently from the others just like the multi-layer perceptron each token in the multi-layer perceptron is expanded and then reduced
01:32:54.300 | Here, it's not even not expanded nor reduced. It's just transformed because the size is from embedding dimension to embedding dimension
01:33:01.980 | So it's just a transformation of the single token
01:33:04.780 | Why we want to do it? Because the self-attention mechanism needs to see the same sequence in three different ways as query key and value
01:33:12.620 | So we do three different transformations
01:33:14.620 | Later, we will see why they are called query key and values
01:33:17.820 | The second thing we do is basically we split this each of these tokens into smaller tokens
01:33:28.540 | How many smaller tokens based on how many heads we have and now we see why so let me do something strange
01:33:35.420 | Which is i'm not copying the entire line. I'm copying a part of it
01:33:40.380 | We take this query state
01:33:42.140 | Which is a tensor of batch size numpatches embedding dimension and we are splitting the embeddim dimension into smaller
01:33:49.500 | parts
01:33:51.100 | Called head dimension. How many of this head dimension we have? We have numheads
01:33:56.560 | Okay, let me copy it all otherwise, I think it's going to be confusing. Sorry
01:34:02.080 | We also have this transposition later. We will see how it works. We will visualize the tensor operations
01:34:09.040 | We do it for the query the key and value, let's do it and then we see what is it about
01:34:19.360 | So let's go to the slides
01:34:24.000 | So at the input of this fission transformer, we have a sequence of patches you can think of it as a sequence of
01:34:31.120 | vectors each vector made up of let's say
01:34:33.680 | 1024 dimensions or you can think of it as a
01:34:37.600 | Sequence of tokens in case we are working with the language model and each token is represented by 1024 dimensions vector
01:34:44.720 | The first thing that we do is we convert this input sequence
01:34:48.640 | Which we will call x into query key and value and we do it through three transformations. One is called
01:34:54.000 | Wq, one is called wk and one is called wv
01:34:56.800 | Which is basically a matrix multiplication
01:34:59.380 | Now if you look at the shape of the input sequence here, it's 4 by 1024
01:35:04.820 | So here you can see the input sequence is 4 by 1024
01:35:08.260 | Where 4 is representing the sequence dimension
01:35:12.320 | So how many tokens or how many patches you have and the hidden size represents how many what is the size of this embedding vector?
01:35:19.760 | We multiply it each of these with wq wk and wv
01:35:25.040 | Now if you look at the dimensions here wq wk wv they are
01:35:29.360 | The size is embedding dimension to embedding dimension. However here I have represented it as
01:35:35.040 | embedding dimension to 8 multiplied by 128 so
01:35:40.800 | The overall size is the same. So it's 1024 by 1024
01:35:44.340 | However, i'm splitting this second 1024 into eight groups and later we will see why
01:35:51.840 | so you can think of it as a
01:35:54.640 | matrix multiplication that takes a matrix multiplication between this tensor here 4 by
01:36:08.880 | 1024 and this other tensor, which is also 1024 by 1024
01:36:08.880 | However in which the second dimension is split into sub
01:36:12.080 | Groups, how many eight groups because eight is the number of heads we are going to work with
01:36:18.080 | each having 128 dimensions
01:36:20.900 | if you do this matrix multiplication, it is
01:36:23.760 | It will result in this output here. So basically it's a
01:36:27.680 | 1024 multiply this dimension here cancels out as you can see
01:36:34.480 | And then we have the second dimension that remains so in the matrix multiplication the inner dimensions cancel out and the outer dimensions remain
01:36:44.880 | You can if you are confused by this you can think of it like this. So it's like a 1024
01:36:49.140 | And it's
01:36:53.360 | 1024 nothing has changed. I'm just grouping the dimensions. So that's why it's possible
01:37:01.920 | But it this grouping is helpful. And now we will see why
01:37:05.840 | Let's visualize this tensor operation at the matrix level
01:37:10.000 | So when we do query this x multiplied by wq we have nx which is a 4 by 1024
01:37:15.540 | so it's a sequence of
01:37:18.480 | tokens, each token is
01:37:20.700 | 1024 dimensions
01:37:22.480 | And we are multiplying by a very big matrix, which is 1024 by 8 by 128. How to visualize this matrix?
01:37:30.080 | Well, this is a wq. So it's a parameter matrix
01:37:33.360 | It's also wq and wv. So they all have the same dimensions
01:37:37.780 | You can visualize this like this. You can think of it as a matrix made up of
01:37:42.400 | 1024 rows
01:37:45.740 | Each row is made up of smaller vectors
01:37:49.600 | How many smaller vectors? 8 of them and each of these smaller vectors is made up of 128 dimensions
01:37:58.180 | The overall size of this matrix is still 1024 by 1024
01:38:02.660 | But each of these let's say these vectors are split into 8 groups
01:38:08.020 | So that the output is also a matrix in which each of the
01:38:14.740 | Tokens is a split into multiple subgroups. So it's a matrix that is 4 rows
01:38:21.060 | So as you can see, this is 4 is the number of rows
01:38:24.900 | Each row contains 8 groups of smaller embeddings and each of these smaller embeddings is made up of
01:38:31.780 | 128 dimensions
01:38:34.260 | So why are we even doing this?
01:38:36.260 | With multi-head attention, basically what we want to do if we want
01:38:40.500 | The multi-head attention is a way to relate tokens with each other
01:38:44.980 | We don't want to relate tokens to each other by watching the full embedding of each token
01:38:53.060 | We want to do it with 8 different heads
01:38:56.420 | Such that each head works with a smaller part of the embedding of each token
01:39:02.020 | So the head number 1 will only watch the first 128 dimensions of each token in the entire sequence
01:39:11.300 | The head number 2 will watch the next group of 128 dimensions. So the dimension from
01:39:18.820 | 129 to 256 of each token
01:39:23.060 | So this head will learn to relate all these tokens by only watching this part of the embedding of this each token
01:39:28.340 | This head will learn to relate tokens by only watching this part of the embedding of each token
01:39:34.020 | And this last head will learn to relate tokens by only watching the last part
01:39:39.540 | The last 128 dimensions of the embedding of each token. Why?
01:39:49.620 | In many languages, a word may have a different meaning depending on the context in which it appears
01:39:56.180 | If we don't have multi-head attention because the multi-head attention we will see it later is based on what is known as
01:40:02.260 | What is a dot product?
01:40:04.580 | If we compute the dot product over the
01:40:08.500 | Full embedding of each
01:40:11.300 | Token, then there is only one way of calculating the dot product between two tokens
01:40:16.180 | Which is the full embedding of the first token with the full embedding of the second
01:40:21.060 | So there is only one way of relating two tokens with each other
01:40:24.740 | By splitting each token into smaller groups
01:40:28.420 | Each dedicated to one head. So this is head 1, head 2 and head 8 and all the intermediate heads are here
01:40:36.820 | We learn to relate tokens to each other differently because each head is watching different parts of the embedding of each token
01:40:44.020 | And this is useful for language modeling, for example, because in language modeling
01:40:48.260 | Especially for example in Chinese
01:40:51.380 | Each word may have different meaning depending on the context in which appears
01:40:55.460 | So it may be a noun in some context. It may be a verb in some other context or an adverb in some other context, etc
01:41:02.980 | So we hope that this head here, for example learns to relate this token as a verb
01:41:08.020 | This head here will learn to relate this token as a noun and this head here
01:41:12.820 | Maybe will learn to relate this token as an adverb or some other property that this token has
01:41:17.700 | And this multi-head attention also has another advantage
01:41:21.320 | Because the multi-head attention is based on dot products between tokens
01:41:24.980 | This head here will do the dot product of this first 128 dimensions of this token with the first 128 dimensions of this token
01:41:33.140 | And this head because it watches this part of the token embedding and this other head watches this part of the
01:41:40.340 | Embedding they can work independently from each other
01:41:44.020 | And so because they can work independently from each other this computation can be parallelized
01:41:48.920 | That's why in the Attention Is All You Need paper, when they talk about multi-head attention, they make this
01:41:58.260 | Drawing with multiple copies stacked behind each other, as you can see here, with the head dimension appearing here, which means that each of these heads
01:42:05.380 | Is computing the scaled dot-product attention in parallel
01:42:10.120 | With the other heads, because each of them is working with a different part of the embedding of each token
01:42:15.860 | So they can work independently from each other
01:42:17.860 | And this is what we are doing here. So we group this
01:42:22.100 | This the embedding of each token into multiple subgroups
01:42:27.560 | Each dedicated to one head because we want this multi-head attention to happen in parallel
01:42:33.500 | Because each head is working with a different part of the embedding of each token
01:42:37.960 | And so it it becomes
01:42:40.600 | Much faster because we can compute all this stuff in parallel
01:42:44.440 | anyway
01:42:46.360 | What we have done in the code is as follows
01:42:48.440 | So we have taken our input sequence now here for the drawing. I have chosen a 4 by 1024
01:42:54.840 | but in the code it should be
01:42:56.840 | Depending on how many patches we have so numPatches by embedDimension
01:43:00.860 | We have multiplied each of them by the Q K and V
01:43:05.000 | And then we split them here as you can see in the
01:43:09.240 | In multiple heads, so we add this head dimension here in my slide
01:43:15.560 | I just pretend I am multiplying directly with a
01:43:19.080 | Parameter matrix that is already split into multiple heads
01:43:23.240 | Why am I doing it differently here compared to the code? Because it will be useful later:
01:43:29.240 | Visualizing it this way will be useful when we will be
01:43:32.920 | Talking about the language model and especially we will be talking about grouped query attention
01:43:36.920 | Because with grouped query attention, we will see that the number of heads for the query
01:43:40.600 | Is much bigger than the number of heads for the keys and the values
01:43:45.240 | So here in the vision transformer the number of heads of the query key and values is the same
01:43:49.560 | So we don't use the grouped query attention and that's why
01:43:52.680 | We use the same number of heads for the query key and values
01:43:55.480 | Then we do this transposition and now we see what is this transposition
01:43:59.720 | So when you do this multiplication here, so you multiply the input by the Q projection. It will return the same
01:44:06.040 | input shape
01:44:08.600 | When you do this view, it will just split this last dimension. So this embedDimension into smaller parts
01:44:15.320 | So it will become
01:44:17.400 | Like this:
01:44:22.200 | numPatches by numHeads by headDim, so we are splitting
01:44:25.340 | The embedding dimension into these two smaller dimensions, numHeads by headDimension
01:44:31.180 | So basically, what is this headDimension? headDimension is the embedding full embedding divided by the number of heads
01:44:38.120 | So this one imagine this is 1024
01:44:40.380 | Then imagine this is 8
01:44:43.240 | Then this will be 128 because it's 1024 divided by 8
01:44:53.080 | Because we are not reducing the number of parameters or we are not throwing away anything
01:44:57.800 | We are just grouping differently each of these embeddings
01:45:00.940 | With this transpose here, we are exchanging the positions of two
01:45:07.560 | Dimensions, the dimension number one and the dimension number two, which are the numPatches and the numHeads dimensions
01:45:14.780 | So basically we end up with numHeads and then numPatches
01:45:20.040 | So this will be the output of all this expression. So it will be a tensor of this
01:45:25.880 | Of this shape batchSize numHeads numPatches headDim. Why are we doing this transposition? Let's see
01:45:36.680 | we have
01:45:38.040 | When we multiply by this wqwk and wv which is already includes the grouping. We are grouping each of these
01:45:44.360 | Vectors into sub groups each dedicated to one head
01:45:49.880 | Now what we have here is a sequence of tokens
01:45:53.080 | Each token is made up of eight group of embeddings. Each group of embedding is made up of 128 dimensions
01:45:59.960 | what we want, however
01:46:03.560 | because we want to compute the
01:46:05.560 | Multi head attention in parallel, which means that each head should be able to visualize
01:46:11.500 | The entire sequence but a smaller part of the embedding of each token
01:46:17.800 | We need to transpose these two dimensions. So we exchange the sequence dimension with the head dimension
01:46:23.580 | and a way to visualize this is this
01:46:30.120 | Let's do it. So we have this sequence of tokens each token is
01:46:35.560 | Divided into eight groups. Each group is made up of 128 dimensions. We want to convert it
01:46:43.320 | Into multiple sequences made up of only the part of the embedding dedicated to each token
01:46:49.480 | So when you do the transposition of these two dimensions here
01:46:52.920 | They become like this. So 8, 4, 128
01:46:56.620 | How can you visualize this matrix? You can visualize it as follows. It's a big matrix that contains eight smaller matrices
01:47:05.160 | each smaller matrices contains four tokens and each token contains
01:47:10.180 | 128 dimensions, which is exactly the dimensions
01:47:13.720 | That are dedicated to each of this head. So you can think of it as a sequence eight sequences
01:47:21.800 | where each sequence is made up of
01:47:24.760 | tokens, and each token contains only the part of the embedding dedicated to one of
01:47:31.720 | the eight heads
01:47:33.880 | So this sequence here will only contain the first 128 dimensions of each token
01:47:40.440 | This sequence here will contain the next 128 dimensions of each token
01:47:45.560 | And the last sequence here will be a sequence of four tokens and each token will be made up of the last
01:47:51.400 | 128 dimensions of the initial tokens
01:47:54.600 | Why are we doing this? Because now we can compute
01:47:59.720 | The multi-head attention using this stuff here
01:48:02.040 | Independently from this one independently from this one independently from this one
01:48:07.400 | because each head has a sequence of four tokens and each token is made up of 128 dimensions
01:48:16.220 | And we end up in what we saw here
01:48:19.720 | So we can compute the scaled dot-product attention using the query, key and values, where the query, key and values are not the entire
01:48:27.380 | Embedding of the token but are only the part of the token dedicated to that specific head
01:48:32.660 | So this head here suppose the head number one will be using the first 128 dimensions
01:48:38.180 | This second head will be using the second 128 dimension. The last head will be using the last 128 dimensions, etc
01:48:45.460 | That's why we did this transposition: because now we can treat each head
01:48:53.200 | Independently. Each head is working with the four tokens,
01:48:57.200 | Which is the sequence dimension, and each token is made up of the part of the embedding dedicated to that head
01:49:02.960 | And this is why we do this transpose here
01:49:06.240 | The next thing that we do in multi-head attention is well, we have this
01:49:11.200 | Query key and values. What should we do?
01:49:13.440 | We should do query multiplied by the transpose of the keys, divided by the square root of the head dimension
01:49:17.840 | And that's it. Yeah, so let's do it
01:49:22.560 | Let's calculate the attention weights, which is this one
01:49:26.720 | So we take the query
01:49:29.440 | Multiplied by the transpose of the keys where we are transposing the second and the third dimension
01:49:34.400 | What is the second and the third dimension?
01:49:36.320 | It's the numPatches and the head dimension, because the query is batchSize, numHeads, numPatches, headDim
01:49:43.200 | to multiply it with
01:49:45.520 | the keys we need to
01:49:47.520 | exchange the last two dimensions, otherwise
01:49:50.240 | we cannot multiply them. To multiply, we need it like this: we need
01:49:53.520 | This stuff here,
01:49:57.360 | Then the head dimension and the numPatches,
01:50:02.020 | Such that if you remember in the matrix multiplication the inner dimensions cancel out and the outer dimensions remain
01:50:10.320 | So the outer dimensions that remain are basically
01:50:15.200 | numPatches and numPatches,
01:50:18.080 | While this inner dimension, the
01:50:20.080 | Head dimension, will cancel out, and we will be left with numPatches by numPatches
01:50:25.540 | So the output of this multi head attention basically, it's a matrix that is numPatches by numPatches for each head
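As a hedged sketch of this step (shapes from the running example, names illustrative), the per-head score matrices can be computed with one batched matmul:

```python
import math
import torch

B, H, S, D = 1, 8, 4, 128            # batch, heads, patches, head_dim
query = torch.randn(B, H, S, D)
key = torch.randn(B, H, S, D)
scale = 1.0 / math.sqrt(D)           # 1 / sqrt(head_dim)

# [B, H, S, D] @ [B, H, D, S] -> [B, H, S, S]: one S x S score matrix per head
attn_weights = torch.matmul(query, key.transpose(2, 3)) * scale
print(attn_weights.shape)            # torch.Size([1, 8, 4, 4])
```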
01:50:32.160 | Let me delete this one
01:50:35.760 | So I know it's not easy to visualize it like this. So let's visualize it on the slides
01:50:41.440 | So what we are doing is we are multiplying the query with the transpose of the keys
01:50:45.280 | And then we are dividing by the square root of the head dimension, but we already computed that here: this scale is one over the square root of the head dimension
01:50:51.120 | And because it's already one over the square root, we just multiply by it, we don't need to divide
01:50:58.080 | So let's visualize in the slides how this multiplication works
01:51:03.120 | Okay, we already saw why we do the multi head attention because we want to parallelize the computation etc. So now what we are doing is we are
01:51:13.520 | for each head
01:51:15.280 | each head as we saw before is
01:51:17.280 | Made up of one sequence of embeddings where each embedding is not the full embedding of the token
01:51:23.920 | But it's a part of the embedding of each token. So it's a smaller embedding. Let's say
01:51:27.920 | So each head basically will do the following matrix multiplication when you do query multiplied by the transpose of the keys
01:51:34.400 | Each head is made up of a sequence of tokens
01:51:38.560 | And each token is not the full embedding of the token, but it's the first 128 dimensions of each token
01:51:45.440 | When we do the transpose of the keys each of these row vectors becomes a column vector as you can see
01:51:52.080 | And when we do this matrix multiplication for each head we will be getting this
01:51:59.120 | Matrix as output which is
01:52:04.280 | Sequence by sequence because as you can see when you multiply this matrix here by this matrix here
01:52:09.180 | You get four by four matrix as output because the inner dimensions cancel out
01:52:13.260 | What does this matrix represent?
01:52:16.860 | Each of these numbers represents the dot product of one token with another token
01:52:23.340 | So you can think of the rows as being the queries and the columns as being the keys
01:52:30.060 | This one here is the dot product of the first token of the queries. Suppose that these tokens represent
01:52:37.100 | The sentence "I love pepperoni pizza"
01:52:40.380 | Then this is the word "I", this is the word "love", this is the word "pepperoni" and this is the word "pizza"
01:52:47.740 | Then this number here represents the dot product of the word "I" with itself
01:52:55.020 | So the first query with the first key
01:52:59.340 | This one here represents the dot product of the first query with the second key
01:53:05.340 | This one represents the dot product of the first query with the third key
01:53:10.780 | And we do all the possible dot products as you can see here
01:53:15.500 | Now,
01:53:17.740 | what does this matrix represent? It represents, somehow, the relationship between two tokens
01:53:24.380 | So the bigger the dot product the more intense is the relationship between two tokens
01:53:29.340 | Actually, the exact definition comes later: we will see that we apply the softmax
01:53:33.120 | But you can think of the dot product as being how the self-attention mechanism is relating two tokens
01:53:39.420 | How intense is the relationship of these two tokens?
01:53:42.140 | Why do we have this square root of the head dimension as the denominator? Because
01:53:49.340 | We want to keep the scale of this dot product under control. Usually, when you train a model, you train multiple variants of it, for example
01:53:58.860 | you try multiple configurations, and in these configurations you
01:54:05.820 | Try a different number of heads
01:54:08.940 | You don't want the magnitude of these numbers to change between one try and the next one
01:54:14.220 | So basically, by dividing by the square root of the head dimension, you keep the magnitude constant
01:54:21.440 | Now, what is this matrix doing? This matrix tells us how two tokens are related to each other
01:54:29.440 | Now in language modeling we also apply what is known as the attention mask
01:54:35.760 | So we don't want the word I to be related to future tokens
01:54:40.000 | So usually we don't want to compute this dot product
01:54:42.560 | We don't want to compute this dot product and we don't want to compute this dot product, because we don't want the token "I"
01:54:47.520 | To be related to any other token, because there are no previous tokens
01:54:51.440 | We also don't want the word "love" to be related to the words "pepperoni" and "pizza"
01:54:56.560 | Because they come after it. But we do want, of course, the word "pepperoni" to be related to the word "love", so
01:55:02.080 | There should be a number here; we don't want to mask out this one
01:55:05.680 | This is called an attention mask
01:55:08.240 | And how do we apply that basically?
01:55:10.960 | If we don't want some interaction between token to happen
01:55:15.040 | we can
01:55:17.120 | Calculate the matrix as usual
01:55:18.640 | So query multiplied by the transpose of the keys and then we replace all the numbers all the relationships that we don't want
01:55:24.640 | With minus infinity. So here we can replace this number here with minus infinity
01:55:29.620 | Here we can replace this number with minus infinity and then we can replace this number with
01:55:36.160 | Minus infinity
01:55:41.200 | So that after we need to apply the softmax the softmax will convert each of these numbers into a
01:55:49.360 | probability score
01:55:51.840 | because we want the relationship of one token with other tokens to be
01:55:56.880 | Between zero and one and also we want each row to sum to one
01:56:01.760 | Later we will see why: when we do the contextualization, we are doing a weighted sum
01:56:08.880 | But let's forget about it for now
01:56:11.040 | Anyway, the point is we apply the softmax row by row. So if we don't want the relationship of two tokens to be
01:56:19.440 | Considered by the attention mechanism, we replace that particular dot product with minus infinity before we apply the softmax
01:56:26.560 | Because the softmax we saw before is an exponential
01:56:29.840 | It's e to the power of x when e is to the power of minus infinity
01:56:34.000 | It will become zero. So the output of the softmax will become zero for all the interaction that we didn't want
01:56:40.080 | So that's why we replace it with minus infinity
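As a small illustration (the vision encoder we are coding does not use a causal mask; this mirrors the language-model case described here), masking with minus infinity before the softmax looks like this:

```python
import torch

scores = torch.randn(4, 4)                       # query-key dot products
# True above the diagonal = future positions we do not want to attend to
causal_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
probs = torch.softmax(scores, dim=-1)            # row-wise, each row sums to 1
print(probs)                                     # masked entries are exactly 0
```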
01:56:42.260 | Now, let me put back whatever we had before
01:56:46.160 | Okay, so this is uh where we apply the mask
01:56:50.800 | So as you can see if we apply the mask before we apply the softmax
01:56:54.080 | It will replace with zero all the interactions that we don't want
01:56:56.960 | And this,
01:57:00.640 | what is it?
01:57:02.560 | This matrix here is known as the attention weights
01:57:05.280 | It tells us how intense the relationship between two tokens is, and
01:57:10.000 | This matrix here is calculated independently for each single head because here I show you only one matrix here 4 by 128
01:57:18.900 | But we have eight of them
01:57:22.400 | And each of them is calculated in parallel
01:57:25.840 | So you need to think that you have eight of this matrix if you have eight attention heads
01:57:31.200 | And in this case in the code, you can see that the output is a list of it's a batch
01:57:37.200 | Because maybe we have multiple images
01:57:39.440 | Each of these images is managed by multiple heads
01:57:42.960 | Each of these heads will learn to relate tokens differently
01:57:46.720 | So each of these heads will give us a numPatches by numPatches matrix or sequence by sequence matrix
01:57:52.000 | Where each of this number represents how this head is relating two patches with each other
01:57:59.440 | So now we have seen how to calculate this attention weights
01:58:02.240 | Which basically it's a matrix that tells you how two tokens are related with each other
01:58:06.400 | It's kind of a score of how the attention mechanism thinks two tokens are
01:58:10.320 | Related to each other
01:58:13.840 | We continue our journey
01:58:15.840 | The first thing we do. Okay, we verify the dimension of this matrix
01:58:19.520 | And then we apply the softmax the softmax as we saw before is a way to convert these attention scores into
01:58:28.560 | Numbers that are between 0 and 1 and also such that they sum up to 1
01:58:32.560 | And we do it with the softmax function, which is applied by rows
01:58:37.760 | And that's what this dimension parameter is for
01:58:41.200 | What is the meaning of this dimension parameter? It tells you along which dimension you want to apply the softmax
01:58:45.600 | So we are applying it to the last
01:58:47.760 | Dimension: you can think of this as the row dimension and this as the column dimension
01:58:52.960 | So if you apply it across all the columns, it means you are applying it row by row
01:58:57.920 | Then we have the dropout, but as I said before, we don't use the dropout, because
01:59:04.480 | I didn't see it in the parameters of PaliGemma ever being used. So we have it, but we don't use it
01:59:12.000 | And as you remember, the dropout basically works like this:
01:59:15.920 | With a probability p, it will set some activations to zero,
01:59:20.240 | So some numbers of this input matrix become zero, but we don't use it
01:59:23.680 | It only happens during training and it's a way to reduce overfitting
01:59:28.180 | But since it's not used here, let's move on
01:59:30.180 | The next thing that we do in the multi-head attention is we are multiplying this attention weights matrix with the v sequence
01:59:37.940 | the value sequence
01:59:39.940 | So we multiply this matmul means matrix multiplication
01:59:43.000 | We are multiplying this attention weights with the value states, which is the value sequence
01:59:48.500 | which is a transformation of the input sequence through this wv matrix and also by grouped by
01:59:55.860 | Heads, let's visualize this operation
02:00:00.260 | Let's go here
02:00:01.860 | so the output of the attention mechanism of the query multiplied by the keys is this matrix here where each number represents the
02:00:09.540 | How two tokens are related to each other by applying the softmax this number become between zero and one in each row
02:00:16.020 | And also in such a way that they sum up to one
02:00:18.740 | So here you can see it's 1.0 because there is only one number here. It's 0.4 and 0.6
02:00:25.140 | So they sum up to one and here is 0.2, 0.4, 0.4. So they sum up to one etc, etc
02:00:31.140 | Now, when I say that these numbers represent the intensity of how the attention mechanism relates two tokens, it is because, when we multiply
02:00:39.720 | This matrix here, which is in the code is written as attention weights
02:00:45.220 | We multiply it by the v matrix. So the v sequence for the value sequence
02:00:52.500 | We are computing a weighted sum. Why?
02:00:55.460 | When we do this matrix multiplication
02:00:57.780 | We are multiplying for example a 4 by 4 matrix by a 4 by 128 matrix
02:01:03.860 | Where each of this v matrix is one for each attention head just like each of this matrix here
02:01:10.340 | Attention weights is one for each attention head. So each of these attention heads will be doing this
02:01:15.060 | Product in parallel. So each attention heads does query multiplied by the transpose of the keys in parallel the softmax in parallel
02:01:23.300 | and this
02:01:25.520 | multiplication with the v matrix in parallel
02:01:28.020 | I mean not these operations in parallel. It's the attention heads that work in parallel. The operations are sequential, of course
02:01:39.140 | What is the output of this
02:01:41.780 | Product it's a 4 by 4 multiplied by 4 128. So the output is a 4 by 128 because the inner dimensions cancel out and
02:01:49.140 | the outer dimensions remain
02:01:51.140 | Let's analyze this output matrix here
02:01:54.500 | So it will be a matrix with four tokens each token represented by not the full dimensions
02:02:01.140 | But because we are working with multi-head attention each head will have a smaller part of the embedding of each token
02:02:07.460 | So it will have 128 dimensions in case we have eight heads and the embedding dimension is 1024
02:02:13.240 | this first number here will be the
02:02:16.900 | Will be the dot product of the first row of this matrix with the first column of this matrix
02:02:24.500 | And as we can see from this row here
02:02:27.780 | All the values are zero except the first one
02:02:32.180 | which means that only this token here will contribute to the output here, which means that this and
02:02:38.660 | The second number in this matrix here
02:02:41.220 | So this stuff here will be the dot product of the first row of this matrix with the second column of this matrix
02:02:48.500 | But most of the values here are zero except the first one
02:02:52.420 | Which means that only this token here will contribute to this second number here
02:02:57.140 | So all the dimensions in this row will be contributed only by the first token
02:03:02.100 | It will be each
02:03:05.120 | Dimension of the first token multiplied by the number one,
02:03:09.620 | Because all the other tokens will be multiplied by zero, zero and zero
02:03:14.900 | Let's look at the second row of this matrix here this one here the first number
02:03:20.900 | So the first dimension of the second row of the output
02:03:24.660 | Matrix will be the dot product of the second row of this matrix with the first column
02:03:31.860 | The first two numbers are non-zero and the second two numbers are zero
02:03:35.700 | Which means that only the dimensions of the first two tokens will contribute to this output embedding
02:03:41.400 | For each of these dimensions
02:03:43.460 | So for all the dimensions here will only be contributed by the first two tokens because all the other tokens
02:03:48.980 | Whatever there is now the number is here
02:03:52.100 | They will be multiplied by zeros. So they will not contribute to this output embedding
02:03:56.660 | That's why we can say that this is a contextualized embedding
02:04:00.200 | In which the contribution to this contextualization only comes from the first two tokens
02:04:06.740 | How are these two tokens contributing? Well, each of the numbers in the second
02:04:12.580 | Token will be multiplied by 0.4 and each of the numbers in the first token will be multiplied by 0.6
02:04:20.820 | You can see this as the first token contributing
02:04:23.880 | 60 percent of the information to this contextualization and the second token contributing
02:04:30.900 | 40 percent to this contextualized embedding
02:04:33.880 | And you can do the same for the third output
02:04:37.300 | So this output here the first number will be the dot product of this third row
02:04:42.740 | Multiplied by this first column and as you can see here, we have a zero because of the causal mask
02:04:50.100 | Which means that only the first three tokens will contribute to the third embedding here
02:04:55.220 | How much each token will contribute? Well, it depends on how are these numbers distributed?
02:05:00.440 | The first token will contribute 20 percent. The second token will contribute 40 percent and the third token will contribute also 40 percent
02:05:08.740 | So that's why, when we talk about the attention weights matrix, we say that
02:05:13.460 | this matrix,
02:05:15.540 | produced by the attention mechanism,
02:05:18.500 | Is telling us how intense the relationship between two tokens is, so that the more related a token is, the more it will contribute to the output
02:05:26.020 | embedding
02:05:28.260 | So if the words, let's say,
02:05:30.500 | "pizza" and "I" are
02:05:33.120 | Very related to each other, then the embedding of the word "I" will contribute the most to the output embedding of this fourth
02:05:41.220 | contextualized position
02:05:44.660 | So it means that the fourth contextualized position will be, say, 40 percent based on the information
02:05:50.660 | contained in the token "I", 20 percent on the information contained in the word "love", and
02:05:55.860 | 30 percent on
02:05:58.180 | The word "pepperoni", etc., etc.
02:06:00.980 | So this is why it's known as a weighted sum because you are
02:06:05.460 | Summing the contribution of each token if it's not masked out
02:06:11.540 | Weighted with the attention score
02:06:13.620 | associated by
02:06:16.100 | Calculated using the attention weights matrix here and we do this for each of this head in parallel
02:06:22.200 | So each head is watching a part of the embedding of each token and it's learning to relate them differently and then doing this weighted
02:06:30.100 | sum differently
02:06:32.180 | And each head will
02:06:35.060 | Output
02:06:36.500 | a list of contextualized embeddings, but each of these contextualized embeddings will not be a full token
02:06:43.060 | It will be part of the full token, and now
02:06:46.180 | We see how we can merge the result of this multi-head attention
02:06:50.500 | And for that we need to look at the original paper. So if you look at the original paper
02:06:54.660 | We calculated this multi-head attention in parallel. And how can we merge the result of this multi-head attention?
02:07:02.340 | Well, we go here and we basically concat these heads
02:07:06.820 | So we take the output of the first head we concat it with the next we concat with the third head with the fourth
02:07:13.460 | The fifth etc, etc
02:07:15.300 | All the heads, until we get the full dimension of the original token back, because each head is made up of,
02:07:22.340 | in our case, suppose, 128 dimensions. So this will be the first 128 dimensions, then the next 128, then the third 128, etc.,
02:07:30.100 | Until the last 128 dimensions, so we get the 1024 dimensions back
02:07:34.820 | And we do this stuff. Let's go back here
02:07:39.460 | here, so
02:07:42.180 | each head
02:07:44.180 | Will return a contextualized embedding for each position, but it's a contextualized
02:07:50.040 | Embedding
02:07:53.940 | That does not include all the original token contextualized but a part of it because each head is working
02:07:59.940 | In parallel with a part of the embedding of each token, then we concatenate them. So
02:08:05.140 | What we do is we basically we want to arrive to this stuff here. So we have a contextualized embedding
02:08:12.360 | Here one for each of the heads
02:08:15.300 | Okay, first we need to do, I believe, a transposition. We need to transpose back, because before
02:08:23.300 | We transposed, right? So
02:08:25.300 | We put the head dimension first and then the sequence dimension
02:08:29.460 | So now we need the sequence dimension first again, and then the head dimension after,
02:08:33.380 | so that
02:08:35.860 | We go from this configuration
02:08:37.860 | Which is for each head. We have a contextualized list of tokens
02:08:43.220 | We want to get a list of tokens in which each
02:08:48.500 | Head is contributing its 128 dimensions, which are contextualized
02:08:53.000 | Embeddings, smaller embeddings, let's say
02:08:56.800 | So let's do this transposition also in code
02:08:59.700 | I believe it's here. So I think there is another checking of the output dimension
02:09:06.420 | We transpose back
02:09:10.820 | So we do this transposition back. So we did the first transposition here to exchange the
02:09:16.660 | Number of heads with the sequence dimension. Now we transpose back
02:09:19.780 | So we go back to numPatches and then numHeads
02:09:24.100 | So it's a sequence, where each element is made up of
02:09:28.180 | Eight groups (or numHeads groups), and each
02:09:31.140 | Group is made up of headDim dimensions
02:09:35.000 | We do this contiguous because we want to reshape. Okay, it doesn't matter,
02:09:40.260 | You don't have to know why we do this contiguous, but basically
02:09:45.920 | Contiguous means that we want
02:09:47.920 | The tensor to store its information in memory in a contiguous way, so that the next operation that we are going to do,
02:09:55.600 | the reshape,
02:09:57.920 | Does not require any computation, because when you do a reshape or a view of a tensor,
02:10:04.240 | There is no change in the memory layout of the tensor
02:10:08.880 | Actually, the PyTorch will just change what is known as the stride of the tensor
02:10:14.480 | So if you go to a tensor
02:10:16.480 | We are going a little off
02:10:18.960 | off topic, but
02:10:20.960 | There is this thing called the stride which tells you how
02:10:24.080 | To go from one dimension to the next without changing the layout of how this tensor is allocated in the memory
02:10:30.560 | So when you do a view
02:10:32.480 | or a reshape
02:10:34.480 | The PyTorch will just change these numbers on the stride. Okay
02:10:38.640 | I will do another video on how this works
02:10:41.600 | But anyway, this contiguous allows us to have this tensor all in memory as one contiguous memory allocation,
02:10:48.160 | So that this reshape operation can be done
02:10:50.720 | Without
02:10:54.460 | computational overhead
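As a small aside, here is what strides look like in practice; this is only an illustration of why the contiguous call is needed before the reshape:

```python
import torch

t = torch.arange(6).view(2, 3)
print(t.stride())                        # (3, 1): jump 3 elements to reach the next row
tt = t.transpose(0, 1)
print(tt.stride())                       # (1, 3): same memory, only the strides changed
print(tt.is_contiguous())                # False
print(tt.contiguous().is_contiguous())   # True: the memory is now laid out row-major
```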
02:10:56.320 | Now let's get back on track. So
02:10:58.320 | We did a reshape operation in the slides
02:11:02.240 | So after we have to do a reshape, we did the transpose operation and now we need to do a reshape operation
02:11:07.520 | So the transpose operation basically allow us to get again at the first dimension the sequence dimension
02:11:13.840 | Then the grouping of the group of
02:11:16.640 | dimensions of each token
02:11:19.260 | And each group contains 128 dimensions. Now, we need to concatenate them. How can we concatenate them?
02:11:25.680 | Well, we just want to merge these heads again together into one single token
02:11:30.480 | And we do that with this
02:11:33.660 | Reshape operation. So with reshape, basically, we are going from numHeads * headDim to embedDim, which
02:11:40.460 | In this case is 1024
02:11:44.060 | So, how does it work the reshape basically the
02:11:47.900 | The PyTorch
02:11:51.580 | will take each of these
02:11:53.580 | Groups and will just merge them. So it will just concatenate them with each other. So instead of being a
02:12:01.020 | matrix that contains sub-arrays where each sub-array contains multiple sub-arrays and each of these
02:12:08.140 | sub-sub-array contains 128 dimensions, it will just become a matrix that contains one array that is made up
02:12:15.900 | 1024 dimensions, which is the concatenation of all these heads
02:12:20.940 | So this is how we merge the information of all this multi-head attention that was done in parallel into one single
02:12:30.780 | Token that is a contextualized version of the initial token
02:12:34.460 | So we as you can see we got back the initial shape
02:12:38.460 | So we started with before at the beginning of the multi-head attention. We started with
02:12:42.780 | 4 by 1024
02:12:45.660 | Input sequence and we end up with 4 by 1024
02:12:49.840 | Sequence
02:12:53.100 | There is one last part that we need to do
02:12:55.420 | that is
02:12:57.960 | Multiplication with this WO. So if you look at this concatenation that we have done
02:13:03.000 | The concatenation basically takes the this tensor this first token here
02:13:09.160 | Is just the concatenation of the first 128 dimensions, which are the output of the first head then the second 128 dimension
02:13:16.760 | Then the third 128 dimension and then the last 128 dimension. In total there are 1024 dimensions
02:13:23.880 | But there has been no mixing between the result of these heads. So it's just a concatenation of multiple
02:13:31.080 | of independent calculations
02:13:33.800 | Each calculation done by one head independently from the others
02:13:37.720 | But we want the token to not be a concatenation of independent calculations
02:13:43.720 | We also want to kind of mix the result of these heads with each other
02:13:48.600 | And the mixing happens when you do this multiplication by WO. The WO matrix is a matrix that is
02:13:54.920 | embedding size by embedding size
02:13:57.720 | Which basically
02:14:00.680 | As you can see does not change the shape of the input. So we have
02:14:03.400 | The input of this WO will be 4 by 1024. We multiply it by 1024 by 1024, so the result has the same shape as the input
02:14:12.360 | But it will be
02:14:15.480 | because
02:14:16.520 | Let's look at this number here. This number here is the dot product of the first row
02:14:21.320 | So the first token with the first column of this matrix
02:14:24.840 | And the first column of this matrix is 1024 parameters. So all of these heads, so the
02:14:31.880 | 128 dimensions of the first head
02:14:34.500 | 128 dimensions of the second head, etc, etc
02:14:38.520 | Will all participate in the same dot product, giving one single number here
02:14:43.880 | So there has been a mixing of the results of these heads. If we don't multiply by WO,
02:14:49.800 | There is no mixing between the result of each head which happened independently in parallel
02:14:55.160 | And that's why we multiply it by WO
02:14:57.400 | So we don't want each token to be a contextualized version of multiple subtokens each calculated independently from each other by the multi-head attention
02:15:05.400 | We want of course it to happen because we want to parallelize
02:15:08.760 | But then we want to mix the result of this multi-head attention and we do that by multiplying by WO
02:15:14.620 | and now let's do it so
02:15:16.620 | For now, we just merge. So this reshape is basically doing the concat that we saw before in the attention paper
02:15:23.020 | Now we do the multiplication with the WO which is this stuff here. So out projection
02:15:28.540 | It won't change the shape of the tensor that is input to it
02:15:32.700 | And then we return it along with the attention weights. Actually, we will not be using the attention weights
02:15:37.020 | And now finally we have implemented the multi-head attention
02:15:41.340 | I just realized we forget something guys. So
02:15:43.900 | We forgot to implement this encoder. So we created the layer of the encoder, but we didn't create the encoder itself
02:15:51.660 | So what we created basically in this vision transformer is this stuff here. So let me open the slides
02:15:58.160 | We created one single layer like this one
02:16:02.700 | But we didn't create the sequence of these layers because an encoder is a sequence of these layers. So let's do it
02:16:08.620 | It's it's very simple. So this is a single layer
02:16:11.900 | But we need to create a sequence of them because we apply one after another such that the output of one is
02:16:17.180 | Used as input for the next one. It's a very simple class. So let's create it
02:16:21.340 | Let's create the
02:16:25.340 | Constructor so it's just very simple. It's a okay
02:16:28.620 | We save the configuration then each we create a sequence of layers where each layer is this encoder layer to which we pass the configuration
02:16:37.260 | How many we create based on how many layers it should have so the transformer layers
02:16:41.820 | And the forward is very simple. I can just copy it all. It's basically says, okay
02:16:48.780 | We have the input we give the input to the first layer and the output of this layer becomes the input to the next one
02:16:55.820 | So we do a for loop and then we return the the output of the last layer
02:17:00.380 | This is a very simple and as you can see between each layer, there is no change in the shape of the tensor that is fed
02:17:07.020 | I believe I think we have
02:17:10.380 | Coded all of the cglip. So which is our vision transformer
02:17:14.560 | You may think that I have lied to you by saying that at the beginning when we were talking about contrastive learning you
02:17:21.820 | Okay, actually, let's look at it. Otherwise, we will have the doubt so
02:17:27.580 | When we were talking about contrastive learning
02:17:29.580 | We were talking about generating one single embedding for each image
02:17:35.180 | But here we are generating a sequence of contextualized embedding
02:17:38.880 | So how can the model generate one single embedding
02:17:43.680 | For a single image? Well, the transformer
02:17:47.600 | Is a sequence-to-sequence model
02:17:49.740 | So you give it a list of patches as input and it will give you a sequence of contextualized patches as output
02:17:56.540 | When working with something like clip, for example, if you want only one single embedding for each image
02:18:02.940 | You can just take the first output contextualized embedding from the transformer as a representative for the whole image
02:18:09.820 | Because it will force the model to put all the information in the first contextualized embedding
02:18:16.060 | So that's one way to do it
02:18:17.820 | Another way is to just take the average of all the output embeddings by the transformer to generate one single embedding
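As a small sketch of the two pooling options just mentioned (a CLIP-style usage, not something PaliGemma itself needs):

```python
import torch

contextualized = torch.randn(1, 16, 1024)            # [B, num_patches, embed_dim]

first_token_embedding = contextualized[:, 0, :]      # take the first contextualized patch
mean_pooled_embedding = contextualized.mean(dim=1)   # or average all of them
```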
02:18:24.540 | Anyway, this was just a closing note before we move to the next part, which is our language model
02:18:30.620 | So let's go back to the architecture, which is here
02:18:33.180 | So we have coded this part here the vision encoder so we feed an image it will be
02:18:40.780 | The vision encoder extracts some patches each of these patches become an embedding to this embedding
02:18:47.100 | We add a positional encoding which is learned
02:18:49.740 | We send it to this magic box called the transformer layer, which will contextualize them
02:18:54.540 | We take the output of this contextualization and this becomes our
02:18:58.380 | Image embeddings
02:19:02.140 | Now before we can feed it to the language models
02:19:04.860 | These embeddings may not be of the same size of the embeddings used by the text layer
02:19:10.300 | So we will need to introduce this linear projection
02:19:14.460 | So in the next part of the video, we are going to code the language model including this linear projection here
02:19:20.540 | And we will learn how to merge these tokens the image tokens and the text tokens
02:19:25.740 | Okay, let's start
02:19:28.540 | So the next part that we are going to code is basically how to load the image from the disk to convert it into a tensor
02:19:35.100 | And also how to tokenize the text
02:19:37.500 | And we need we will see that we need to do the preparation of the text has to be done in a particular way
02:19:43.660 | Let's see actually why we have it has to be done in a particular way. So let's open the slides
02:19:48.220 | Oops, I think I closed it. So let me open it again
02:19:51.660 | All right
02:19:54.060 | So as you can see, we need to find a way to combine the image tokens with the text tokens
02:20:00.060 | So first we need to tokenize the text
02:20:02.220 | But we need to create some placeholders for where we will put the image
02:20:09.260 | Tokens before the text token. So I will use the term image tokens and image embeddings interchangeably
02:20:15.760 | because you can think of the image embeddings as kind of tokens that represents the image or and the
02:20:21.900 | Text are the embeddings that represent the text that is the prompt from the user
02:20:26.700 | so the first thing that we need to do is we need to learn how to load this image into a
02:20:32.300 | tensor, because, as you can see from our SigLIP code, the input to SigLIP is
02:20:39.020 | A tensor that has the channel, the height and the width dimensions,
02:20:42.720 | which is then
02:20:44.780 | transformed into patches and contextualized, etc, etc
02:20:47.420 | Then we need to tokenize the text. We need to create this list here
02:20:52.540 | But we we will create first a list of tokens
02:20:56.140 | Each corresponding to the text tokens and then we will add some placeholders for where we will put the image tokens
02:21:03.500 | and then it will be the transformer that will
02:21:08.300 | Take these placeholders and replace it with the image. So
02:21:11.180 | I know it's a lot of things to remember. So don't worry. Let's code it and we will see it step by step. So let's go
02:21:18.460 | We create a new file called, let me check here processing
02:21:24.560 | processing.py
02:21:27.800 | We do some imports
02:21:35.820 | We create these two constants and later we will see why we need them
02:21:40.140 | For now, just create them
02:21:43.340 | Okay, let's start from the beginning. So let's create this class called the PaliGemmaProcessor
02:21:48.960 | This stuff here
02:21:53.020 | It has a constructor
02:21:55.020 | Which is this stuff here
02:21:57.020 | It will take as input the tokenizer, how many image tokens
02:22:02.460 | We need to generate for the image, and what is the image size that this particular PaliGemma checkpoint works with
02:22:08.780 | We save it. We save these two values and then what we do
02:22:13.660 | We need to add some special tokens to our tokenizer. So now I show you why we need to do it and how it works
02:22:21.100 | So the tokenizer that PaliGemma is using is the tokenizer of the Gemma model
02:22:26.940 | But the tokenizer of the Gemma model was not created
02:22:30.320 | With the special tokens for the image. So what they did was, they basically created these
02:22:36.940 | additional tokens, because
02:22:40.620 | PaliGemma can be used for multiple purposes
02:22:43.640 | So what we saw here in my slide is basically here is trying to extract information from an image
02:22:50.140 | So we have an image, we have a prompt, and PaliGemma, which is basically the Gemma model here,
02:22:57.100 | will decode the response by
02:22:59.500 | interpreting the
02:23:01.420 | Prompt and using this one as additional information for the prompt
02:23:05.180 | the image
02:23:07.180 | PaliGemma can actually do much more than this. PaliGemma can also do image segmentation, so it can
02:23:14.220 | Segment a part of the image, for example this leg here
02:23:17.740 | It can do object detection
02:23:19.980 | So it can detect all the instances of, for example, a tree
02:23:24.220 | If we do object detection for trees, it will probably give us this, okay,
02:23:29.020 | this box here, a bounding box, telling us that this is a tree
02:23:32.380 | If we ask it to detect all of them, it will give us two
02:23:36.380 | Boxes, one for this one, one for this one, etc.
02:23:39.580 | So PaliGemma can do a lot of this, and the way it does it is by using special
02:23:44.380 | tokens
02:23:46.380 | For segmentation they are called segmentation tokens, and for object detection they are called location tokens
02:23:53.580 | But we will not be using them. Our goal here is just to run inference with PaliGemma
02:23:59.340 | So we will not be working with the object detection or object segmentation
02:24:03.120 | But if you want more information on how these tokens work, there is a very nice article
02:24:08.940 | Not only this one from google. So here in google they say
02:24:12.300 | That polygamma uses the gamma tokenizer, but they extend it with these further tokens that are used to tell
02:24:19.580 | In the output of the model, where is the segments?
02:24:23.420 | where is the bounding box position that it has detected or where is the
02:24:26.860 | location of the
02:24:29.580 | Of the segmentation mask that the model has detected
02:24:33.980 | Another article that I recommend is the Hugging Face blog article about
02:24:37.980 | PaliGemma. Let me find it. I believe it is this one here
02:24:43.180 | In which they describe how this attention masks work
02:24:47.100 | So as you can see, PaliGemma can detect the cat and will give us this output, which is made of loc
02:24:52.460 | Tokens, as you can see, loc0094, loc0256,
02:24:57.100 | And these numbers, 0094, 0256, tell us the positions of the
02:25:03.820 | Corners of this bounding box here
02:25:09.180 | But we will not be using these tokens
02:25:11.900 | Here, because we are only interested in using PaliGemma as a conditional model for generating an output
02:25:18.060 | Conditioned on the image that we feed in
02:25:22.540 | But anyway, because the tokenizer
02:25:24.540 | Used by PaliGemma adds these special tokens,
02:25:27.740 | We also add them here. How to add them and how many to add is described in this article
02:25:32.460 | You can see here. And so basically we have
02:25:34.700 | 1024 location tokens for object detection and then 128 tokens for object segmentation
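A hedged sketch of how these tokens can be added to a Hugging Face Gemma tokenizer instance (this mirrors the constructor described here; the exact lines follow the video's processing code and are an assumption, not an official API requirement):

```python
IMAGE_TOKEN = "<image>"

# The placeholder token that will later be replaced by the image embeddings
tokens_to_add = {"additional_special_tokens": [IMAGE_TOKEN]}
tokenizer.add_special_tokens(tokens_to_add)

# 1024 location tokens for object detection, 128 tokens for object segmentation
EXTRA_TOKENS = [f"<loc{i:04d}>" for i in range(1024)]
EXTRA_TOKENS += [f"<seg{i:03d}>" for i in range(128)]
tokenizer.add_tokens(EXTRA_TOKENS)

# We will add the BOS and EOS tokens ourselves when building the prompt
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
```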
02:25:42.720 | Okay, we save the tokenizer
02:25:47.100 | Then what do we need to do?
02:25:48.860 | We also need to create this constant called image token
02:25:52.380 | What is this constant? Basically,
02:25:55.980 | when we
02:25:57.880 | process our text with the Gemma tokenizer, the Gemma tokenizer will, of course, only generate
02:26:04.220 | The tokens for the text. But later we also need to insert the image tokens among these tokens
02:26:11.820 | So what we do, basically, is we insert some placeholder tokens
02:26:17.260 | That will then be replaced by the embeddings
02:26:19.760 | Extracted by the vision encoder, and the placeholder token that we will be using is this image token here
02:26:26.300 | And we add it as well;
02:26:28.860 | we
02:26:32.300 | add it here in the tokenizer
02:26:34.300 | Now, how do we use this PaliGemmaProcessor? The PaliGemmaProcessor is a special class that, given a
02:26:40.780 | Text, which is the prompt of the user, and an image, will load the image,
02:26:46.840 | Preprocess it (so resize it, rescale it, whatever the vision model needs),
02:26:51.800 | And will create the
02:26:54.520 | Text tokens with the placeholders for the image tokens. So let's do it
02:26:59.000 | We create this
02:27:02.120 | Method here, the __call__ method. Why do we create the __call__ method? Well, basically this allows
02:27:06.920 | the instance of the processor
02:27:10.040 | To be called like a function
02:27:12.840 | So when we create the processor, we will create it like this, PaliGemmaProcessor, and then we can use it like this,
02:27:19.720 | Passing the arguments here. So this is why we implement the __call__ method
02:27:24.040 | And the call method takes as input a list of text and the list of images
02:27:28.600 | but we will actually only accept one text and one images because I don't want to deal with the
02:27:34.200 | Padding otherwise, it will complicate our code. Our goal is not to make it universally
02:27:39.420 | Perfect. Our goal is to learn by doing and how it works. Actually, this is this code will be usable
02:27:45.160 | So we will actually run the inference later, but it will only work with one image and one prompt at a time
02:27:50.200 | It doesn't matter because later we can later
02:27:52.760 | I will try to make the code for fine-tuning this model
02:27:55.400 | And we will see that we will change this code a little bit to to accommodate for the padding
02:28:02.680 | Anyway, we need to process these images and we will use a special method called process images
02:28:09.400 | So if we take each of these images and we need to resize it
02:28:12.920 | We resize it to the image size that is accepted by this polygamma version. So
02:28:18.120 | As you can see the weights of polygamma
02:28:20.680 | Actually show there is multiple weights, but this is two to four only resizes the images to the size
02:28:28.100 | 124 by 224 and generates 128 tokens for this in each image
02:28:33.780 | then we rescale this image and later we will see why we do it and then we
02:28:38.100 | We normalize it using the mean and the standard deviation of ImageNet
02:28:43.540 | It's not really the ImageNet mean and standard deviation, but later we will see how it works
02:28:47.620 | Anyway, suppose that this method here will load the image will rescale it will normalize it etc and convert it into
02:28:57.460 | A tensor that can be then processed by the vision model
02:29:00.340 | We do it here so
02:29:04.980 | We create here a tensor. So because this will
02:29:08.020 | Return a list of tensor. We need to create a one single tensor with the batch size
02:29:13.540 | So we stack them stack basically means that if we have a list of tensor, it will create one single big tensor
02:29:18.980 | Where it adds one
02:29:21.620 | another dimension called the batch size one
02:29:25.220 | So instead of becoming a list of tensor it will become one big tensor
02:29:28.500 | This is a NumPy tensor it is converted into a torch tensor
02:29:33.860 | And then we
02:29:37.620 | Create the input to the model. So later we will expand this method. So now I just create them
02:29:44.020 | What is this method going to do? Well, this method is going to
02:29:49.780 | Let's check here. It's going to create the tokens of the text and create the placeholder for the image tokens
02:29:56.280 | and then
02:29:59.620 | We tokenize it using the placeholder tokens for the image
02:30:04.340 | And then we return it. So now let's expand
02:30:08.100 | This stuff I know that I have copied a lot of code. Now, I will explain it one by one
02:30:13.460 | So let's start at input. We have a list of text and the list of images. Let's process these images
02:30:19.300 | So let's create this process image function
02:30:21.620 | What is it going to do?
02:30:24.260 | Let's copy it. It's very simple actually
02:30:26.980 | Okay, the process image takes as input a list of images what is the size that we want of these images
02:30:35.700 | What is the kind of resampling that we want to do when resizing this image? You can do linear, you can be cubic, etc
02:30:44.160 | Rescale factor if we want to rescale this image and
02:30:47.360 | the normalization mean and the standard
02:30:50.580 | And this has the same meaning as the normalization that we do in the neural networks. So we want the
02:30:57.280 | The image no matter what it represents to always have the same distribution more or less
02:31:03.120 | So centered on zero and the variance of one
02:31:05.680 | And the way we do it is basically we take the image
02:31:10.000 | Values so the tensor we subtract the mean of all the images that we have in our data set
02:31:16.240 | And usually we use the mean of the image net data set and the standard deviation of the image
02:31:21.840 | Net data set
02:31:24.480 | I don't know why in the Hugging Face implementation they use 0.5, because it's actually not really 0.5
02:31:28.880 | Each of these numbers is very close to 0.5, but not exactly, so maybe it works anyway
02:31:35.920 | And we have one for each channel of the image. So one for R, one for G and one for B
02:31:40.560 | So what is this
02:31:44.000 | Function going to do first it resizes the image by using this resampling method
02:31:47.760 | Then it will convert the image into a numpy array
02:31:50.400 | Then it will rescale it so that the pixel values instead of being between 0 and 255 will be between 0 and 1
02:31:57.440 | Then it will normalize using the mean and the standard deviation of image net
02:32:01.280 | And then it will move the channel dimension to be the first dimension. So
02:32:06.320 | Instead of being a height width channel, it will become channel height width
02:32:11.120 | Let's implement this very simple method. So there is first the resize
02:32:16.980 | The resize is just going to resize the image using the
02:32:21.600 | methods already implemented by
02:32:24.720 | The PIL library,
02:32:27.120 | The one called the Python Imaging Library
02:32:31.520 | So it will take the image and it will resize it using this resampling method
02:32:36.160 | Then we have this
02:32:39.920 | rescale
02:32:41.360 | The rescale is just going to rescale the image
02:32:43.680 | So it will convert each pixel value instead of being between 0 and 255. It will rescale it into
02:32:49.920 | Between 0 and 1. Why? Because as you can see here, we pass a scale factor of 1 over
02:32:56.060 | 255. So that's why we are multiplying it by this scale
02:33:01.420 | The next thing that we are doing is normalizing
02:33:04.800 | normalizing means that we want the each of these values to be
02:33:09.340 | distributed like it's coming from a Gaussian of mean 0 and variance of 1 and we do it by
02:33:14.380 | Subtracting the mean and dividing by the standard deviation as you can see here
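Minimal sketches of the three helpers described above; the signatures are illustrative and follow what the video builds, not an official API:

```python
import numpy as np
from PIL import Image

def resize(image: Image.Image, size, resample=None) -> Image.Image:
    height, width = size
    return image.resize((width, height), resample=resample)

def rescale(image: np.ndarray, scale: float, dtype=np.float32) -> np.ndarray:
    # e.g. scale = 1/255 maps pixel values from [0, 255] into [0, 1]
    return (image * scale).astype(dtype)

def normalize(image: np.ndarray, mean, std) -> np.ndarray:
    mean = np.array(mean, dtype=image.dtype)
    std = np.array(std, dtype=image.dtype)
    return (image - mean) / std
```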
02:33:22.140 | I believe we have already implemented everything for the process images
02:33:25.980 | Now, let's go further. So we have these images we are processing them. So they are still a list of images
02:33:32.700 | We convert them into they are converted into a list of numpy arrays and we do that here
02:33:38.780 | As you can see first we convert them into numpy arrays then we rescale, normalize, transpose
02:33:44.160 | So we have a list of numpy arrays
02:33:46.860 | This list of numpy arrays is converted into a single tensor instead of being a list of tensor is becoming one big tensor
02:33:53.260 | And then we convert it into a torch tensor. This torch tensor
02:33:57.900 | Is the pixel values that will be fed to the model to the image encoder
02:34:03.100 | Now we need to take our text
02:34:06.060 | And we need to tokenize it but we need to tokenize it by already accommodating for the position in which we will put the image
02:34:13.500 | embeddings
02:34:15.980 | And we do that by processing each of these texts through this function called add_image_tokens_to_prompt, which, as the name implies,
02:34:23.260 | Will add the image token placeholders to the prompt
02:34:26.140 | And the way it's done is here
02:34:29.020 | It's very simple actually also
02:34:31.740 | We can
02:34:34.300 | Save it here. It's a long comment because I found a little bug in this one, but okay later I explain to you
02:34:40.300 | But basically we add some image token placeholders. How many of them? Well, depending on how many image
02:34:46.540 | Tokens this model needs in the case of polygama 224. We need 128
02:34:52.320 | tokens, I believe
02:34:55.420 | Oh, no, this is not this is the text tokens, I think it's 256 I remember correctly
02:35:01.120 | Later we can check. I think it's in the config.json. Let's go here
02:35:11.080 | Image tokens
02:35:14.460 | Then we add the beginning of sentence token and then we add the prompt of the user. It's called the prefix prompt
02:35:20.700 | How did I come up with this function I didn't come up with it I copied from
02:35:26.860 | Hugging face implementation, but how did hugging face come up with this actually?
02:35:31.340 | It's from the paper of polygama
02:35:34.300 | So if we go to the polygama paper, let's go here
02:35:40.780 | Here they show you how to prepare the input for the gamma model
02:35:44.060 | So we have a list of image tokens
02:35:47.740 | Then we have the prompt of the user that tells us what the language model needs to do with these images
02:35:54.380 | So if as you saw the example before in in the introduction
02:35:57.920 | Here the prefix is this one
02:36:00.380 | So we want the language model to tell us where is the photographer resting by looking at this image and the model will generate this output
02:36:07.500 | So this is called the prefix
02:36:09.900 | And the prefix is built by first taking, okay,
02:36:14.460 | The image tokens: we add them here, based on how many this particular size of PaliGemma needs,
02:36:21.580 | then we have the beginning of sentence token and this one then we have the tokens of the
02:36:27.260 | Prefix, which is the task that we want the language model to perform
02:36:32.140 | And then we have a separator the separator token is a slash n. So it's the new line
02:36:38.220 | new line character
02:36:41.740 | So we have this beginning of sentence token. So then we have the token the the task
02:36:47.100 | The the prompt by the user based on what task we want the language model to do and then we have the separator token
02:36:54.380 | Which is a \n. Now, in the paper, they say that they tokenize the
02:37:00.540 | separator token separately
02:37:02.540 | So the \n needs to be tokenized separately from the rest of the
02:37:06.940 | input, because we don't want the \n to be merged with the
02:37:13.260 | prompt by the tokenizer. So as you know, the tokenizer will convert a sequence of
02:37:19.580 | Characters into tokens and if in the dictionary of the
02:37:27.100 | language model there is a merged token. Suppose that we ask the language model: tell me where is the photographer
02:37:34.000 | and suppose that
02:37:37.900 | after it we have this new line. Suppose that in the vocabulary of the
02:37:43.260 | language model there is a token that is like this, so
02:37:47.020 | "rapher\n"
02:37:48.940 | the end of the word together with the \n, and it will become one single token
02:37:52.860 | So suppose that this one becomes the token number three, and then there is another token that is " photog"
02:37:58.620 | which becomes the token number five, and then the token "the" is another token, so it's the token number six, etc
02:38:05.180 | So we don't want the \n to be merged with whatever comes before it
02:38:09.340 | So they in the paper, they recommend to tokenize it separately. So that's why I I wrote this
02:38:14.860 | comment here to note that it should be tokenized separately, but I don't know why in the Hugging Face implementation they do it
02:38:21.580 | without tokenizing it separately
02:38:23.580 | It could be a bug or it could be some other indication that I am missing
02:38:27.660 | So I just note it for now; later I will investigate and probably ping the Hugging Face team
02:38:31.900 | But for now, we just need to think how we prepare the input
02:38:35.500 | So the input is prepared like this a number of input image tokens
02:38:39.500 | What is each of this image token? It's this placeholder token that we created here this image token
02:38:46.940 | how many of them depending on the size of the model and we have this beginning of sentence token and then we have the
02:38:52.220 | Prefix the prompt of the user and then we have the slash n. We take all of this and we tokenize it
02:38:59.180 | Using our tokenizer
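As a rough sketch of what this prompt construction looks like in code (the "<image>" placeholder string and the token count of 256 are assumptions for illustration, following the input format from the PaliGemma paper described above):

```python
def add_image_tokens_to_prompt(prefix_prompt, bos_token, image_seq_len, image_token="<image>"):
    # [image placeholder] * N  +  [BOS]  +  [user prompt]  +  "\n" separator,
    # following the input format described in the PaliGemma paper.
    return f"{image_token * image_seq_len}{bos_token}{prefix_prompt}\n"

# Hypothetical usage: for a 224-resolution model with 256 image tokens this yields
# "<image><image>...<image><bos>where is the photographer resting?\n"
prompt = add_image_tokens_to_prompt("where is the photographer resting?", "<bos>", 256)
```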
02:39:02.380 | And we return this stuff here. So we return this input
02:39:05.740 | Which is the input IDs and the attention mask that will be generated by the tokenizer
02:39:11.200 | In this case, we are not using any padding. So the attention mask will be just a list of ones
02:39:16.060 | So what is the input IDs? As you remember tokenizer converts the text into
02:39:21.580 | A list of numbers where each number represents the position in the vocabulary of each token
02:39:27.020 | So these are not embeddings. These are just input IDs
02:39:30.700 | So it's a list of numbers where each number represents the token position in the vocabulary
02:39:35.440 | So imagine our vocabulary is made up of words
02:39:38.940 | So the sentence "hello world" may be tokenized as follows
02:39:45.100 | so "hello world"
02:39:47.100 | may be tokenized as a list of, for example, three tokens
02:39:52.300 | the first one corresponding to the word hello
02:39:54.860 | then one corresponding to the space and then one corresponding to the word world
02:39:59.980 | Suppose it's the token number nine. So these are called input IDs. So it's not an embedding
02:40:05.100 | It's just one number for each token
02:40:07.740 | Then by the embedding layer, this will be converted into embeddings, which will be one
02:40:14.540 | Vector for each token. So with the suppose 1024 dimensions
02:40:19.440 | So this one will be for the first token
02:40:23.180 | 1024 dimensions then for the second token another 1024 dimensions, etc, etc, etc
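Here is a toy example of the difference between input IDs and embeddings, with made-up vocabulary indices and sizes:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 1000, 1024           # made-up sizes
embedding_layer = nn.Embedding(vocab_size, hidden_size)

# The tokenizer gives one integer per token: its position in the vocabulary.
input_ids = torch.tensor([[5, 3, 9]])          # shape [1, 3], e.g. "hello", " ", "world"
# The embedding layer turns each id into a vector of hidden_size dimensions.
input_embeds = embedding_layer(input_ids)      # shape [1, 3, 1024]
```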
02:40:29.340 | So this is how we prepare the input. So for now, we have resized the image converted into a tensor
02:40:35.740 | Then we have taken our prompt. We have added some placeholder tokens for the image then we have
02:40:41.980 | added the prompt of the user and then the \n character, as indicated by the PaliGemma paper
02:40:47.440 | And now our processor will return this stuff. Now, we need to understand what to do with this stuff
02:40:53.500 | So we need to code our language model. All right guys, so let's continue our journey by creating another file here called
02:41:00.620 | modeling_gemma.py
02:41:03.160 | Which will be our language model. So the language model that will decode the answer of the
02:41:09.980 | the answer
02:41:11.740 | Using the prompt or given by the user and the image that we have provided as input
02:41:16.300 | So we create this file. We import a little bit of stuff the usual stuff
02:41:21.740 | So torch, some math, typing, and then we import the SigLIP model that we have created before, so the vision model and the configuration that it needs
02:41:29.580 | Let's do a bottom-up approach which means that we first create the structure of the model and then we create each single component
02:41:40.400 | So let's do it
02:41:42.400 | Let's do it this one. All right
02:41:46.720 | Our main class will be called PaliGemmaForConditionalGeneration
02:41:52.720 | So why it's called conditional generation? Because we are conditioning the generation of text on the image that is provided as input
02:41:59.680 | This is why it's called conditional generation
02:42:01.680 | and also actually it's because of
02:42:04.240 | how we create the attention mask that we will see later because we are attending to all the tokens of the
02:42:09.760 | prompt of the user and all the tokens of the image
02:42:14.400 | Without any causality so it's used like a condition, but we will see that later. So
02:42:20.640 | The constructor accepts a configuration file, which we are going to create now
02:42:24.960 | It will create an instance of the vision model. So the encoder of the image it will create this multi-modal projector
02:42:32.000 | Which is a linear layer. Let's actually visualize it all these components
02:42:35.940 | So we go here and then we open this stuff. So basically the multi-modal projector is this
02:42:43.840 | linear layer you can see here linear projection
02:42:48.960 | the vision model is this
02:42:50.960 | contrastive vision encoder, and then we have Gemma for causal language modeling, which is our transformer decoder
02:42:58.000 | So this class, PaliGemmaForConditionalGeneration, is actually the class that will
02:43:02.080 | connect all these components together
02:43:05.760 | I don't know why my pen is not working my ipad pen
02:43:09.920 | Oh now it's working. It looks like so
02:43:12.640 | Yeah now it's working. Okay, let's continue
02:43:15.520 | All right, so we have created this it will create an instance of the language model
02:43:20.880 | It will save some stuff like what is the language model? What is the vision tower, which is the
02:43:26.960 | image encoder the multi-modal projector
02:43:28.720 | which is the linear layer that will convert the size of the embedding output by the
02:43:32.720 | vision encoder into the size of the embedding of each text token, so that they can be concatenated together
02:43:40.560 | We also save the padding token
02:43:43.200 | We need to create another method called tie weights and we will see later what is this about
02:43:51.200 | Or actually we can check now what this is about
02:43:55.280 | so tie weights basically means this so let's go back to our
02:43:59.440 | Here and let's open the attention mechanism. And actually let's open the transformer model
02:44:05.760 | so weight tying is a technique for kind of
02:44:08.800 | reusing the
02:44:11.820 | parameters of one layer into another
02:44:14.080 | And specifically in the case of language model most language models are in decoder only language model
02:44:19.600 | Which means that they are only made up of this part of the transformer without the cross attention
02:44:25.840 | So there is no this block here
02:44:27.840 | So it's they are made up of a self-attention with the normalization then a feed forward with the normalization a lot of layers like this
02:44:35.840 | so one after another then we have a final linear layer that projects the embedding output by these layers into
02:44:42.800 | Logits, and then we have the softmax to understand which of these tokens has the maximum
02:44:48.540 | Probability score given by the language model
02:44:50.540 | now in this
02:44:52.540 | the job of this linear layer is basically to convert the embedding of the
02:44:56.940 | Contextualized embedding output by the last layer of this series of layers
02:45:02.060 | into the vocabulary size, which is exactly the opposite of the job that this
02:45:07.500 | layer is doing. So the embedding layer is converting the token ids
02:45:14.140 | So the position of each token in the vocabulary into an embedding while this
02:45:18.300 | Linear layer here is doing exactly the opposite converting an embedding into its position in the vocabulary
02:45:25.660 | Many language models not all of them
02:45:27.660 | use a technique called
02:45:30.300 | Weight tying which basically shares the parameters of this layer and this layer because they are doing basically one the inverse job of the other
02:45:37.580 | Which is also a technique actually to reduce the total parameters of the model because if you are sharing these parameters
02:45:43.900 | you will
02:45:45.900 | You will reduce the number of parameters
02:45:47.900 | And in many language models this depending on the vocabulary size
02:45:51.180 | These parameters can be actually quite expensive on the overall total number of parameters of the model
02:45:56.220 | So it could be like 10% of the parameters in this layer here
02:45:59.180 | So if you are sharing them, you are actually reducing the number of parameters
02:46:03.020 | Let's say by 10% because depending on the how many
02:46:05.420 | Tokens you have in your vocabulary
02:46:08.300 | So we created this method here tie weight and later we will implement it also in the language model
02:46:13.420 | So in the gamma decoder language model
02:46:15.740 | That will tie the weights of these two layers
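A minimal sketch of what weight tying looks like, assuming an embedding layer called embed_tokens and an output projection called lm_head (the names are illustrative):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=32000, hidden_size=2048):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)      # token id -> embedding
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # embedding -> logits

    def tie_weights(self):
        # Share the parameters: lm_head now points to the very same tensor as the
        # embedding layer, saving vocab_size * hidden_size parameters.
        self.lm_head.weight = self.embed_tokens.weight
```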
02:46:18.780 | Okay, now that we have seen also this one. Let's go further, which is the implementation of the forward method. So
02:46:26.060 | So we implemented the forward method as follows so it accepts the input ids
02:46:32.940 | What are the input ids? The input ids will be the input ids extracted from this
02:46:39.480 | PaliGemma processor, which will be
02:46:42.920 | Some image tokens. So a lot of tokens like this one image image image image
02:46:47.720 | How many? Depending on the size of PaliGemma we are using
02:46:50.600 | Then it will contain a beginning of sentence token. Then it will contain the prompt of the user
02:46:56.840 | So for example, tell me where is this photographer and then a new line
02:47:01.560 | Character the token corresponding to the new line character
02:47:07.800 | Okay, so then we have the pixel values, which is the
02:47:12.200 | image extracted from this PaliGemma processor, which is the image
02:47:18.040 | loaded by this PaliGemma processor
02:47:21.000 | rescaled resized and
02:47:23.300 | Normalized using the mean and the standard deviation of this image net standard mean and standard deviation
02:47:31.020 | It is converted into an array and then into a tensor and then provided as is
02:47:37.480 | Then the goal of this PaliGemmaForConditionalGeneration will be to take this image and feed it to the image encoder to extract
02:47:44.120 | the image tokens
02:47:45.640 | Then we have this attention mask. The attention mask is provided directly by the tokenizer
02:47:49.880 | So whenever you tokenize text using a tokenizer, it gives you two output. One is the input ids and one is the attention mask
02:47:55.880 | Because we will not be using any padding the attention mask will be a series of one
02:48:00.360 | Later, we will see how we also need to modify the attention mask
02:48:05.640 | But actually we will not be modifying because we will not be using any padding so
02:48:09.800 | Yeah, then we have the KV cache, which we will talk about later when we actually use it
02:48:14.920 | So for now just consider it as something that you don't know anything about and later we will discuss
02:48:19.400 | Okay, so let's see that
02:48:27.880 | We have first we make sure that we are not using any padding because I didn't implement the code to manage the padding
02:48:34.440 | Then we extract the input embeddings of the text tokens and the image placeholder tokens
02:48:40.200 | So in the language model, we have added a fictional token called
02:48:44.840 | Image, so this token here
02:48:47.640 | Which will be converted into an input id so it will be converted into a number which corresponds to its position in the vocabulary
02:48:53.980 | What we are doing is we are converting all the input tokens
02:48:58.520 | Which are the image tokens the beginning of sentence token the tokens of the prompt plus the new line character
02:49:05.240 | into embeddings
02:49:06.920 | of course the embeddings produced by the image placeholder tokens will be
02:49:10.280 | Junk because we will not be using them because they do not correspond to the actual image features
02:49:14.920 | But later we will replace them inside of this one with the correct one
02:49:19.400 | so now we have this input embeddings the first thing we do is we
02:49:24.200 | Extract the features of the image and we do it like this
02:49:27.320 | So we feed the pixel values of the image, which is a tensor directly to the vision tower. So the vision tower is our
02:49:34.280 | Siglip vision model. So it means that we are using the forward method here. So we are feeding the pixel values here
02:49:41.640 | It will extract what it will extract some patches with their contextualized embeddings
02:49:48.440 | So it will for each image it will give us n
02:49:52.340 | Patches and each of these patches is a contextualized patch actually
02:49:56.180 | The second thing we are going to do is we are going to resize this embeddings image embeddings into the same size of the
02:50:05.380 | language model
02:50:07.780 | Embeddings
02:50:10.100 | And for that we do this other line
02:50:12.100 | So we take the image embeddings extracted by the vision encoder and then we resize them using a linear layer called the multi-modal projector
02:50:20.340 | So later we will see this is actually just a linear layer that will convert this embedding
02:50:25.300 | So this embed dimension extracted from the vision encoder into the hidden size
02:50:29.540 | Which is the same embedding size used by the language model for each of this each of its tokens
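Conceptually, the multi-modal projector is just a linear layer that resizes the vision embeddings to the language model's hidden size. A minimal sketch, with an assumed class name and assumed dimensions:

```python
import torch.nn as nn

class PaliGemmaMultiModalProjector(nn.Module):
    def __init__(self, vision_hidden_size=768, projection_dim=2048):
        super().__init__()
        # One linear layer from the vision embedding size to the language model size.
        self.linear = nn.Linear(vision_hidden_size, projection_dim, bias=True)

    def forward(self, image_features):
        # [batch, num_patches, vision_hidden_size] -> [batch, num_patches, projection_dim]
        return self.linear(image_features)
```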
02:50:34.420 | Now we need to merge the tokens extracted from the vision
02:50:41.300 | Model with the text token extracted from these embeddings which already contain some placeholders for where we should put the image tokens
02:50:50.420 | And for that we will create another method called
02:50:52.980 | Let me first paste it
02:50:55.700 | Called merge input ids with image features in which we pass the image features extracted from the vision encoder the input
02:51:02.740 | Embeddings extracted from the language model with which already contains the placeholders
02:51:07.720 | the input ids which are the original input ids fed to the
02:51:11.620 | the tokens fed to the language model, and the attention mask given by it, and the KV cache; later
02:51:17.540 | we'll see why we need the KV cache
02:51:20.500 | Suppose that these input features have been merged so we will get these input embeddings these input embeddings. What are they?
02:51:27.620 | Well, let's visualize it on the
02:51:29.700 | Oh, wait, where is it? My okay
02:51:33.300 | Uh, oops
02:51:36.980 | So let's go here
02:51:41.940 | So what we are doing is basically we are creating this stuff here. So we are taking the
02:51:46.660 | First we are taking the image features extracted by the vision encoder and these
02:51:49.860 | Features are here
02:51:52.500 | Then we are resizing them using this multimodal projector, which is this stuff here
02:51:57.300 | Which will resize the each embedding vector to the correct size so that they can be concatenated with the
02:52:03.300 | embeddings of the text tokens
02:52:06.240 | the text tokens
02:52:08.660 | When we tokenize them, they already contain some placeholder tokens, which are those image tokens
02:52:14.500 | We saw before in the processing_paligemma.py file
02:52:17.460 | Our goal is to replace each of them with the features extracted from this vision encoder after it has been resized by the multimodal projector
02:52:26.260 | And for that we will use this method here
02:52:29.060 | So this method takes the image features extracted after
02:52:31.940 | They have been resized the input embedding extracted from the language model which contains the text tokens and the placeholders
02:52:37.940 | And it will replace this stuff here
02:52:40.580 | So suppose that now it everything has been replaced. So we treat it as a black box
02:52:44.580 | What we are going to do we are going to feed all this sequence
02:52:47.300 | Which is a sequence of image features and the text tokens to the language model
02:52:51.620 | which will
02:52:53.540 | Use the prompt of the user which are these tokens and the image fed by the user to generate some text
02:52:59.380 | So let's implement this part here, which is just calling a method
02:53:04.100 | And it's very easy
02:53:09.540 | Because it's just calling a method and later we will implement this language model
02:53:13.620 | So for now, I created the structure of what we are doing
02:53:16.420 | So we extract first we tokenize the text the text already contains placeholders
02:53:21.060 | We replace these placeholders with the features extracted from the vision encoder. We feed everything to the language model. The language model will
02:53:27.060 | Generate some output and we return this output
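As a high-level sketch (not the exact implementation), the forward flow just described could look like this; the helper and method names and the return values are assumptions:

```python
def forward(self, input_ids, pixel_values, attention_mask, kv_cache=None):
    # 1. Embed the text tokens and the <image> placeholder tokens.
    inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
    # 2. Extract contextualized patch embeddings from the vision encoder.
    image_features = self.vision_tower(pixel_values.to(inputs_embeds.dtype))
    # 3. Resize them to the language model's hidden size.
    image_features = self.multi_modal_projector(image_features)
    # 4. Replace the placeholder embeddings with the actual image features.
    inputs_embeds, attention_mask, position_ids = self._merge_input_ids_with_image_features(
        image_features, inputs_embeds, input_ids, attention_mask, kv_cache
    )
    # 5. Let the language model produce the logits, conditioned on image + prompt.
    return self.language_model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        kv_cache=kv_cache,
    )
```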
02:53:30.020 | Now our goal is of course to implement all of these blocks that we have created that we have taken for granted for now
02:53:36.820 | The first thing that we can do is to implement this PaliGemmaConfig, which will give us some understanding of
02:53:41.700 | what kind of configuration this PaliGemma needs
02:53:44.500 | For that we need to create this PaliGemmaConfig
02:53:51.800 | Okay, the PaliGemmaConfig basically takes as input... so, what is PaliGemma?
02:54:00.340 | So what is Gemma? What is PaliGemma? And what is SigLIP?
02:54:05.300 | I think you should already have an understanding of it by now. So PaliGemma is all of this stuff here, all this stuff here
02:54:11.060 | So it's a combination of a vision encoder and a text decoder language model, so a Gemma model
02:54:17.060 | So it's composed of two parts
02:54:18.980 | It's composed of a SigLIP vision encoder along with a linear layer that will change the embedding size
02:54:24.660 | And it's made up of a language model called the Gemma language model
02:54:29.540 | So PaliGemma needs of course the configuration for this block here
02:54:33.860 | So the language model, and the configuration for the vision encoder, so that it can create an instance of
02:54:39.300 | this SigLIP class and of this Gemma language model, passing their own configuration to them
02:54:45.220 | And this is what you see here
02:54:47.300 | So you have the vision config, which is the configuration of the vision encoder, and the text config, which is the configuration of the text
02:54:53.140 | decoder, which is Gemma
02:54:56.340 | The ignore index is not used. We will not be using it for labels
02:55:00.340 | So if you are training, but we will only doing inference
02:55:02.820 | The image token index is the token corresponding to the placeholder image token. So the
02:55:08.500 | This token here. So let's this this stuff here
02:55:11.780 | The vocabulary size. So what is the vocabulary size of the model?
02:55:16.420 | the projection dimension is how
02:55:19.300 | What is the final dimension that the image features should be resized to before feeding to the language model?
02:55:25.940 | So what is basically the output size of this linear layer?
02:55:30.580 | Then we have the hidden size which is the embedding size of the language model
02:55:35.460 | So the language model has some tokens. These tokens are embeddings and these embeddings have a dimensions. How many dimensions?
02:55:41.780 | 2048 in the base version of gamma
02:55:44.900 | This stuff is something that HuggingFace needs we will not be using it
02:55:50.980 | We save the padding token id, in case it's passed, so we save the vision configuration
02:55:55.060 | We save the text configuration, and then we need the configuration of the text language model
02:55:59.220 | which is the Gemma model, to which we pass of course the text configuration, and to the vision encoder we pass the vision configuration
02:56:05.400 | We have how many
02:56:08.100 | number of tokens
02:56:10.100 | For image tokens each image will generate which is basically the size of the image divided by the patch size
02:56:17.140 | So it's actually how many patches you get for each image
02:56:21.300 | Um, which is also corresponds to how many image tokens you get here
02:56:26.500 | Because of course if you divide the image by four you get four patches
02:56:31.700 | If you divide it into smaller parts, you get more patches, and each PaliGemma size is different
02:56:36.420 | So PaliGemma 224, I think, has 256 tokens. Another one has more, etc, etc
02:56:44.100 | The projection dimension is the size we want to resize these image tokens to, etc
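A simplified sketch of the two configuration classes, mirroring the fields just described; the default values are illustrative only, since the real ones are loaded from the model's config.json:

```python
class GemmaConfig:
    def __init__(self, vocab_size, hidden_size, intermediate_size, num_hidden_layers,
                 num_attention_heads, num_key_value_heads, head_dim=256,
                 max_position_embeddings=8192, rms_norm_eps=1e-6, rope_theta=10000.0,
                 attention_bias=False, attention_dropout=0.0, pad_token_id=None, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size                    # embedding size of each token
        self.intermediate_size = intermediate_size        # feed-forward hidden size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads    # heads for the queries
        self.num_key_value_heads = num_key_value_heads    # heads for keys/values (grouped query attention)
        self.head_dim = head_dim
        self.max_position_embeddings = max_position_embeddings
        self.rms_norm_eps = rms_norm_eps
        self.rope_theta = rope_theta                      # base frequency of the rotary embeddings
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.pad_token_id = pad_token_id

class PaliGemmaConfig:
    def __init__(self, vision_config=None, text_config=None, ignore_index=-100,
                 image_token_index=256000, vocab_size=257152, projection_dim=2048,
                 hidden_size=2048, pad_token_id=None, **kwargs):
        self.vision_config = vision_config                # config dict for the SigLIP encoder
        self.text_config = GemmaConfig(**text_config, pad_token_id=pad_token_id)
        self.ignore_index = ignore_index                  # only needed for training labels
        self.image_token_index = image_token_index        # id of the <image> placeholder token
        self.vocab_size = self.text_config.vocab_size
        self.projection_dim = projection_dim              # output size of the linear projector
        self.hidden_size = hidden_size                    # embedding size of the language model
        self.pad_token_id = pad_token_id
```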
02:56:49.620 | So now let's create also the configuration for the Gemma model
02:56:52.660 | which is just the configuration of any language model, because it has
02:56:57.060 | a vocabulary size (how many tokens we have in our vocabulary), the hidden size, so what is the size of the embedding
02:57:04.820 | vector of each token, the intermediate size of the feed-forward layer, as we saw before
02:57:12.020 | in SigLIP, the number of hidden layers, so how many layers our transformer has in this Gemma language model
02:57:18.740 | How many attention heads we have? Okay here we have a difference
02:57:22.340 | This is called the grouped query attention when you have a different number of heads for the query and for the key and values
02:57:28.340 | the number of heads here refers to the number of heads for the
02:57:32.420 | Queries and the number of heads for the key and values is this parameter here. We will see later how it works
02:57:38.180 | The head dimension is how many
02:57:40.180 | Dimensions each head will work with as we saw before we divide this big embedding into smaller groups one dedicated to each head
02:57:47.860 | This is how many dimensions each head will watch
02:57:50.580 | Now this configuration. It's a hard-coded
02:57:53.560 | But actually it will come from the configuration file of the polygamma model that we will load
02:57:59.460 | So if you go to hugging face, you can see
02:58:02.100 | Hugging Face, PaliGemma
02:58:07.540 | You go to the 224 one, and you can see here
02:58:09.860 | We will load all this configuration from this config.json file
02:58:13.700 | Which as you can see contains this text config this visual config which contains exactly the information that we need here
02:58:20.500 | This max positional encodings indicates how much the maximum number of positions our model has been trained upon
02:58:28.740 | Which is necessary for the rotary positional encodings
02:58:33.380 | RMS norm, we will see later what RMS normalization is, but it's just like the layer normalization
02:58:39.460 | We have this parameter called rms_norm_eps. Okay, I will explain it later
02:58:43.940 | Actually, the rope_theta is another parameter of the rotary positional encoding, which is the base frequency
02:58:49.640 | And also we will see later. What is it?
02:58:52.100 | the attention bias
02:58:54.100 | Indicates if in the attention matrices
02:58:56.420 | we want the bias, because as you remember we have the Wq, Wk and Wv matrices
02:59:00.900 | These are linear layers and we can have also the bias term, but we I believe we never use the bias for this
02:59:07.300 | And it looks like we yeah, we don't use any bias for it. So if they don't overwrite it then it remains false
02:59:14.340 | Dropout just like before we are not going to use it and the padding token id and we save all this stuff. So nothing so
02:59:21.920 | sophisticated here. Now, the first thing that we are going to do, since we have already implemented PaliGemmaForConditionalGeneration,
02:59:28.160 | I believe that the first thing that we can do is this method here, merge input ids with image features
02:59:33.760 | But for that we will need to understand what the KV cache is
02:59:37.120 | All right. So let's start coding this method. So
02:59:40.800 | Let me go also here in the code that I have already written. So I will code it piece by piece
02:59:47.440 | So that we don't get lost in the explanation
02:59:51.760 | So we create this method which has this signature
02:59:54.340 | If you don't see it all it's this one here
02:59:58.240 | And let's extract. Okay. The first thing we do is we extract some information from the inputs
03:00:05.460 | Which are what is the embedding dimension from the image features because we need to
03:00:11.600 | Which are already resized
03:00:14.400 | Because we pass them after sending them through this multimodal projector
03:00:18.320 | So they have already been resized to the same size of the text tokens
03:00:22.000 | Then we have these input ids which tells us how many tokens we have the input ids
03:00:26.480 | If you remember correctly is the not the embedding of each token
03:00:30.080 | It's the number indicating the position of each token in the vocabulary
03:00:33.060 | While the input embeddings are the embedding of each token after they have been extracted from the embedding layer of the language model
03:00:40.880 | And that's why we have this
03:00:43.680 | It's a vector
03:00:48.400 | The first thing that we do is we scale these image features
03:00:54.000 | We scale these image features, which also helps. It's the same kind of scaling that we use in
03:01:00.400 | the attention mechanism, where we do query multiplied by the transpose of the keys divided by the
03:01:06.080 | square root of d_model. Here we do the same kind of scaling
03:01:11.760 | Because probably they have tried multiple variations of the model and we want the magnitude of the numbers to remain the same
03:01:18.480 | That's why we divide by the square root of the hidden size. So if, for example, you want to double the embedding
03:01:25.440 | size of the image features, you want the magnitude of the numbers to remain more or less the same. That's why you scale them
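In code, this scaling is just one line; the attribute name hidden_size is an assumption:

```python
# Same idea as the 1/sqrt(d) scaling inside the attention mechanism: keep the
# magnitude of the numbers stable regardless of the embedding size.
scaled_image_features = image_features / (self.config.hidden_size ** 0.5)
```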
03:01:35.360 | Now the first thing that we need to do is to create the final tensor that will hold the combined
03:01:41.380 | Features of the image tokens and the text tokens and this is and it's this tensor here
03:01:46.560 | It's made up of zeros and it has the size of batch size
03:01:50.000 | Sequence length. So what is sequence length? The sequence length is the number of input ids we have
03:01:55.520 | What are these input ids? The input ids that are coming from this processing polygamma
03:02:00.340 | class
03:02:02.720 | which are the placeholder for the image tokens the
03:02:06.000 | beginning of sentence token, the
03:02:08.640 | tokens of the prompt and the new line character
03:02:11.760 | So the token corresponding to the new line character
03:02:15.140 | So we create this sequence of empty embeddings of which size of embedding size dimension
03:02:23.140 | Embedding dimension which is the same size of the embedding vector of language model because the image
03:02:29.120 | Tokens and the text token will have the same size which is embedded dim here
03:02:33.680 | We want it to be of the same dtype
03:02:37.520 | so if the input embeds are floating point 32, this is too, and we put it on the same device
03:02:43.120 | The first thing that we do is we create some masks that will be useful for understanding which is a placeholder token
03:02:50.160 | Which is a text token and which is a padding token, even though we will not be using any padding
03:02:54.640 | So I just took the original implementation, which was already handling the padding, but we will actually never have padding tokens
03:03:00.720 | How to understand which one is a text token?
03:03:03.600 | Well, a text token is something that is not an image placeholder token and it's not a padding token
03:03:08.880 | What is an image token?
03:03:10.560 | Well something that is equal to the image placeholder token and the padding tokens are the tokens that correspond to the padding token id
03:03:18.480 | this mask will be
03:03:21.360 | useful for us to understand where to put the embeddings of the image tokens in this
03:03:25.920 | Final embedding tensor where to put the text token in this final embedding tensor and where to put the padding tokens in this final
03:03:32.080 | embedding tensor
03:03:34.080 | We expand them so
03:03:37.440 | Here we see them and later we will see why we need to expand them. So basically we are creating, I believe, a
03:03:44.160 | few more dimensions
03:03:46.960 | because we need to create the
03:03:49.120 | batch size dimension and the sequence dimension
03:03:52.100 | Actually, we already have the sequence dimension, because it's already given by the input ids
03:03:59.600 | We are creating the batch dimension and then we are expanding it to this embed
03:04:04.320 | dim dimension
03:04:07.200 | Later we will see why we need it. So basically this means that
03:04:10.560 | The text mask here. So let me draw a sample of how it may look like
03:04:17.600 | Oops, what did I do?
03:04:20.400 | the text mask here
03:04:22.400 | Will be something like this. So if suppose that the
03:04:25.520 | The input ids are the tokens corresponding to the image. So suppose that it's the
03:04:31.920 | 567 so we have
03:04:34.780 | So we have many tokens corresponding to the placeholder for the image then we have the beginning of sentence token suppose usually it's the
03:04:42.320 | token number one
03:04:44.480 | Then we have the prompt of the user
03:04:46.880 | So suppose that it's a token number 56 78
03:04:50.180 | and 99 and 21 and 11 then we have the
03:04:55.040 | Slash and token. So it's suppose it's the token number two
03:04:59.760 | What we the text mask here will be basically something that is like this so it will be zero zero zero zero zero
03:05:10.000 | And then it will be one one one one one one and then it will be zero
03:05:16.480 | uh, actually one because the slash n is still part of the
03:05:20.080 | text the image tokens mask will be
03:05:24.400 | one one one one one and then a series of zero because all the others are text tokens
03:05:30.800 | And the padding will be
03:05:34.400 | Equal to all zeros. So I don't write all of them, but you can understand all zero because we don't have any padding token
03:05:43.280 | Then we are expanding them to
03:05:45.280 | This expand basically repeats these zeros and ones along this dimension the embedding dimension that we are adding here with this unsqueeze
03:05:53.940 | And we will need it later for the for another method, which is the wear method
03:05:59.040 | So for now, just keep in mind. We are just expanding this token by repeating this series of zero and one along a new dimension
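A sketch of these three masks and their expansion, assuming attribute names like config.image_token_index and pad_token_id:

```python
# Assumes input_ids, inputs_embeds, self.config.image_token_index and self.pad_token_id
# are available; these names are illustrative.
batch_size, seq_len = input_ids.shape
embed_dim = inputs_embeds.shape[-1]

# True for real text tokens: not an image placeholder and not padding.
text_mask = (input_ids != self.config.image_token_index) & (input_ids != self.pad_token_id)
# True for the <image> placeholder tokens.
image_mask = input_ids == self.config.image_token_index
# True for padding tokens (always all False here, since we never use padding).
pad_mask = input_ids == self.pad_token_id

# Expand from [batch, seq_len] to [batch, seq_len, embed_dim] so the masks can be
# used element-wise against the embedding tensors.
text_mask_expanded = text_mask.unsqueeze(-1).expand(-1, -1, embed_dim)
image_mask_expanded = image_mask.unsqueeze(-1).expand(-1, -1, embed_dim)
pad_mask_expanded = pad_mask.unsqueeze(-1).expand(-1, -1, embed_dim)
```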
03:06:05.060 | So the first thing that we do is we copy the text
03:06:09.660 | Embeddings into this final embeddings and we do this by using this method. So we say this final embeddings
03:06:16.000 | This wear method basically says that if this condition is true
03:06:20.620 | It will take the input from the second argument. Otherwise, it will copy the third argument
03:06:26.620 | So if wherever this condition is true, it will copy this stuff here wherever this condition is false. It will copy this stuff here
03:06:36.300 | We are saying that whenever
03:06:40.140 | The the text mask is one
03:06:42.380 | We copy the embedding from the input embeds which correspond to the text inputs plus the placeholder for the image
03:06:49.740 | But we will only be copying the text
03:06:51.740 | Text tokens because for the image image tokens, we will have zero in this mask
03:06:58.940 | Otherwise just keep the final embedding as it is
03:07:02.700 | Then we add the image tokens
03:07:06.860 | As you can see here
03:07:08.860 | which is using another method called masked_scatter, and we cannot use torch.where because the sequence length of
03:07:17.980 | Image scaled is not equal to the sequence length of the final embedding
03:07:22.300 | But basically this does the same job as the where
03:07:25.500 | So what we are saying is that copy from the scaled image features where this stuff is true
03:07:33.500 | So we are copying the image features where the image mask is true
03:07:38.620 | Where we have the placeholder tokens for the image so we are copying in the final embedding the image tokens
03:07:44.140 | Where before we had the placeholders?
03:07:46.640 | Then we copy the padding
03:07:50.620 | And the padding we just zero out everything because we don't care about what is in the paddings
03:07:55.840 | So what we are saying is that wherever the padding mask is true
03:07:59.100 | Just copy a zero a tensor made up of zero. Otherwise keep the final embedding as it is
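Putting the masks to use, a sketch of how the final embedding tensor is assembled (it assumes the variables from the previous sketch, plus the scaled image features):

```python
import torch

# Assumes the masks from the previous sketch plus scaled_image_features and inputs_embeds.
final_embedding = torch.zeros(
    batch_size, seq_len, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
)
# Copy the text embeddings where the text mask is True, otherwise keep the zeros.
final_embedding = torch.where(text_mask_expanded, inputs_embeds, final_embedding)
# Insert the projected, scaled image features where the image mask is True.
# masked_scatter is used because scaled_image_features has a different sequence length
# than final_embedding, so torch.where cannot broadcast it.
final_embedding = final_embedding.masked_scatter(image_mask_expanded, scaled_image_features)
# Zero out the padding positions (a no-op here, since we never pad).
final_embedding = torch.where(pad_mask_expanded, torch.zeros_like(final_embedding), final_embedding)
```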
03:08:03.980 | Now comes the interesting part so for now we have created the final embeddings
03:08:10.620 | What is the final embeddings is this stuff here. So let me show you again from the ipad. It's this stuff here
03:08:16.620 | So now here we have the first image token embedding
03:08:20.140 | second image token embedding third image token embedding blah blah up to
03:08:25.960 | 256 image token embeddings in the base version of PaliGemma, if I remember correctly
03:08:30.360 | And then we have the embeddings of the tokens corresponding to the prompt
03:08:35.080 | Plus the padding but the padding we will never have because I excluded it from my implementation
03:08:40.620 | So now we come to the interesting part
03:08:44.440 | Which is the creation of the attention mask and the attention mask has to be created in a particular way
03:08:50.280 | based on
03:08:52.120 | How we are working with the KV cache
03:08:55.320 | And for that I need to introduce the KV cache. So that's why this part is interesting. So let's go
03:08:59.880 | So let's talk about this thing called KV cache
03:09:02.920 | But before we talk about the KV cache, we need to understand what is the problem that the KV cache is solving
03:09:08.280 | So when we train a language model
03:09:10.840 | So as I we saw before the transformer can be thought of as a model as it's a sequence to sequence model
03:09:16.680 | Which means that you feed it a sequence of n tokens and you get as output n tokens
03:09:22.440 | These n tokens as output are not normal tokens anymore
03:09:25.960 | They are contextualized tokens means that each of them is not capturing information only about itself
03:09:30.920 | But also about other tokens which depend on the mask that you use if you use the causal mask
03:09:35.880 | It means that only each token will only capture information about itself and all the previous tokens
03:09:41.320 | If you are not using any causal mask, then each token will encapsulate information about all the other tokens in the sequence
03:09:47.800 | Which is what we do with vision encoders like the image encoder we saw before the Sigleap one
03:09:52.280 | Because the transformer is a sequence to sequence model, so let's open our ipad
03:09:59.400 | Now because the transformer is a sequence to sequence model
03:10:02.760 | It's very useful during training
03:10:05.960 | So suppose that we want to train we train a language model on the following sentence. So it's always the same which is
03:10:15.240 | pepperoni
03:10:17.400 | Pizza
03:10:19.400 | Pardon my calligraphy I write very fast recently we feed it to this black box that we will call the transformer model
03:10:30.040 | Each of these stuff here each of these uh tokens is actually an embedding
03:10:37.880 | So we will get an as output a list of embeddings, but they will be contextualized
03:10:44.260 | Contextualized one for the first token one for the second token. So this is the second embedding
03:10:49.140 | This is the third embedding and this is the fourth embedding
03:10:52.260 | I am again making the simplification that each word is a token and each token is a word
03:10:56.500 | How we train a language model?
03:10:58.900 | Well, we force the language model to predict the next token given the contextualized embedding
03:11:04.980 | So this contextualized embedding here contains information only about the word I in case we are using the causal mask
03:11:11.860 | so let's here is
03:11:14.580 | This only contains information about the token I
03:11:17.060 | This contains information about the token I but also the token love, this contains information about the tokens I love
03:11:24.820 | pepperoni, and this contains information about all the tokens: I love
03:11:31.700 | pepperoni
03:11:34.800 | Pizza
03:11:36.820 | What labels do we use when training a language model
03:11:39.460 | Well, in this case, we want the language model, given the prompt, to predict what is the next token
03:11:45.460 | So given only I, the language model should predict the word
03:11:49.780 | love, so the label here is love
03:11:53.860 | Given only the tokens I love, so the prompt I love, the language model should predict the token pepperoni
03:12:04.580 | Given the token the prompt I love pepperoni the language model should predict pizza
03:12:09.720 | And given all the sentence it should say end of sentence so it means hey i'm done with the generation
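In code, this training objective is just cross-entropy on labels shifted by one position; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    # logits: [batch, seq_len, vocab_size], input_ids: [batch, seq_len]
    shift_logits = logits[:, :-1, :]   # predictions made at positions 0 .. n-2
    shift_labels = input_ids[:, 1:]    # targets are the tokens at positions 1 .. n-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```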
03:12:17.080 | Now this is how we train a language model. How do we actually inference a language model is the same way
03:12:24.740 | So we start with what is known as a prompt
03:12:27.220 | so suppose that the user only gives us one token as a prompt the word I
03:12:32.340 | And suppose that our language model has been trained on the sentence before so I love pepperoni pizza
03:12:37.220 | How can we generate the entire sentence? Well, we feed this single token to our black box, which is our transformer
03:12:44.420 | So now I will write it reversed because I don't have space above
03:12:48.100 | transformer
03:12:50.980 | The transformer will generate it's a sequence to sequence model, which means that it takes as input one embedding
03:12:56.960 | Corresponding to our prompt token I and it will generate one contextualized embedding
03:13:02.420 | So it will be one embedding what do we do with the language models we project this single embedding into logits
03:13:10.740 | so we use the linear layer at the
03:13:13.360 | Output of the of the transformer, which is this stuff here
03:13:18.640 | To generate logits for this token. So let's go back here
03:13:26.160 | This this is the output embedding so out
03:13:31.420 | embedding
03:13:33.120 | We convert it through the linear layer
03:13:35.200 | into logits
03:13:38.400 | This logits tell us what is the score assigned by the language model to each token
03:13:45.200 | So how likely that particular token is the next one to convert it into a probability score?
03:13:51.600 | So something that sums up to one we use the softmax. So suppose that we have already applied the softmax
03:13:57.700 | Actually, let's apply it softmax. So
03:14:01.680 | It will remain a single embedding
03:14:04.880 | Sorry a single logits token, but the difference is that now they sum up all to one
03:14:10.960 | Which one we select the one with the highest number usually this is called a greedy strategy
03:14:16.240 | There is another strategy called the top p which means that we sample from the top the tokens with the top score
03:14:23.920 | Up to 90 percent. So suppose that there are three tokens here
03:14:28.240 | Okay, actually the top we will see later when we implement the inference for now
03:14:31.760 | Just think that we are always sampling the one with the highest probability score. So we use the greedy strategy
03:14:38.960 | using the greedy strategy
03:14:40.560 | What will happen is that probably the model if it has been trained well, it will tell us that the next token is very likely the token
03:14:48.080 | Love so this is how we know. What is the next token?
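Greedy sampling is a couple of lines, assuming a logits tensor of shape [batch, seq_len, vocab_size]:

```python
# We only look at the last position's logits to pick the next token.
probs = torch.softmax(logits[:, -1, :], dim=-1)   # probabilities over the vocabulary
next_token = torch.argmax(probs, dim=-1)          # index of the most likely next token
```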
03:14:51.840 | How do we generate then the next next token? We take this token love
03:14:56.400 | This token love and we put it back into the input of the language model
03:15:02.320 | So now we feed a new input to the language model. Let's remove this stuff
03:15:08.560 | Delete
03:15:10.560 | Now we are feeding two tokens to the language model
03:15:13.280 | Language model is our transformer model. So it's a sequence to sequence model
03:15:17.520 | It means that it takes as input two tokens. It will output two tokens
03:15:21.200 | So it's taking as input two embeddings. I am drawing here the text
03:15:25.920 | But actually you need to consider that these are two embeddings of these two tokens
03:15:30.160 | So we feed two embeddings. It will output two embeddings
03:15:33.860 | one corresponding to the token I
03:15:38.320 | One corresponding to the token I love
03:15:40.640 | Very ugly writing. So let me write it better
03:15:45.040 | one corresponds to the token I so the first position one corresponds to the second position which is
03:15:51.040 | Because this is a contextualized embedding. It will include information about
03:15:55.280 | I and love
03:15:58.000 | Now before what we did was to project this output embedding into logits here
03:16:03.920 | We have two embeddings which one should we project into logits? Of course. It's the second one. Why?
03:16:10.560 | because
03:16:13.120 | This embedding includes information about the two tokens. So it's like we are using the entire prompt. So what we do is we
03:16:20.000 | Send it to our linear layer
03:16:23.920 | Linear layer
03:16:27.520 | It will become logits. So let's write actually logits
03:16:31.300 | Then we apply this thing called softmax which will convert this logits into
03:16:37.600 | probability scores
03:16:40.080 | How do we understand what is the next token?
03:16:42.080 | Using I love as prompt. Well, we sample from the softmax which one the one with the highest score. So
03:16:48.240 | We take the one with the highest score as the next token so if the language model has been trained
03:16:54.960 | Well, it will be the token pepperoni
03:16:57.680 | So it will be the token
03:16:59.680 | Pepperoni
03:17:03.680 | Now, what do we do? How do we generate the next next next token? We take this word pepperoni
03:17:08.720 | We feed it back into the language model and we ask again the language model. Hey generate the next token
03:17:14.080 | Let's delete this stuff here I love
03:17:19.520 | Pepperoni
03:17:26.480 | We feed it to the language model
03:17:28.160 | We are feeding three tokens to the language model which are converted into three embeddings then are fed to the transformer
03:17:33.600 | The transformer will output three output embeddings
03:17:36.660 | one corresponding to each position
03:17:39.280 | Now without writing the first position will correspond to a contextualized embedding that only includes information about the token I
03:17:48.560 | the second
03:17:50.560 | Embedding contextualized embedding will include information about I and the love the third contextualized embedding will include information about I love
03:17:58.240 | Pepperoni, which one should we project?
03:18:00.540 | Of course the third one because it's the one that encapsulates information about all the prompt
03:18:05.760 | So this way we keep doing this way and we generate
03:18:11.360 | Now, what is the problem here? The problem is that at every step of inference
03:18:15.280 | We are generating a lot of embeddings. Suppose that the prompt is very large
03:18:20.320 | A lot of embeddings that we are not using so we are creating them because the transformer is a sequence to sequence model
03:18:25.680 | It's generating them
03:18:26.960 | But then we are only projecting one single embedding to the logits and then to the softmax to understand what is the next token
03:18:33.760 | And as you know, the transformer model uses this thing called attention mechanism and the attention mechanism generates this matrix
03:18:40.800 | That is a sequence by sequence, which is the attention scores matrix that we saw before
03:18:44.560 | which means that when you have a thousand tokens
03:18:48.960 | It will generate a matrix that is a thousand by one thousand, so it's one million numbers in that matrix
03:18:54.240 | So it's a huge matrix and then you only need to use a part of this matrix that will generate this embedding here
03:19:00.480 | So is there a way to not generate the embeddings that we are not going to project into logits?
03:19:06.160 | But only generate the one that we only need to generate the next token
03:19:10.320 | Yes, and it's possible through what is known as the kb cache and the trick is here. So now let's open this other slide
03:19:18.000 | The trick is this one. So when we calculate the
03:19:21.200 | attention matrix, so the query multiplied by the transpose of the keys divided by the square root of
03:19:27.360 | d_model, or d_head in case we have multi-head attention
03:19:31.040 | What we are getting is suppose that we want to generate the word pizza by using the prompt I love pepperoni
03:19:39.060 | If we do it naively we will pass all these
03:19:44.620 | Embeddings, so I love and pepperoni to the transformer. The transformer will convert them into query key and values using the projection
03:19:51.340 | wq wk and wv
03:19:53.480 | Let me check if my yeah, it's still working
03:19:58.700 | It will convert them into query key and values and now then we use the query key and values to calculate this
03:20:04.940 | Matrix here. So the query multiplied by the transpose of the keys, which is this matrix here
03:20:10.860 | Then we multiply this matrix by the v matrix with by the v sequence and it will give us the output
03:20:17.180 | of the
03:20:19.240 | Attention, which is contextualized embedding you can see here and we saw also before that when we multiply by v
03:20:24.460 | We are doing what is known as a weighted sum using these weights as weights in this weighted sum
03:20:32.780 | When this is the input of the model
03:20:34.620 | So the input of the model is I love pepperoni and the output that we are getting is a three contextualized
03:20:39.440 | Embeddings so the embedding corresponding to only to the word I the embedding corresponding to the word
03:20:45.020 | I love and the embedding corresponding to the I love pepperoni
03:20:47.760 | We know that we only need this one here because this is the only one that we need to project into logits
03:20:53.980 | And then to generate the next token. So is there a way to not compute these two stuff here that we will not be using?
03:21:00.940 | Yes, and the trick is here
03:21:03.420 | The trick is this
03:21:05.420 | Embedding contextualized embedding here is the result of the multiplication of this matrix by this matrix
03:21:12.700 | but not all of this matrix by the v sequence, but only the last row of this matrix by the v sequence because
03:21:20.700 | This number here comes the the number
03:21:24.300 | Let me okay
03:21:25.500 | Then this number here comes from the result of the dot product of this row here
03:21:31.820 | With all the columns of this matrix here
03:21:34.220 | So this number here comes from the dot product of
03:21:38.700 | the last row of this matrix with the first column of this matrix; the second number in this output
03:21:44.540 | vector comes from the dot product of the last row of this matrix with the second column of this matrix
03:21:51.580 | the third number here comes from the
03:21:55.500 | Dot product of the last row of this matrix with the third column of this matrix, etc, etc for all the 128 dimensions
03:22:02.720 | So what we need to generate only this one is the last row of this matrix, but all the v sequence
03:22:09.420 | So basically to have
03:22:15.500 | Because the attention matrix as we saw before we can consider the rows
03:22:20.460 | To be the queries and the columns to be the keys to have only this last row here
03:22:26.540 | We need only the last token as query
03:22:30.060 | But all the previous tokens including itself as keys and we need also all the tokens as values
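You can verify this claim numerically: attention computed with only the last query, but all keys and values, matches the last row of the full attention output. A small check with random toy tensors:

```python
import torch

torch.manual_seed(0)
seq_len, d_head = 3, 8                   # e.g. the prompt "I love pepperoni"
q = torch.randn(seq_len, d_head)
k = torch.randn(seq_len, d_head)
v = torch.randn(seq_len, d_head)

def attention(queries, keys, values):
    scores = queries @ keys.T / d_head ** 0.5       # [num_queries, num_keys]
    return torch.softmax(scores, dim=-1) @ values   # weighted sum of the values

# Causal mask omitted: for the *last* row it makes no difference, since the last
# token is allowed to attend to every position anyway.
full_output = attention(q, k, v)        # all rows of the output
last_only = attention(q[-1:], k, v)     # only the last query, but all keys/values

print(torch.allclose(full_output[-1:], last_only))  # True
```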
03:22:37.740 | That's why what we do is the following when we generate text with a language model
03:22:44.460 | What we do is
03:22:46.940 | Imagine we have a prompt
03:22:50.300 | Let me draw in such a way that it's not confusing. So I think we can continue here. So
03:22:55.900 | Imagine we start again the process of generation of text, but this time we do it with the kv cache
03:23:02.380 | So we start with one token. Let me do it
03:23:05.420 | Top to bottom. Otherwise, it gets confusing because before I did top to bottom. So
03:23:10.140 | Okay, we use only the token i as input to the language model
03:23:14.620 | The language model will convert it into an embedding blah blah blah, then we feed it to the transformer
03:23:19.120 | Suppose that it's only made up of one layer. Actually, it's a series of layers
03:23:23.260 | uh this
03:23:26.140 | Single token will be converted into query key and values. So it will be a sequence of tokens
03:23:32.540 | But in this case, we only have one
03:23:34.620 | So the q sequence will be one token. The k sequence will be one token. The v sequence will be one token
03:23:40.380 | We do this thing called self attention
03:23:46.240 | Which will calculate that matrix so the query multiplied by transpose of the keys which will be a matrix that is one by one because
03:23:52.080 | We only have one token
03:23:54.080 | And then we multiply it by v so it will result in only one contextualized embedding as output
03:23:59.920 | So it's this stuff here what we do we project it into logits
03:24:03.700 | Which is another vector then we convert it into softmax which is another vector
03:24:13.920 | And then we sample the next token
03:24:20.720 | The difference with the kv cache is that whenever we pass a token to the input of this self attention
03:24:28.580 | We cache the key sequence and the v sequence into a buffer called the kv cache
03:24:35.280 | so now imagine that there are
03:24:37.600 | There is a box here called the kv cache
03:24:40.240 | That initially is empty. But after we pass the token I
03:24:43.760 | It will contain the embedding. So the q embedding. Sorry the k embedding corresponding to the token I
03:24:50.960 | And also this is the kv cache. So it is made up of the key cache and the v cache
03:24:56.240 | This is the key cache
03:24:59.040 | Then we have the v cache which is initially empty
03:25:01.440 | But after we send in the first token, we save this v sequence. It only contains one token. So we save it here
03:25:08.320 | So it's the token I
03:25:10.320 | We compute the self attention
03:25:14.080 | Using the query key and values. It will result in only one output embedding. We project it into logits
03:25:21.120 | We project it into softmax. We sample. What is the next token? Very probably it will be the token love
03:25:26.640 | What do we do now
03:25:30.560 | What we did before was that we took this word love
03:25:33.920 | Put it back inside of the prompt and then ask the language model again. What is the next token?
03:25:38.480 | But with the kv cache we do something different
03:25:40.640 | With the kv cache. We always take the previously generated token. So in this case is the token love
03:25:47.200 | We use it as input
03:25:50.400 | Only the single token love
03:25:54.880 | Let me delete a little bit here
03:25:59.440 | And we use this single token as input to the language model
03:26:03.520 | Now what happens is that we feed the transform this single token love into its embedding which is an
03:26:10.720 | Uncontextualized embedding we feed it to the first layer of the transformer as a query key and values for now
03:26:16.720 | The query key and value contains only one token the token correspond the embedding corresponding to the token love
03:26:22.960 | however
03:26:25.200 | when doing self attention
03:26:27.280 | We don't use only one single token
03:26:29.600 | for love
03:26:31.760 | For the key for the keys and values we take this single token love we append it to this buffer called
03:26:39.200 | Kv cache. So now it contains love here for the values. Also it contains love
03:26:45.120 | And then we use this buffer as the key and value sequence in the self attention
03:26:50.640 | So we take this token love we convert it into query key and value the query key and values are one single token
03:26:57.600 | But the query the key and value we append them each of them into their respective buffer here
03:27:03.520 | And then we use the content of this buffer to calculate the self attention
03:27:08.400 | What happens is that we have only one query, but now we have two keys and two values
03:27:13.440 | Which will result in exactly the calculation of this last row of this matrix
03:27:21.360 | That the last row that we are interested in to predict only the next token and not generate all the other contextualized embedding
03:27:28.800 | In this case, we are only seeing
03:27:31.520 | Two tokens, but later we will see with the third token. It will be exactly the last row of that matrix
03:27:36.560 | anyway
03:27:39.360 | The output of this self attention because we have one query two keys and two values
03:27:43.680 | I can guarantee mathematically it will be one single embedding you can verify by yourself
03:27:48.800 | But basically if you have one query as you saw before the self attention mechanism
03:27:52.820 | Will generate a matrix that is a sequence by sequence
03:27:55.760 | But in this case, the rows of this matrix are defined by how many queries you have. So we have only one
03:28:01.360 | And we have however two keys
03:28:04.240 | So the key number one and the key number two
03:28:06.400 | So it will be a matrix that is one by two, and it will result in only one output embedding token when you multiply it by V
03:28:16.240 | And we saw that before actually when we calculated the dimensions of the output embedding
03:28:20.800 | We saw that it's only the last row that generates the last embeddings and this is exactly what we are doing here
03:28:26.320 | Anyway, this the self attention calculated like this
03:28:30.240 | So using the query the single token, but as keys and value the content of the buffers the keys and the kv cache
03:28:36.960 | To calculate the self attention we result in only one output embedding
03:28:41.200 | Which is exactly the contextualized embedding that we are interested in to generate the next token
03:28:46.160 | We project it into logits, we apply the softmax, and it will result in the next token being
03:28:50.560 | pepperoni
03:28:53.500 | Naively, what we did before was take this word pepperoni and feed it back into the prompt and then feed all the prompt to
03:29:00.240 | The language model but with the kv cache it's different. So we use the last generated token pepperoni
03:29:08.960 | Let me write it all pepperoni
03:29:10.960 | We feed it to we convert it into a single embedding
03:29:15.140 | So the query key and value here are one single token
03:29:20.080 | But before computing the self attention, we put this key and value inside each of their buffers
03:29:27.520 | So now the buffer for the k contains pepperoni as well
03:29:31.280 | And also the v contains pepperoni
03:29:36.080 | Then to calculate the self attention we don't use this key and v we use the content of the kv cache because it contains three tokens
03:29:43.360 | So as query we use only one token, which is the word pepperoni
03:29:46.660 | But as key and v we use the content of the kv cache. So it will result in a matrix that is
03:29:51.360 | Exactly the last row that we saw here because it's exactly this one now because we have as a query
03:29:58.480 | Only the word pepperoni, and as keys we use the tokens "I love pepperoni"
03:30:03.440 | Which will result when multiplied with the v sequence, which is three tokens because we have also the v cache
03:30:08.640 | Will result exactly in the computation of this output embedding here, which is only one single embedding
03:30:15.780 | Which is exactly the one that we need to predict the next token, which will be
03:30:20.480 | the token pizza, I guess
03:30:23.120 | Etc etc. So this is the kv cache this kv cache basically allow us to during inferences
03:30:30.640 | So during token generation to avoid generating all the embeddings
03:30:34.580 | Of all the input sequence, but only generate the last
03:30:38.400 | Embedding contextualized embedding which is exactly the one that we need to we need to predict the next token
03:30:44.960 | There is another thing that we need to know about the KV cache, which is the pre-filling. For the pre-filling: basically, we started here
03:30:53.280 | With a single token as a prompt of the user
03:30:56.720 | So we only use the word I but usually the prompt is a little longer. So it's not only one token from the user the user
03:31:03.840 | maybe
03:31:04.960 | Suppose that the user uses multiple tokens, so it uses the word I love
03:31:09.280 | What we do is because we have already access to all the tokens of the prompt of the user
03:31:17.040 | We are not generating them. We can pre-fill instantly using all of the prompt
03:31:21.440 | of the user
03:31:23.520 | All the kv cache corresponding to the prompt of the user so we can do instead of doing first adding I and then adding love
03:31:30.320 | We add both of them in the same forward pass. How to do that?
03:31:34.480 | We take we use both of them. We convert them into embeddings
03:31:38.080 | So it will result in two embeddings. We feed it to the language model as query key and values
03:31:42.320 | Initially, the kv cache is empty
03:31:44.720 | This will result in a query sequence of two tokens, a K sequence of two tokens and a V sequence of two tokens
03:31:52.960 | We put the k and the v inside of their respective buffer called the k buffer and the v buffer which comprise the kv cache
03:31:59.920 | So now it contains I and love
03:32:03.120 | this contains I and love
03:32:06.240 | then we
03:32:09.360 | Calculated the self-attention
03:32:10.560 | So now we have two tokens for the query two for the keys two for the values because the content of the kv cache contains
03:32:16.000 | two tokens
03:32:17.440 | Which will result in a two by two matrix, so it will result in two output embeddings
03:32:23.460 | And two output softmaxes. Which one do we project into the logits? Only the last one,
03:32:32.640 | Because we are not interested in predicting the word love. We are only interested in knowing what comes after love. So we only take the
03:32:40.800 | Embedding corresponding to the position of the word love we project it into logits
03:32:47.460 | And we project it into softmax to understand what is the next token
03:32:50.740 | So only during this pre-filling phase we actually allow the generation of multiple output embeddings
03:32:57.960 | And then we discard the one that we don't need
03:33:00.900 | Why do we do it because we don't want to add one single token at a time because it will be too slow
03:33:06.180 | If you have a lot of tokens, you just add them all at once in the kv cache
03:33:10.260 | And then you use this kv cache which is pre-filled now to generate one token at a time
03:33:16.420 | The reason we do it is because the gpu is very fast at parallelizing stuff
03:33:20.740 | So it's very good at parallelizing computations
03:33:22.900 | So actually by doing all of these computations inside of the gpu
03:33:26.740 | Will result in much less wall-clock time than adding one token at a time
03:33:30.820 | And this, guys, is the KV cache. So now we can finally code it
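For reference, here is a minimal sketch of the per-layer key/value buffer described above. The class and method names, and the [batch, num_kv_heads, seq_len, head_dim] layout, are assumptions, not necessarily the exact code written later in the video:

```python
from typing import List, Tuple
import torch

class KVCache:
    def __init__(self) -> None:
        # One buffer per decoder layer, shaped [batch, num_kv_heads, seq_len, head_dim].
        self.key_cache: List[torch.Tensor] = []
        self.value_cache: List[torch.Tensor] = []

    def num_items(self) -> int:
        # How many tokens are currently stored (0 before pre-filling).
        return 0 if len(self.key_cache) == 0 else self.key_cache[0].shape[-2]

    def update(
        self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx: int
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        if len(self.key_cache) <= layer_idx:
            # Pre-filling: the keys/values of the whole prompt become the initial cache.
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:
            # Token generation: append the single new K and V along the sequence dimension,
            # then attend over the whole buffer using the single-token query.
            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```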
03:33:34.340 | Okay, let's code the next part. So we copy this part here and all of this
03:33:41.380 | And all of this actually let's copy it all
03:33:46.180 | So now that we know what is the kv cache
03:33:48.100 | We know that we have two parts to do when we work with the kv cache
03:33:51.700 | One part is called pre-filling and one is token generation. During the pre-filling, we send all the prompt of the user
03:33:57.220 | to the model,
03:34:00.340 | using it as query, key and value, and this will create the initial cache that will then be used
03:34:07.320 | during token generation, where we generate one token at a time
03:34:11.300 | Why do we do these two phases? Because the prompt is already available to us,
03:34:15.540 | We don't want to add one token at a time, while during the token generation
03:34:19.300 | We want to generate one token at a time because we don't have these tokens
03:34:22.100 | so to create the attention mask for the
03:34:25.140 | for working with the kv cache basically, so
03:34:28.500 | when we are working with the pre-filling phase, we will have that the
03:34:32.980 | Number of queries key and value will be the number of the tokens inside of the prompt. So we generate a mask that is
03:34:40.180 | sequence by sequence
03:34:42.180 | Because it will be used in the attention mask. So let's visualize it actually
03:34:46.260 | so suppose that we are doing the following so
03:34:50.900 | This suppose that we receive a prompt that is I love pepperoni and we want to generate the next token, which is pizza
03:34:58.180 | The attention calculation will result in the following attention score
03:35:02.660 | So it's a matrix that is three by three, in which we want to mask out some
03:35:07.840 | interactions between tokens; in particular, each query cannot attend to future keys
03:35:12.400 | And the way we do that is we create an attention mask
03:35:16.560 | Of the same size of the attention matrix as you can see so three by three. So sequence by sequence
03:35:21.680 | in which we
03:35:24.400 | Before we apply the softmax. We add this thing called mask to this
03:35:28.560 | matrix
03:35:31.280 | And this mask is made up of minus infinities for all the position in which we don't want any interaction to happen
03:35:38.160 | And this is what we are doing here. So at the beginning we create
03:35:41.920 | We are inserting the prompt of the user and we should mask out future tokens, however
03:35:51.680 | And we create a mask that is sequence by sequence
03:35:54.960 | So this is during the pre-filling, so when the KV cache is None or the KV cache does not contain any item, it means that we are
03:36:01.040 | Doing it for the first time. So we are pre-filling the prompt of the user
03:36:04.160 | Now, we are not adding any minus infinity value to this attention mask during the pre-filling. Why?
03:36:12.560 | To understand that, we need to understand how PaliGemma attends to the
03:36:17.120 | Image tokens and to the prompt of the user. So for that, let's open the page of
03:36:24.540 | PaliGemma
03:36:26.540 | And here we can see the attention mask
03:36:28.540 | So a prompt in PaliGemma is made up of the image tokens, which are 256 in the case of the smallest PaliGemma
03:36:37.760 | Then we have the prompt of the user which is a beginning of sentence token plus the prompt of the user
03:36:43.180 | So for example, the prompt of the user may say extract where the photographer is in this picture
03:36:48.060 | And then we have a separator token, which is the new line token we saw before
03:36:53.420 | As you can see the attention mask here is not masking out anything for the part that corresponds to the
03:37:00.300 | Prompt because the prompt of the user is made up of the prompt
03:37:04.220 | So the textual prompt plus the image and we don't mask out anything. Why? Because and it's quite
03:37:11.420 | and it's different than what we usually do with language models because
03:37:15.500 | for the image tokens
03:37:17.900 | We can understand that we don't mask out anything because each text token that we will generate needs to access all the image tokens
03:37:25.020 | So it will be conditioned on all the image tokens. That's why it's called conditional generation
03:37:29.120 | And that's fine, because we saw that each image embedding is encoding
03:37:36.080 | Not only itself
03:37:37.740 | But also all the other embeddings, and we want each text token to watch all the image tokens to be predicted, and that's fine
03:37:43.500 | The point is: why is the
03:37:46.540 | prompt not causal?
03:37:49.740 | So as you can see the first token of the prompt, which is this one
03:37:53.500 | so suppose that the prompt is two tokens, for example, I love and
03:37:56.940 | We want to generate the word pepperoni and pizza, which should be the first output token and the second output token you can see here
03:38:05.180 | Why are we not applying any causal mask to the tokens of the textual prompt?
03:38:14.780 | Because the textual prompt is usually very short
03:38:17.420 | And we want and it usually describes what is the task that we want the vision language model to perform
03:38:23.660 | and it's a choice that the PaliGemma
03:38:26.720 | authors made, which is
03:38:29.420 | because usually this
03:38:31.180 | This prompt represents the task that we want the language model to perform
03:38:34.380 | We want all the tokens that will be generated to watch all of the
03:38:38.940 | tokens in the prompt
03:38:41.480 | Moreover, we want each token in the prompt to watch even future tokens of the prompt itself
03:38:47.400 | So you can think of this
03:38:50.600 | As the query this one as the keys
03:38:55.160 | When we will do prefilling what we will have is the following so we will have
03:39:00.360 | The prompts let's use a different color. So we will have all the tokens of the prompt which are the
03:39:06.440 | Textual prompt which is the textual prompt that we will send to the model
03:39:11.080 | plus the image
03:39:12.760 | tokens
03:39:14.280 | And we do not need to generate any mask here because each
03:39:18.840 | Text prompt can watch even future tokens of the text prompt because you can see that this is the keys
03:39:26.200 | This is the query number one of the text prompt and this is the key number one of the text prompt
03:39:32.360 | This is the key number two of the text prompt and as you can see the query number one of the text prompt
03:39:36.760 | So this beginning-of-sentence token can attend to the key number two of the text token
03:39:41.880 | It's a choice that the PaliGemma
03:39:45.000 | Authors made, so they said okay, usually the prefix of the
03:39:50.040 | Because we are not generating this prefix, which is the prompt that we send to the model telling what the model needs to do with the image
03:39:57.960 | We do not need to add any causality because we do not
03:40:02.840 | Need the model to be causal with respect to this prefix because we are not going to generate it
03:40:07.960 | however, the only thing that we are going to generate is this thing called suffix which are the
03:40:13.560 | Output tokens predicted by the model using the prompt textual prompt and the image
03:40:18.600 | And this needs to be causal
03:40:20.920 | So the first token output by the model needs to attend all the previous keys, which are the image token
03:40:27.480 | So these three image tokens plus the four tokens of the text prompt
03:40:32.760 | Then the next token predicted by the model should be able to access again all the image tokens
03:40:39.000 | So the first three tokens then the four tokens of the textual prompt plus the last generated
03:40:45.080 | token
03:40:46.760 | By the model then when we generated the next next token, it will need to access
03:40:53.560 | First three image tokens then the next four text tokens of the prompt
03:40:58.280 | And the two tokens predicted by the model before so it is causal only in the generated text not in the prefix part
03:41:07.240 | Which is different than normal language models in normal language models when we prefill even the
03:41:13.080 | When we prefill the
03:41:18.120 | the prompt the prompt
03:41:20.840 | Itself is prefilled using the causal mask because the the prompt is just
03:41:25.160 | A part of what the model would generate if it would start with only the first token
03:41:30.440 | But this is not the case in PaliGemma. It's a choice that the PaliGemma team made
03:41:35.240 | So it's not like the language model has to work in this way or there is any advantage or disadvantage
03:41:40.700 | The only advantage if we want to say is that the information about the prompt
03:41:45.880 | Is replicated in each of these tokens because each of these tokens basically
03:41:50.440 | Includes information also about future tokens that are part of the prompt and this happened when they train the model
03:41:56.120 | so when you train the model also you don't mask out the
03:41:59.320 | The future tokens inside of the
03:42:03.320 | Textual prompt you only mask out what you expect the model to generate
03:42:09.340 | Using the image token and the textual prompt. So to rehearse
03:42:15.160 | Let's go back to this image. What is the text prompt? So when we inference a language model we provide a
03:42:20.920 | Visual text visual language model. We provide an image as condition and then we provide some
03:42:27.080 | Text prompt which is a description of what we want the language model to do with this image
03:42:32.280 | For example tell us where is the photographer in this picture?
03:42:34.920 | And then the model will generate some tokens as outputs telling us where the photographer in this case is
03:42:41.880 | and what we do when we train this language model is that
03:42:45.080 | Let's go back here
03:42:47.800 | We do not mask the tokens of the textual prompt
03:42:51.560 | So when we ask the language model what to do with this image
03:42:54.360 | We do not mask out during training and also during inference, of course because the model needs to work in the same way
03:42:59.560 | But we mask out only what we expect the model to generate
03:43:03.800 | So the causality is only in the generated tokens and it's a choice that you make with the language model
03:43:09.000 | It's not necessarily it has to work with this way because normal language models
03:43:13.640 | They actually mask out all the tokens
03:43:15.720 | There is no like not masking out of the prompt because usually the prompt itself
03:43:20.120 | You can consider it as something generated by the model, even if it's not
03:43:23.480 | So this is more of a philosophical question than a technical one
03:43:28.200 | But the reason is that it's a choice made by the PaliGemma authors. Also, in visual language models
03:43:32.920 | Especially ones like PaliGemma, the task, so the textual prompt, is usually very short
03:43:37.800 | It tells the model what to do with the image that it's being fed
03:43:40.760 | so for example localize where is the cat in this image or
03:43:43.480 | Extract all the numbers or tell me where is the photographer in this image, etc, etc
03:43:50.200 | And also the usually the generated output of the model is very short
03:43:53.960 | So, at least, models like PaliGemma are not used for generating very long
03:43:59.320 | Content but they can be of course fine-tuned to do it
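As an aside, here is a hedged sketch of what such a training-time prefix-LM mask could look like: the prefix (image tokens plus textual prompt) attends bidirectionally, and only the generated suffix is causal. This is not part of the inference code in this video; the function name and layout are illustrative:

```python
import torch

def prefix_lm_mask(prefix_len: int, suffix_len: int) -> torch.Tensor:
    """Mask added to the attention scores before the softmax (0 = allowed, -inf = masked)."""
    total = prefix_len + suffix_len
    mask = torch.full((total, total), float("-inf"))
    # Every query can attend to the whole prefix (image tokens + textual prompt).
    mask[:, :prefix_len] = 0.0
    # Suffix (generated) queries attend causally to the suffix tokens up to themselves.
    for q in range(prefix_len, total):
        mask[q, prefix_len : q + 1] = 0.0
    return mask

# Example: 3 prefix tokens (image + prompt) and 2 suffix tokens to be generated.
print(prefix_lm_mask(prefix_len=3, suffix_len=2))
```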
03:44:04.520 | So, let me delete this part. Otherwise it remains here forever
03:44:11.320 | All right, so now we have seen how we generate the
03:44:14.200 | The mask for the pre-filling. So, basically, for the pre-filling
03:44:18.360 | We do not mask out anything because we do not mask out the text prompt and we do not mask out the image prompt
03:44:24.520 | The interesting part is that when we generate the text, we generate one token at a time with the KV cache
03:44:32.280 | Which is this this else part here
03:44:35.160 | We also do not mask out anything. Why?
03:44:38.440 | Because, let's go back to the PaliGemma picture here. So here
03:44:43.640 | When you generate the first token, the first token needs to access all the image tokens and the text tokens
03:44:52.840 | So we don't need to mask out anything
03:44:52.840 | When we generate the next token as you can see it needs to access all the image tokens and all the text tokens
03:44:59.320 | Plus the last generated token here. So we do not need to mask out anything then again for the next next token
03:45:05.320 | We need to access all the previous tokens plus the two previously generated tokens
03:45:09.960 | So we do not need to mask out anything because we are generating one token at a time
03:45:13.800 | So it needs to access all the previous tokens plus the image tokens plus the textual prompt
03:45:18.920 | So we never need to mask out anything. So you may be wondering why are we never masking out anything?
03:45:25.000 | Because we are working with the KV cache, and with the KV cache
03:45:27.480 | We only generate one single row of this matrix at a time
03:45:32.040 | And as you can see
03:45:33.320 | We always generate the last row and the last row is always the last token that needs to access all the previous tokens
03:45:38.920 | So we never need to mask out anything. However during training
03:45:42.200 | when you train a model
03:45:44.920 | on something then you need to mask out because the model will generate all the
03:45:48.920 | Contextualized embedding in parallel and you want each contextualized embedding to only be contextualized on the previous token
03:45:54.600 | So you need to mask out. So during training we will have a causal mask, but during inference, which is our case
03:46:00.200 | We don't have any causal mask, at least when working with the KV cache and at least
03:46:04.040 | When working with models like PaliGemma. If you work with a normal language model, like LLaMA
03:46:09.880 | For example, when you do the pre-filling you actually need to mask out the pre-filling part
03:46:14.200 | But in the case of PaliGemma, because of the choices made by the PaliGemma team, we do not need to mask out anything
03:46:19.640 | And this is why we do not need to mask out anything
03:46:22.840 | So in the future we plan to make another video on how to fine-tune this model that we have made
03:46:27.720 | And we will see that we will need to introduce some kind of mask
03:46:31.080 | And the mask will have to be generated exactly like shown by the PaliGemma paper. So let me check if my cam is still working
03:46:39.080 | Sometimes I lose connection with my cam. So I need to check every once in a while. So
03:46:43.160 | We add then okay
03:46:46.920 | we have created this mask which is filled with zeros because
03:46:49.480 | We need to fill up minus infinities to all the positions where we want to mask out something
03:46:55.000 | But we never mask out anything. So we always make this tensor full of zeros
03:46:58.920 | when we are pre-filling we generate a sequence by sequence mask, but when we are
03:47:04.760 | Generating tokens, we only generate the last row of that matrix. So we have only one
03:47:11.080 | Query, so as you can see assert query is equal one
03:47:13.800 | So we only have one query and then we have how many keys we want which is how many keys there are in the KVCache
03:47:19.720 | We add the plus one to this KVCache because before using the KVCache we add this current token
03:47:25.480 | So the query token inside of the KVCache then we extract it before calculating the self-attention like we saw before
03:47:31.000 | As you know the KVCache when we do the attention computation, we have one attention computation for each head
03:47:37.720 | So we need to add the head dimension because there will be one attention matrix for each head
03:47:42.120 | And that's why we add this head dimension here
03:47:44.440 | Okay. Now we have generated the KVCache
03:47:47.240 | Let me check what else we need to do
03:47:50.200 | We need to generate the positions of the tokens that will be used by the rotary positional encodings
03:47:56.380 | So when we are working with the pre-filling part of the KVCache
03:48:01.240 | It means that we have n tokens that are part of the prompt of the user which are the image tokens plus the text tokens
03:48:07.720 | Then we need to generate enough positions to apply the rotary positional encodings. So, for the positional encodings,
03:48:13.480 | How many of them do we need? We need as many as there are tokens in the prompt
03:48:20.360 | Which is indicated also by the number of ones in the attention mask, which is generated by the PaliGemma processor code
03:48:26.520 | So when you generate the tokenized text
03:48:28.840 | It will give you the input IDs and another tensor of the same size as the input IDs with all ones
03:48:35.300 | Indicating that we do not mask out anything and if you count the number of ones it also gives you how many tokens there are
03:48:41.140 | In the input IDs, so that's what we are doing here
03:48:43.380 | We generate enough positions. So when we are doing the pre-filling suppose that the pre-filling is made up of 256 image tokens
03:48:52.660 | And then three tokens of the textual prompt. So this will generate basically 0, 1, 2, blah, blah, blah,
03:49:00.800 | 255, 256, 257, and 258
03:49:05.520 | A sequence like this. This sequence will be then used to understand which
03:49:09.920 | Rotary positional encoding we need to apply to each token
03:49:13.040 | when we are however doing the
03:49:15.760 | Token generation we only have one single query to which we need to apply the positional encoding
03:49:23.700 | And for that we only take the one token
03:49:27.040 | So this will generate only one single position, which is the position corresponding
03:49:31.680 | to the last token
03:49:37.360 | So when we do token generation basically we have some tokens that are already saved in the KV cache
03:49:41.840 | And then we have one new token, which is the last predicted token, which we use as a query
03:49:46.080 | To understand what is the position of this token
03:49:49.120 | We also pass the attention mask in the case of the attention mask
03:49:52.640 | It will indicate that it's all made up of ones. How many ones? Well, it is determined
03:49:57.200 | Based on how many tokens there are in the KV cache
03:50:00.000 | Plus one because we also have the new token that we need to add to the KV cache before doing the self attention
03:50:05.040 | So what we are doing here is the same. So we are counting how many ones there are in the KV cache
03:50:09.920 | Which is already plus one
03:50:12.000 | And then we take this last number
03:50:15.120 | And this is how we generate the position IDs
03:50:21.280 | And then we return this stuff here, so let me return this stuff
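To summarize the logic just walked through, here is an illustrative sketch. The names, shapes and the 0-indexed position convention are assumptions, not the exact code:

```python
import torch

def build_mask_and_positions(attention_mask: torch.Tensor, num_cached_items: int):
    batch_size, prompt_len = attention_mask.shape
    if num_cached_items == 0:
        # Pre-filling: one query row per prompt token, nothing masked out (all zeros).
        causal_mask = torch.zeros((batch_size, prompt_len, prompt_len))
        # One position per prompt token, used by the rotary positional encodings.
        position_ids = torch.arange(prompt_len).unsqueeze(0).expand(batch_size, -1)
    else:
        # Token generation: a single query attending to all cached keys plus itself.
        kv_len = num_cached_items + 1
        causal_mask = torch.zeros((batch_size, 1, kv_len))
        # The new token's position: number of tokens seen so far (ones in the mask) minus one.
        position_ids = attention_mask.sum(-1, keepdim=True) - 1
    # Add the head dimension: there is one attention matrix per head.
    return causal_mask.unsqueeze(1), position_ids
```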
03:50:26.720 | Okay, so we have implemented this method
03:50:29.840 | So what does this method do? This method basically takes as input the image features
03:50:34.800 | It takes as input the input IDs and the input embeddings
03:50:38.240 | What are the input embeddings? They are the embeddings of the image placeholder tokens, which we will not use
03:50:45.280 | And then the image features. Our goal is to put all the image features in the right places in these input embeddings, based on where
03:50:52.240 | these image embedding placeholder positions are
03:50:55.300 | And we did we do it here
03:50:59.200 | Here actually then we create the attention mask, which is basically just made up of zeros which
03:51:04.720 | Do not confuse the zeros in the attention mask
03:51:07.520 | We are creating here with what we are probably commonly used to see in the attention mask
03:51:11.920 | So let me show you actually this one also
03:51:15.120 | So usually you are probably used to seeing the attention mask as a bunch of ones and zeros, where the zero indicates which position
03:51:21.440 | Should be masked and the one indicates which position should not be masked
03:51:25.600 | These ones and zeros are actually then converted into a series of minus infinities and zeros before
03:51:33.200 | Being added to the attention matrix
03:51:36.000 | Instead of creating ones and zeros which are then converted into minus infinities and zeros,
03:51:41.200 | We are already creating the mask that can be directly added to the attention matrix
03:51:45.280 | So we are creating a bunch of zeros, which basically means that
03:51:48.480 | You add a bunch of zeros to this matrix
03:51:51.280 | So it's like you are not masking out anything
03:51:53.440 | If you want to mask out something then you need to add some minus infinities in this mask, but we never add any minus infinities
03:52:00.240 | So we are not masking out anything
03:52:02.240 | And this is our method that combines the image features with the text tokens
03:52:07.680 | Our next goal is to create the structure of the PaliGemma
03:52:11.220 | Actually, we can create this PaliGemmaMultiModalProjector. Yeah
03:52:15.280 | All right. So let's create this PaliGemmaMultiModalProjector. Let me put away this stuff here
03:52:21.840 | We just copy it. It's very simple. I just I don't even need to copy first the constructor and then
03:52:28.400 | So the PaliGemmaMultiModalProjector is just the linear layer that converts the size of the image features
03:52:34.620 | Extracted from the vision encoder into the same size of the embedding size that is used by the language model
03:52:41.900 | So it's just a linear layer that converts the hidden size of the vision model into the projection dimension, which is equal to the
03:52:49.900 | embedding size of the text
03:52:52.440 | text model here
03:52:55.420 | So this projection_dim is equal to, as you can see here, the hidden size
03:53:01.740 | That is then used by the language model
03:53:04.460 | So it's basically resizing the embeddings so that they can be concatenated with the text tokens
03:53:11.020 | Let's go back here. So as you can see, we are just applying this linear layer
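A minimal sketch of the projector just described (parameter names are assumed from the description):

```python
import torch.nn as nn

class PaliGemmaMultiModalProjector(nn.Module):
    def __init__(self, vision_hidden_size: int, projection_dim: int):
        super().__init__()
        # A single linear layer: vision hidden size -> text embedding size.
        self.linear = nn.Linear(vision_hidden_size, projection_dim)

    def forward(self, image_features):
        # [batch, num_patches, vision_hidden_size] -> [batch, num_patches, projection_dim]
        return self.linear(image_features)
```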
03:53:15.980 | Our next step is to code the language model itself. So the language model, the Gemma language model, is a transformer model
03:53:23.420 | So we will code a language model, a
03:53:25.900 | Transformer model. So we create this GemmaForCausalLM,
03:53:30.860 | Which takes the configuration of the Gemma model as input, and the GemmaModel, which we will create later
03:53:36.060 | Basically in HuggingFace, whenever you see "something for causal language modeling",
03:53:42.860 | It is a transformer model plus a language modeling head, which is the linear layer in the transformer that projects each embedding into
03:53:49.820 | logits
03:53:52.300 | So this GemmaModel is basically the transformer model, and then this GemmaForCausalLM is the GemmaModel plus a linear layer
03:53:59.820 | That's why we are reusing this instance plus a linear layer. So the forward method will be very simple
03:54:05.180 | We need to implement these two
03:54:08.620 | methods which are used for the
03:54:11.020 | Weight tying. So we saw before that weight tying basically means that we share the weights of the embedding
03:54:17.180 | Layer with the logits layer. So this is what we are doing
03:54:20.380 | So when we tie weights, we just copy from the embeddings to the language modeling head
03:54:25.420 | Which is the linear layer that converts the embedding into logits
03:54:29.920 | Then we have the forward method, which is also very simple because it will not do anything except
03:54:36.480 | Sending the stuff to the language model and then applying this
03:54:40.400 | Language modeling head, which is the linear layer to convert into logits
03:54:44.800 | So as you can see here
03:54:47.680 | We send the input directly
03:54:50.000 | So the attention mask, the position IDs, the input embeddings, the KV cache: we send them to this language model, which we will implement later
03:54:56.960 | The output of this language model will be a series of embeddings, but we do not want embeddings. We want logits. So
03:55:02.720 | This is what we do
03:55:04.880 | We take the outputs. We take the hidden states from these outputs, which are the series of embeddings
03:55:10.560 | We apply the language modeling head. So it's the linear layer. We make sure it's a floating point numbers
03:55:15.920 | we return whatever the
03:55:19.920 | Result is. So we return the logits, and if the user specified the KV cache, we also return the updated KV cache. That's it
03:55:27.680 | Because here there is no logic; the logic will be in GemmaModel
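A hedged sketch of what this wrapper looks like, assuming the config fields and the GemmaModel class described next (the return format is illustrative):

```python
import torch.nn as nn

class GemmaForCausalLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.model = GemmaModel(config)  # transformer body, coded next
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def tie_weights(self):
        # Weight tying: the logits layer reuses the embedding matrix.
        self.lm_head.weight = self.model.embed_tokens.weight

    def forward(self, attention_mask, position_ids, inputs_embeds, kv_cache=None):
        hidden_states = self.model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            kv_cache=kv_cache,
        )
        logits = self.lm_head(hidden_states).float()
        return {"logits": logits, "kv_cache": kv_cache} if kv_cache is not None else {"logits": logits}
```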
03:55:32.320 | Yeah, so let's go to implement the GemmaModel, all right
03:55:37.120 | So what is a language model a language model is an embedding layer plus a series of transformer layers
03:55:44.000 | And then we have the language modeling head. The language modeling head is already implemented here in GemmaForCausalLM
03:55:50.160 | So we just need to create the other part which is the embedding layer and the list of transformer layers
03:55:55.440 | Let's do that. So we create first the constructor. So this
03:55:59.680 | GemmaModel,
03:56:02.000 | which takes the configuration some
03:56:04.000 | Information that it needs, like the vocabulary size. Why do we need the vocabulary size? Because we need to create the embeddings: how many embeddings
03:56:10.880 | we have
03:56:12.480 | Depending on the number of tokens in our vocabulary each embedding vector will be of size a hidden size
03:56:19.600 | This indicates the position of the padding token inside of the vocabulary
03:56:23.060 | And basically, I think the embedding layer takes it as input so that it does not update the gradient for this token here
03:56:30.000 | And then we have a list of layers
03:56:32.880 | for the
03:56:34.960 | For our transformer
03:56:37.440 | These here are called Gemma decoder layers. So they are the transformer layers. How many of them do we have? It depends
03:56:45.440 | On this parameter num_hidden_layers. And then we have a final normalization, which is an RMS normalization, which I will describe later
03:56:52.880 | What is it and why it's different from a layer normalization?
03:56:56.020 | We need to implement this method here get_input_embeddings, which is used by the language modeling head. So as you can see we use it
03:57:07.760 | We use it here to extract the initial embeddings
03:57:10.960 | From the language model which are then combined with the image features we saw before here and then send to the language model
03:57:16.800 | So the language model here is receiving not the input IDs, but it's receiving the embeddings already
03:57:21.840 | So the image embeddings plus the text embeddings
03:57:24.420 | Which is the same embeddings that we will receive here in the forward method of gamma model
03:57:30.000 | Now, let's make the forward method
03:57:33.280 | Which is also very simple because we do not implement much logic here
03:57:39.600 | So we receive the attention_mask, the position_ids, which are the position that we will apply for each token
03:57:45.200 | How to apply the positional encoding to each token
03:57:48.800 | We didn't talk about the positional encoding yet because we apply the rotary positional encoding in this case, which are applied
03:57:55.120 | During the calculation of the attention
03:57:57.200 | So they are not applied at the beginning like we saw before with SigLip or with the vanilla transformer
03:58:02.320 | But they are applied just before calculating the attention
03:58:06.160 | We have the input embeddings which we saw before are the image features plus the text tokens
03:58:11.520 | And in case we have the KB cache also the instance of the KB cache, which we didn't implement yet
03:58:16.320 | But we already know how it works
03:58:20.960 | Let's do it. So the first thing that it does it is
03:58:24.560 | Taking and applying some kind of normalization, for the same reason we apply
03:58:31.020 | Normalization also to the input of the image features:
03:58:33.500 | We want the magnitude of the numbers to remain the same even if the number of dimensions increases
03:58:38.560 | then this language model is made up of a series of layers of
03:58:43.660 | Transformer layers. So what we do is the output of one layer becomes the input of the next one
03:58:49.340 | And that's what we are going to do here
03:58:51.340 | Oops, I've copied it
03:58:56.860 | So we take the decoder layer we send it the first hidden state which is the input of this forward after it's been normalized
03:59:04.160 | We send the attention mask. We send the positional encodings the KB cache and it will return something which is
03:59:10.380 | Contextualized embeddings which become the input of the next layer
03:59:14.860 | So we replace basically these hidden states with the output of the first layer so that it becomes the input of the next layer
03:59:20.940 | And we do it for all the layers
03:59:24.380 | The output of the last layer we send it to a normalization
03:59:28.240 | Layer which is the rms normalization, which we didn't see yet, but we will talk shortly
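A hedged sketch of this body, assuming the config fields and the GemmaDecoderLayer and GemmaRMSNorm classes coded later (signatures are assumptions):

```python
import torch.nn as nn

class GemmaModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, config.pad_token_id)
        self.layers = nn.ModuleList(
            [GemmaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def get_input_embeddings(self):
        return self.embed_tokens

    def forward(self, attention_mask, position_ids, inputs_embeds, kv_cache=None):
        # Scale the embeddings so their magnitude stays comparable as hidden_size grows.
        hidden_states = inputs_embeds * (inputs_embeds.shape[-1] ** 0.5)
        # The output of each decoder layer becomes the input of the next one.
        for layer in self.layers:
            hidden_states = layer(hidden_states, attention_mask, position_ids, kv_cache)
        # Final RMS normalization before the language modeling head.
        return self.norm(hidden_states)
```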
03:59:35.740 | So I want to actually redraw what we are doing so far. So we have arrived
03:59:41.580 | So for that, let's go back to the ipad
03:59:45.840 | All right, so
03:59:53.340 | What we are doing basically is this so we have created the
03:59:56.620 | Embeddings before we have merged them with the image tokens and the text tokens
04:00:01.420 | We did not apply any positional encodings because we are doing the rotary positional encodings
04:00:06.800 | Which are applied exactly when we calculate the attention
04:00:10.560 | So if we were to draw the Gemma architecture, it would be like this. So we have the
04:00:15.820 | embeddings
04:00:22.620 | Then, I remember, there is some kind of normalization
04:00:25.040 | Going on, but it's not a normalization layer. It's just that we are normalizing the embeddings
04:00:31.420 | So it's not a layer actually, so we do not have to draw it
04:00:34.300 | Then we have a series of layers and we have n of them
04:00:37.900 | Each of these layers is made up of a normalization
04:00:42.160 | RMS normalization
04:00:46.700 | Then we have self-attention
04:00:48.700 | So attention
04:00:51.420 | then we have a
04:00:53.340 | Plus so a skip connection here
04:00:55.660 | Uh, I think I made it too small. So let's make it a bit bigger
04:01:00.300 | This layer
04:01:03.180 | Then we take the output of this one and send it to another normalization, which is again an RMS normalization
04:01:09.280 | Then we send it to a feed forward network
04:01:12.640 | The output of this one is sent again to another
04:01:20.460 | Skip connection
04:01:22.060 | Then the output of the last layer will be sent to again another normalization, which is the rms normalization
04:01:28.640 | Then we send it to a linear layer for the logits
04:01:32.480 | Linear and let me shift it down and then we have the softmax so so far
04:01:43.820 | So far what we have made is basically we are now creating this structure here, but without coding the single block
04:01:50.540 | So we are just creating this
04:01:52.620 | Forward method that will run the output of the embeddings to each of this layer one after another and will apply the final normalization
04:02:00.800 | Rms normalization, which is this stuff here
04:02:04.380 | And then it will be sent to the linear layer, when it will be sent to this linear layer
04:02:09.900 | In GemmaForCausalLM, because as you can see GemmaForCausalLM will take the output of this
04:02:14.860 | Model, what is this model?
04:02:16.940 | It's everything
04:02:18.540 | Except the linear layer and then we'll apply this linear layer called the language modeling head which will convert it into logits
04:02:25.420 | And after we will apply the softmax, but that is for sampling
04:02:29.020 | So now we need to create this decoder layer. So what is this decoder layer?
04:02:32.940 | This decoder layer is this stuff here. We need to code the normalization. We need to code the attention mechanism
04:02:38.940 | We need to code the field forward network and of course all the skip connections. So let's do it
04:02:43.580 | All right. The first thing that we can implement actually very easily is the rms normalization. So let's explore it
04:02:50.060 | So I have a slide ready there for that
04:02:52.220 | So as we saw before with layer normalization
04:02:54.620 | What we are doing is that we are normalizing each value using some statistic collected from the value from each item itself in the batch
04:03:03.100 | So each item in the batch suppose
04:03:05.500 | It's a batch of pictures and the first picture is that of the cat in the layer normalization
04:03:09.740 | What we are doing is for each dimension of this vector
04:03:12.620 | We calculate a statistic using this vector which is the mean and the standard deviation
04:03:18.480 | And then we normalize each value in this vector using these two statistics. How do we normalize? Well, we recenter it around zero.
04:03:27.260 | Here it's not written, but I can show you the formula here
04:03:30.300 | You basically subtract the mean that you calculated and you divide it by the standard deviation
04:03:35.760 | And the layer normalization actually works fine
04:03:39.980 | But recently in most language models, we are seeing another kind of normalization that is known as root mean square normalization
04:03:47.120 | Basically what we do with this normalization is that each of these features in this
04:03:55.900 | Each item of the batch
04:03:57.740 | We are normalizing it in such a way that it becomes like it's coming out from a distribution
04:04:03.120 | Gaussian distribution with a center of zero and a variance of one
04:04:08.620 | What they claim in the root mean square normalization paper is that they say
04:04:14.300 | that the success of the
04:04:17.260 | Layer normalization is not because of its recentering invariance, but because of its rescaling invariance
04:04:25.340 | which means that
04:04:26.860 | To actually reduce this internal covariate shift, which is the reason we use normalization
04:04:32.240 | The model does not need to see the values
04:04:36.700 | Centered around zero. It just needs to see the values mostly surrounded around whatever mean they are centered upon
04:04:45.420 | So the values of this cat, for example, they do not need to be all around zero
04:04:51.900 | They could be all around 500 or all around minus 100 as long as they are more or less around
04:04:58.300 | 500 or more or less around minus 100 all of them
04:05:02.060 | That's the meaning of reducing the variance to one
04:05:06.140 | So we want most of the values to be around whatever mean it is
04:05:09.900 | And this is a hypothesis
04:05:12.700 | Made by this paper and it's actually verified because most language models right now
04:05:18.060 | They do not suffer from the internal covariate shift, because they can be trained successfully very fast just like the layer normalization ones
04:05:25.420 | But by using this root mean square normalization, why it is advantageous
04:05:31.660 | Instead of layer normalization because instead of computing two statistics for the mean and the variance
04:05:39.100 | We only need to compute one statistic, which is this root mean square statistic
04:05:44.380 | Why we do not compute just the standard deviation like we do with the layer normalization because to compute the standard deviation
04:05:51.180 | You need to have the mean
04:05:53.260 | But we do not want to compute the mean because we do not want to recenter them
04:05:58.620 | So we do and because we don't compute the mean we cannot compute the
04:06:02.940 | the standard deviation
04:06:05.900 | So we replace this standard deviation with another statistic that allow us to
04:06:11.100 | Reduce the variance, which is this root mean square statistic
04:06:14.640 | Which is calculated as follows. So we take each item in this vector
04:06:19.660 | So this item, this item, this item, this item, this item, this item
04:06:22.540 | We take the square of each of these items. We sum them up all together. We calculate
04:06:29.120 | The mean of this summation, so we divide by n basically,
04:06:33.020 | Then the square root, and this gives us the root mean
04:06:38.540 | Square statistic for this item. Then we take each of these items and we divide it by this statistic
04:06:44.380 | Multiplied by a learnable parameter called gamma, which is one for each feature
04:06:50.380 | So basically with root mean square normalization, we are obtaining the same
04:06:56.620 | covariate, internal covariate shift
04:06:59.580 | I mean, it solves the same problem of the internal covariate shift as layer normalization, but by computing one less statistic
04:07:07.340 | So we compute less statistics. So it is faster basically
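Written as a formula, the statistic just described is the following (the eps term anticipates the implementation below; gamma is the learnable per-feature parameter):

```latex
\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2} + \varepsilon},
\qquad
\bar{x}_i = \frac{x_i}{\mathrm{RMS}(x)} \, \gamma_i
```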
04:07:12.940 | Okay. Yeah, so let's implement it. Let me put away this stuff
04:07:17.740 | All right, so now we copy this class we put it here
04:07:24.940 | Then we later we explain it it's very simple
04:07:30.220 | Uh, let me copy all the forward method
04:07:35.180 | It's very simple. Okay. So what we are doing with rms normalization is that okay
04:07:39.740 | we are creating a weight parameter, which is a
04:07:41.820 | number of parameters, one for each feature in the vector to which we apply this root mean square normalization. How many
04:07:48.700 | Dimensions will this vector have? Well, the same as the tokens, because we are going to normalize tokens
04:07:55.820 | So this dim will be the hidden dimension of our language model
04:08:00.300 | We compute this root mean square statistic as follows. So we calculate the power of two of each item
04:08:06.060 | We compute the mean of this
04:08:08.220 | Power of two. So what we are calculating here is basically this term here. So let me
04:08:14.060 | Show you this term here
04:08:16.700 | Then we do one over the square root of this, which is this rsqrt,
04:08:21.340 | but actually we are not doing just the square root, we are actually calculating
04:08:25.500 | One over the square root of whatever is the argument of the rsqrt. So, this stuff here
04:08:31.260 | And instead of dividing each item, we are multiplying by one over the square root, which is exactly like dividing
04:08:39.340 | by the square root
04:08:41.820 | Why do we have this term here, plus self.eps, in the argument of the square root?
04:08:51.900 | Well, because this r sqrt is one over the square root of
04:08:56.380 | Whatever is inside
04:08:58.780 | But if the computation of this statistic produces a number that is very close to zero in this division
04:09:05.500 | We are basically dividing by zero, which will make the output of this division, this number here, very big. So,
04:09:12.780 | To avoid this division by zero, we add to the denominator of this division a very small number called eps
04:09:22.080 | As you can see, it's a very small number to avoid this division by zero
04:09:25.200 | And it's the same parameter that we also pass in the layer normalization as you can see here
04:09:29.600 | We pass this parameter, which is a very small number to avoid this division by zero
04:09:33.280 | So the forward method is basically just doing this normalization and then we multiply each of this number by this gamma parameter
04:09:41.120 | Which is a learnable parameter as you can see
04:09:43.920 | Here, so we have here we have this gamma parameter
04:09:49.840 | And then we return it
04:09:51.840 | That's it. This is normalization
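A compact sketch of the class just walked through. The "1 + weight" parameterization is the Gemma-style detail; treat exact signatures as assumptions:

```python
import torch
import torch.nn as nn

class GemmaRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # One learnable gamma parameter per feature of the token embedding.
        self.weight = nn.Parameter(torch.zeros(dim))

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), computed as x * rsqrt(...) instead of a division.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        output = self._norm(x.float())
        # Gemma multiplies by (1 + weight), so a zero-initialized weight acts as the identity.
        return (output * (1.0 + self.weight.float())).type_as(x)
```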
04:09:53.840 | Now we can move to the next part, which is the coding of this decoder layers
04:09:59.440 | All right
04:10:01.920 | Let me check GemmaModel so we can create the decoder layer. So let's copy some code
04:10:13.440 | All right, so the decoder layer as we saw before it's this stuff here
04:10:17.680 | So we need to create something that manages all these blocks here
04:10:22.400 | So something that takes as input a list of embeddings, applies a normalization, then applies a transformer,
04:10:27.940 | Attention, sorry,
04:10:29.600 | Then it applies a skip connection
04:10:31.200 | Then the output is sent to another normalization then to a feedforward layer block then again another skip connection then produces some output
04:10:38.240 | So we will just create this simple block, which has the same structure as the encoder layer that
04:10:44.080 | We have created in SigLip. So it's the equivalent of
04:10:46.560 | This block here, the encoder layer. It will be doing the same job
04:10:50.880 | So, let's do it
04:10:54.640 | So what we are doing is we are saving some stuff
04:10:57.520 | So the hidden size of the model then we are creating the attention
04:11:00.800 | Block, which we will code later the multi-layer perceptron, which is the feedforward network block
04:11:06.240 | The first normalization and the second normalization because in the decoder block we have two normalizations
04:11:10.900 | So as you can see here, we have one normalization here and one here
04:11:14.640 | So the forward method is very similar to the one we have coded for SigLip
04:11:20.800 | so we take some hidden states, which is the
04:11:23.440 | Input to this layer the attention mask, which will be sent to the attention mechanism the position
04:11:28.800 | Ids which also will be sent to the attention mechanism because we are using the rotary positional encodings
04:11:33.920 | And the kb cache which also will be sent to the attention mechanism
04:11:36.660 | So let's actually let me just copy it and then I explain it because it's the same as the encoder
04:11:42.960 | So we take the input we apply the first normalization to this input which is
04:11:47.680 | This stuff here this normalization
04:11:50.720 | Then we send the output of the normalization
04:11:53.460 | This hidden state we send it to the self-attention block along with the attention mask the positional encodings and the kb cache
04:12:00.320 | And this will produce an output which will be then summed up with the skip connection here, which is this stuff here
04:12:06.080 | So we take the output which is hidden states plus this residual which we saved before to create the skip connection
04:12:11.840 | then we create another skip connection and we send the output of the
04:12:16.960 | of the
04:12:19.660 | Self-attention to the second normalization, which is this stuff here this normalization
04:12:25.060 | The output of the normalization is sent to the multi-layer perceptron, which is this one here
04:12:30.880 | And then we take the output of the multi-layer perceptron
04:12:33.600 | Which is the feed forward network plus the skip connection that we saved before which is this residual stuff here
04:12:38.960 | And that's this plus sign here and the output is then returned and this is the decoder layer
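A hedged sketch of this decoder layer, assuming the GemmaAttention and GemmaMLP classes coded next and the GemmaRMSNorm above (signatures are assumptions):

```python
import torch.nn as nn

class GemmaDecoderLayer(nn.Module):
    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.self_attn = GemmaAttention(config, layer_idx)
        self.mlp = GemmaMLP(config)
        self.input_layernorm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(self, hidden_states, attention_mask, position_ids, kv_cache=None):
        # Attention block with a skip connection.
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states, attention_mask, position_ids, kv_cache)
        hidden_states = residual + hidden_states

        # Feed-forward block with a skip connection.
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        return residual + hidden_states
```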
04:12:45.280 | Now we need to code the multi-layer perceptron and the self-attention
04:12:49.620 | Block, I believe the the faster stuff to do is the multi-layer perceptron. So let's do that first
04:12:58.160 | So let me go there
04:13:00.160 | It's also very similar to the multi-layer perceptron that we have already coded for the
04:13:05.520 | SigLip, but it's slightly different
04:13:08.240 | So the multi-layer perceptron here, which is also known as feed forward network, is basically, as we saw before in SigLip,
04:13:14.560 | Something with two linear layers that first expands the embedding
04:13:20.000 | Vector, applies some non-linearity and then reduces it back to the original size, and this is what is done here
04:13:27.520 | But in this case, we also have another linear layer called the gate projection
04:13:32.580 | Which is used by the activation function that this Gemma language model is using
04:13:37.600 | We saw that different language models have different activation functions, which is based mostly on heuristics on how they work
04:13:45.520 | So let's implement the forward method, which is very simple here and we will see why we need this gate projection
04:13:53.360 | I wrote the code so as to convert this very long line into a
04:14:00.000 | Series of steps, so that you can see each single step being done independently
04:14:04.980 | but let me describe it what we are doing here basically is
04:14:08.480 | First we are applying the gate projection to the input to this feed forward network, which is a list of embeddings as we saw before
04:14:17.920 | And the function that we are using is the GeLU function, which I believe is the same that we are using also for SigLip
04:14:23.680 | Let me check
04:14:26.560 | Uh, yeah the same function
04:14:30.000 | But we also have this gate projection here
04:14:33.600 | So basically it's adding some learnable parameters before sending it to this activation function
04:14:39.600 | We multiply the output of this activation function with the up projection
04:14:45.120 | The up projection is basically the one that takes the embedding size from the original embedding to the intermediate size
04:14:51.040 | So it's expanded size
04:14:53.120 | And then the result of this multiplication, which is a vector
04:14:57.920 | Which is a tensor of size batch size sequence length and the intermediate size is then reduced back to the original size by this
04:15:04.960 | Down projection because with the up projection you are expanding and the down projection you are putting it back to the original size
04:15:11.440 | So the down projection will take the intermediate size back into the hidden size and this is the multi-layer perceptron of gamma
04:15:17.200 | It's slightly different than the other one because we have this gate projection
04:15:21.360 | Which is additional parameters basically
04:15:23.360 | And it's the same kind of gate projection that we also have, if I remember correctly, in LLaMA, in which we have its gated activation function
04:15:29.520 | With its own gate projection. It's just parameters that are learnable before applying the non-linearity
04:15:35.220 | We also said that the non-linearity is chosen based on heuristic on how they work well in particular case
04:15:41.280 | But also on some properties that we want from them with respect to the gradient. So some
04:15:46.160 | Activation functions allow the gradient to flow for negative values. Some others don't allow it, etc, etc
04:15:52.640 | So it's all based on practical application. Someone trained tried using it so that it works better and then we start using it
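A minimal sketch of this gated feed-forward block as just described (config field names are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class GemmaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x):
        # Both branches expand [batch, seq, hidden] -> [batch, seq, intermediate];
        # the GeLU-activated gate multiplies the up projection element-wise,
        # then the result is projected back down to [batch, seq, hidden].
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(gate * self.up_proj(x))
```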
04:16:00.560 | Okay, now we also have the multi-layer perceptron now comes the biggest part
04:16:05.600 | And but not the hardest because we are already familiar with the attention mechanism
04:16:09.280 | So we need we need to code the attention mechanism which will comprise the self-attention the use of the KV cache
04:16:14.960 | The grouped query attention which is something new and the rotary positional encoding. So it will be a little bit of learning experience. So let's start
04:16:22.400 | All right. So let's start coding the next part, which is GemmaAttention. So we start by creating the class
04:16:30.320 | Let me copy it
04:16:33.120 | And I will do it slowly because this one has a lot of innovations
04:16:36.820 | So let's start by creating the constructor, which is our usual constructor
04:16:41.540 | It takes in the configuration of Gemma. We also take another parameter, which is the id of the layer
04:16:47.760 | so the position of the layer in the
04:16:50.000 | Transformer, because as you know Gemma is a decoder-
04:16:54.080 | Only model it's made up of many layers and each of these layers will have its own KV cache
04:17:02.480 | So to know which KV cache to use because there is one cache for each layer. We need to also pass the layer index
04:17:09.600 | To each layer so it knows where to put its key and values
04:17:15.120 | Then we save some parameters
04:17:18.080 | So the attention dropout which we will not use the hidden size is the size of the embedding vector of each token
04:17:24.560 | the number of attention heads for the
04:17:26.560 | queries
04:17:29.040 | The number of the head dimension which is how many
04:17:33.520 | Dimensions each head will work with
04:17:37.840 | In the multi-head attention
04:17:41.680 | Which is a part of the entire embedding of each token
04:17:52.320 | How many heads we have for the keys and values in the multi-head attention
04:17:52.320 | And this is different from those for the query because we are going to talk about grouped query attention
04:17:57.280 | So we can calculate how many groups we have in this grouped query attention, but later I will explain how it works
04:18:02.000 | The maximum positional embeddings which are how many positions we can encode in the positional encoding using the rotary positional encoding
04:18:10.400 | And what is the base frequency of the rotary positional encodings?
04:18:12.980 | Now we have some other stuff
04:18:16.640 | So first of all, we make sure that the hidden size is divisible by the number of heads because as you know
04:18:22.880 | Each head has to watch a part of the embedding of the entire token
04:18:26.560 | So it must be divisible by the number of heads
04:18:28.560 | Then we create our projections which are the wq wk and wv projections that we saw in the multi-head attention
04:18:36.960 | But in this case, we can see that we do not have hidden size as the output
04:18:44.480 | number of features,
04:18:47.200 | But the number of output features is calculated as the number of heads multiplied by the head dimension
04:18:52.320 | Now, why is this different from the multi-head attention that we have implemented for SigLip?
04:18:57.440 | So if we go to look at SigLip and we look at the attention,
04:19:01.840 | you can see that each of these Wq, Wk and Wv matrices is a
04:19:06.640 | Hidden size by hidden size here. It's called the embedding dimension, but okay, it's the same thing
04:19:11.200 | So it's the size of the entire token with the output features being also the same number of dimensions
04:19:17.620 | Here, however, it's slightly different. Why?
04:19:22.160 | If we look at what numHeads is: numHeads is the number of heads for the queries, and
04:19:26.960 | the number of heads for the queries in grouped query attention is
04:19:32.320 | bigger than the number of heads for the keys and values. Later
04:19:39.280 | We will see why, but for now, let's concentrate on the dimensions. So in this case this Wq matrix,
04:19:44.720 | So it's called q_proj, which is the Wq
04:19:48.800 | Matrix in the multi-head attention, has a number of output features. So suppose that the number of heads,
04:19:55.440 | So number of heads is equal to 8 and suppose that the hidden size is equal to 1024
04:20:02.740 | So the wq matrix will be a matrix that is
04:20:07.040 | 1024 by 8 multiplied by the head dimension. But what is the head dimension? The head dimension is how many
04:20:15.820 | Dimensions each head will watch, using the number of heads of the query as a reference
04:20:21.020 | So 1024 divided by 8 which is 128
04:20:25.280 | I think so. Yeah
04:20:28.300 | So it's 8 multiplied by 128. So actually the Wq matrix is 1024 by 1024
04:20:35.440 | What changes in grouped query attention is the Wk and Wv projection. Wk actually will be
04:20:43.480 | 1024 because that's the hidden size as input, and the output features will be the number of heads for the keys and values
04:20:51.400 | Which actually we can check here
04:20:54.040 | In the configuration we can see that the number of heads for the
04:20:58.440 | Queries is 8 and the number of heads for the key and values is only one
04:21:04.600 | So actually this is not the case of grouped query attention, it's multi-query attention. So
04:21:09.240 | Let's say okay. Suppose that we have only one head here. Also one multiplied by 128. So it's equal to
04:21:15.480 | 1024 by 128
04:21:18.820 | And the same size is also for wv because as you can see the expression in wv is the same
04:21:25.480 | it's the number of heads for the key value multiplied by the head dimension and then we have the output projection, which is a
04:21:33.640 | Hidden size by hidden size because the number of heads multiplied by the head dimension
04:21:37.480 | So it's actually number of heads is 8 which is always referencing the number of heads of the queries
04:21:43.320 | So this is 1024 by 1024
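To make the shapes concrete, here is a minimal sketch of these projections in PyTorch, assuming the illustrative numbers used above (hidden size 1024, 8 query heads, 1 key/value head); the names mirror the discussion, not necessarily the exact repository code.

```python
import torch.nn as nn

# Illustrative values from the discussion above (not necessarily the real config).
hidden_size = 1024
num_heads = 8                         # heads for the queries
num_key_value_heads = 1               # heads shared by keys and values (multi-query case)
head_dim = hidden_size // num_heads   # 128

q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)            # 1024 -> 1024
k_proj = nn.Linear(hidden_size, num_key_value_heads * head_dim, bias=False)  # 1024 -> 128
v_proj = nn.Linear(hidden_size, num_key_value_heads * head_dim, bias=False)  # 1024 -> 128
o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)            # 1024 -> 1024
```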
04:21:45.720 | So as you can see, the difference with grouped query attention is that we have fewer heads for the keys and values
04:21:51.160 | Which results in a smaller
04:21:53.400 | projection for the embedding of each token
04:21:57.320 | when it's used as keys and values. Let's see why, so let me open a new
04:22:04.120 | Page and let's switch to the ipad which is here
04:22:07.800 | Okay, when we do um
04:22:10.520 | Normal multi head attention what we have is that each token is divided into multiple groups of dimensions
04:22:17.400 | One dedicated to each head suppose that we have an initial token
04:22:21.800 | Let me use a pen and let's use a smaller size. So imagine that we have a token with 1024
04:22:32.260 | dimensions in total. If we divide that into eight heads
04:22:36.340 | We will have that each of the head will manage 128 dimensions of this token so one to
04:22:45.060 | 128 then the second head will manage
04:22:48.660 | 129 to 256
04:22:52.480 | Etc, etc until the last one which will be I don't know how to do the calculation. Let me check
04:22:58.020 | 896 I guess
04:23:01.900 | 896 up to 1024, right?
04:23:06.820 | 128 yeah should be correct. So this is the head number eight
04:23:13.940 | This is the head
04:23:18.420 | Two and this is the head one
04:23:20.900 | When we do the product query multiplied by the transpose of the keys each of the query is
04:23:29.360 | Multiplied so dot product with each of the keys, but only in the part
04:23:34.800 | Dedicated to each head because each head is working independently
04:23:38.500 | So suppose that this is our query. So this is our query. Let me write it with a different color. So
04:23:45.060 | this is our
04:23:47.540 | Query and then we have some key
04:23:49.860 | And this key also in the normal multi head attention. We have the same number of heads for the query and the keys
04:23:57.220 | So suppose that we have the same number of heads also here so we can copy this stuff, I guess
04:24:02.760 | Too hard to copy
04:24:08.160 | Okay copy paste
04:24:12.640 | So what will happen with the multi head the normal multi head attention is that each head will do the dot product of the first
04:24:22.200 | Head of the head number one. For example, we'll do the dot product of the first
04:24:27.500 | 128 dimensions of the query with each of the keys because you need to think that we don't have one key. We have multiple keys
04:24:35.160 | Because it's a matrix. The matrix is a sequence by sequence. So each head each query is attending to all the past keys
04:24:43.280 | So here we can write
04:24:47.800 | Key number one key number two
04:24:49.680 | So key number one key number two and key number three and this is the query number one and we do it for all the
04:24:54.640 | Queries so for each token each token will attend all the past tokens as keys
04:24:59.600 | At least in the language modeling
04:25:02.400 | So what will happen is that we are doing a dot product
04:25:06.320 | With the first head will do a dot product of the first
04:25:09.560 | 128 dimensions between the query and the key then again between this query and this key and then between
04:25:16.520 | This query and this key in parallel the head number two will do the same stuff
04:25:22.200 | so the head number two will take the next group of
04:25:25.560 | 128 dimensions or the dimensions from 129 to 256 and will do the dot product with the
04:25:32.800 | next group of
04:25:35.080 | 128 dimensions for each of the keys
04:25:37.380 | So it will do the dot product of this query with this key and then this query with this key
04:25:44.560 | And then this query with this key all in
04:25:48.160 | in parallel
04:25:51.080 | Each head is working in parallel
04:25:53.080 | Now, we do this for all the heads
04:25:58.560 | The problem with multi-head attention was described in the multi-query attention paper
04:26:06.720 | So if you want I can give you the reference to the paper. It's called
04:26:10.400 | multi query paper
04:26:14.200 | Multi-query
04:26:15.640 | attention paper
04:26:17.640 | And it's this one here in this paper
04:26:20.840 | Basically, Noam Shazeer described what the problem with multi-head attention is, at least from a computational point of view
04:26:27.840 | He claims that with multi head attention
04:26:31.320 | The bottleneck of the computation is not the number of computations that we are doing
04:26:37.760 | but rather the amount of
04:26:40.480 | data transfer that is happening in the GPU because of this multi-head attention, and for that we need to talk about
04:26:46.800 | How the GPUs work so in a GPU what we have
04:26:52.600 | Is this a GPU has a very big memory called the high bandwidth memory
04:26:59.400 | Which is in the order of gigabytes or tens of gigabytes. I think the
04:27:04.880 | A100 goes up to 80 gigabytes. Then we have some smaller memory called local memory. So local
04:27:12.080 | memory
04:27:14.360 | And this one is in the order of megabytes, I don't know if it's ten megabytes
04:27:19.040 | I think in the tens of megabytes, so it's about
04:27:24.920 | three orders of magnitude smaller, and
04:27:30.040 | Then we have the cores
04:27:32.080 | The cores are many and they all work in parallel all of these cores
04:27:36.600 | So when you do a matrix multiplication, what happens is this
04:27:40.120 | You have the matrix that you are trying to multiply in the high bandwidth memory
04:27:44.820 | The kernel that manages this matrix multiplication
04:27:50.120 | Which is a CUDA kernel in case you are using an Nvidia
04:27:53.240 | GPU will copy for example the first part of the matrix from the high bandwidth memory to the local memory and
04:28:00.920 | Each core will work with a part of this big matrix to compute this matrix multiplication in parallel
04:28:09.040 | So each one will be working with a smaller part of this matrix to calculate this part in parallel
04:28:14.680 | it's much easier to visualize with the summation because for example if you are summing two matrices like this matrix and this matrix and
04:28:21.880 | You get this matrix as output. What happens if you divide it into four parts is that
04:28:28.920 | The result of this part of the matrix only depends on these numbers and these numbers
04:28:33.620 | So the first core can work with these two parts, the second core
04:28:37.960 | can work with these two parts
04:28:41.480 | sum them up to produce this one the third core can work with these two parts and
04:28:47.840 | Resulting in this part and then the last core can work on this part which will result in this part of the matrix
04:28:54.800 | So as you can see the matrix summation can be done in parallel by multiple cores, each working with a part of the matrix
04:29:01.120 | What happens when we do multi head attention is that?
04:29:08.960 | Suppose that, because the heads are working in parallel
04:29:12.940 | the first head will copy the first
04:29:16.560 | 128 dimensions of the query to the local memory of the GPU which will then be
04:29:23.720 | Accessed by the cores to compute these dot products
04:29:26.880 | Meanwhile the second head at the same time needs to copy the second
04:29:33.100 | 128 dimensions of each token to the local memory and
04:29:38.680 | Then needs to also copy for each query the second
04:29:42.640 | 128 dimensions from the high bandwidth memory to the local memory so that the cores can work with it
04:29:49.880 | Now what happens in the multi query attention paper. So this paper here what they say is that
04:29:55.680 | The bottleneck of the computation of the attention is not in how many dot products we are doing
04:30:02.960 | But how much time it takes to copy the memory from the high
04:30:08.200 | bandwidth memory to the local memory so that the cores can work with it
04:30:12.160 | Why? Because in the GPU we have a lot of cores that are very fast at computing
04:30:18.240 | But the GPU is not so fast at copying stuff around, so the memory copying is very slow compared to how many
04:30:25.040 | Computations it can perform. For example, let's open the
04:30:28.760 | A100 GPU data sheet
04:30:31.680 | It's here you can see that the A100 has okay 80 gigabyte of memory in the high bandwidth memory
04:30:40.840 | And it can do this kind of
04:30:46.160 | Teraflops operations per second if you are working with the 32-bit
04:30:50.080 | But as you can see the GPU memory bandwidth is much slower than the number of operations it can do
04:30:57.060 | Because teraflops, floating-point operations per second, means
04:31:01.320 | thousands of
04:31:03.640 | billions of
04:31:05.640 | operations per second, so it means thousands of giga-operations per second, while here we have only
04:31:12.840 | 2,000 gigabytes per second of memory
04:31:15.880 | transfer speed
04:31:18.640 | So basically, in a lot of computations that we do in the GPU
04:31:22.320 | The bottleneck is not how much compute we are using but how much data transfer is happening for this compute and as a matter of fact
04:31:29.560 | Flash attention basically exploits this difference in computation and memory transfer
04:31:36.760 | To reduce the memory transfer and redo computations, because it's faster to redo some computations twice than to copy
04:31:44.560 | stuff around in the GPU
04:31:49.480 | For the computation. So basically what we do is we are willing to sacrifice computation
04:31:55.160 | To reduce the data transfer. This is what we do with flash attention
04:31:59.400 | This is also one of the reason we use the gradient checkpointing
04:32:02.800 | So gradient checkpointing basically means that during the backward pass we redo some
04:32:06.720 | computations instead of saving them because if we save them then we need to recopy them from the high bandwidth memory to the local
04:32:12.380 | memory, so it's faster to redo them instead of copying the already computed ones
04:32:17.520 | To speed up the computation
04:32:21.180 | So the wall-clock time, which means the total time to compute the attention, is
04:32:26.080 | actually bottlenecked not by the number of dot products that we are doing, but by how much data transfer happens
04:32:31.800 | So how to reduce the data transfer that we do when we do the multi head attention
04:32:36.560 | One way is to use less heads for the keys
04:32:41.280 | so what will happen is that the first head imagine we only use one head for the
04:32:47.800 | keys instead of
04:32:50.240 | Having multi head also for the keys and values. So we don't have this part anymore
04:32:54.400 | we only have many heads for the
04:33:00.480 | Let's see
04:33:02.480 | we only have multiple heads for the queries
04:33:06.840 | So we don't have multiple heads for the keys, or we have fewer heads for the keys
04:33:11.000 | Imagine that we are in the extreme case in which we only have one head for the key and value
04:33:16.080 | But we have multi head for the query. What will happen is that the first core will copy the first
04:33:21.080 | 128 dimensions for the queries from the high bandwidth memory to the local memory and also the
04:33:28.440 | 128 dimensions for each token for the keys
04:33:31.720 | It will perform the computation now. Meanwhile, the also the second head needs to do its computation. So in parallel
04:33:39.200 | So, how can it do it needs to copy the 128 dimensions for the query?
04:33:43.640 | but it does not need to copy then the
04:33:47.440 | next group of 128
04:33:50.740 | dimensions for each of the keys, because it can reuse the ones already copied for the keys
04:33:57.440 | so each group of
04:33:59.440 | heads of the queries is sharing some heads for the keys, so that they don't need to copy
04:34:06.080 | these dimensions again for different heads, but they can share the already copied ones
04:34:12.480 | So this is the extreme case of having only one head for the keys, but we can have a group of heads
04:34:19.440 | So we can do for example that
04:34:22.560 | Instead of we have eight heads for the query and then we have four heads for the keys
04:34:27.740 | so heads number one and two of the query, for example, will share this head here, and
04:34:33.720 | then heads number three and four will share this head here
04:34:38.720 | So heads number one and two of the query share this head here, so that the total amount of transfer for the keys
04:34:44.720 | is only this part here and
04:34:47.080 | Then the head number let's add here add number three and the head number four will share a different
04:34:53.720 | Head of the keys, but it's shared as you can see every two head. We are sharing one head of the keys
04:35:00.880 | So these two head will not need to copy
04:35:03.520 | 128 dimensions each but
04:35:07.080 | 128 dimensions in total for both of these heads
04:35:10.440 | This reduces data transfer which speeds up the computation of the attention
04:35:15.120 | And this is the reason we have here in the computation of the attention the projection for the WK and WV
04:35:23.000 | has fewer parameters, because we are trying to compress these
04:35:27.120 | tokens into smaller
04:35:30.240 | tokens
04:35:32.720 | with a size equal to the number of heads that we need for this projection multiplied by the head dimension
04:35:37.360 | So for the keys, for example, if we have only two heads for the keys
04:35:41.920 | we will compress these tokens into
04:35:44.420 | 256 dimensions so that
04:35:47.520 | every
04:35:49.960 | Four heads of the query will have one head for the key
04:35:54.300 | Imagine we have four heads for the keys and values then we will have this one will be four
04:35:59.000 | So what will happen is that every two heads of the query will be using one and this one will become 512
04:36:06.640 | Every two head of the query will share one head of the keys. So the total data transfer is reduced
04:36:13.240 | So we speed up the computation of the attention
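As a small illustration of this sharing ratio (the numbers below are the hypothetical ones used in this example, not a real configuration):

```python
# Hypothetical numbers matching the example above.
num_heads = 8              # query heads
num_key_value_heads = 4    # key/value heads
head_dim = 128

# Every (num_heads // num_key_value_heads) query heads share one key/value head.
num_key_value_groups = num_heads // num_key_value_heads   # 2
kv_projection_dim = num_key_value_heads * head_dim        # 512 output features for k_proj / v_proj
print(num_key_value_groups, kv_projection_dim)            # 2 512
```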
04:36:15.920 | Of course, you may be wondering but this should also reduce the quality of the model because we have less parameters
04:36:22.120 | We have less expressive power for the keys and values and it's true
04:36:26.040 | So if you look at the paper, they say that in the multi query attention
04:36:30.300 | It reduces the quality of the model, but not much so it's something that we can afford to lose
04:36:36.360 | and the group query attention is basically a
04:36:39.080 | Let's check group query attention paper, which is this one
04:36:44.560 | So in multi-query attention, you have one head for the keys and values
04:36:50.120 | which is shared by all the heads of the queries. In grouped query attention
04:36:54.480 | we have a group of heads for the queries sharing one head of the keys
04:37:01.280 | So when you have multi-query attention, you have only one head here for the keys and values, shared by all the query heads
04:37:07.440 | When you have grouped query attention, you have multiple heads
04:37:11.160 | of the queries sharing one head of the keys and values
04:37:16.760 | So basically multi-query attention
04:37:19.260 | which is only using one head for the keys and values, reduces the quality of the model more
04:37:24.720 | A good compromise between full multi-head attention and multi-query attention is grouped query attention, which reduces
04:37:32.700 | the quality of the model slightly less, but still gives you this computational advantage of reducing the amount of data transfer
04:37:39.960 | another very big advantage of
04:37:42.440 | grouped query attention is that you reduce the size of the KV cache, because as you remember
04:37:47.480 | we have one KV cache for each layer and in each
04:37:50.720 | KV cache we need to save each token
04:37:54.320 | so if we compress these tokens, the total amount of memory required for the KV cache is reduced, and
04:38:01.120 | actually, the KV cache is also one of the bottlenecks in today's language models
04:38:06.280 | So we have these big language models that are like 70 billion parameters or whatever
04:38:12.940 | But the problem using them is not even actually the GPU memory requirement just for storing the model
04:38:20.700 | but actually for storing this big KV cache, because you have to store each single token in each of the layers of the model
04:38:26.920 | Which actually grows very fast if you have a lot of tokens
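As a rough, illustrative back-of-the-envelope estimate of why this matters (all numbers below are hypothetical, not PaliGemma's actual configuration):

```python
# Hypothetical configuration, just to show the order of magnitude.
num_layers = 32
seq_len = 4096
head_dim = 128
bytes_per_value = 2  # float16

def kv_cache_bytes(num_kv_heads: int) -> int:
    # 2x for keys and values, stored per layer, per token, per key/value head.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

print(kv_cache_bytes(num_kv_heads=32) / 2**30, "GiB")  # full multi-head attention: 2.0 GiB
print(kv_cache_bytes(num_kv_heads=1) / 2**30, "GiB")   # multi-query attention: ~0.06 GiB
```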
04:38:29.960 | Okay. Now that we have seen how the group query attention works, we can proceed further
04:38:35.520 | Let's continue our journey
04:38:40.400 | So the next part that we need is this beautiful thing called the rotary positional
04:38:45.560 | encodings, which I will not explain right now. I will explain them after
04:38:49.260 | completing the attention module
04:38:52.380 | for now, we just consider them as a black box that adds some information encodes the information of
04:38:58.300 | Position in the tokens and later we will see how it works
04:39:01.700 | Let's implement the forward method. So the forward method is this one
04:39:07.140 | so basically it takes the hidden states, which, inside the decoder layer,
04:39:12.140 | is the output of the first
04:39:16.300 | RMS normalization
04:39:19.180 | Then we have the attention mask, the positions that we need to apply to each token because we need to apply the positional
04:39:25.300 | encodings, and then the KV cache in case we are using it, and now we will implement it
04:39:30.020 | So the computation of the attention is the same as before
04:39:33.220 | Let me copy a big part. So like this
04:39:38.180 | The first thing we do is we extract the batch size and
04:39:41.880 | the length of the queries
04:39:45.120 | So what is the length of the input sequence? Because as you remember, when we do token generation
04:39:50.220 | during the prefilling the q_len will be the whole input prompt
04:39:54.900 | But then during token generation the query will only be one single token, because we want to
04:40:00.060 | generate only the last part of the attention matrix, so the last row, so we need only one query
04:40:05.580 | But how can we have all the keys to attend to? Because we have something called the KV cache, which will store all the keys
04:40:11.580 | So what we are computing here is the same as before
04:40:15.860 | So we are converting the input sequence into query key and values and then we are splitting this
04:40:22.300 | Embeddings into groups of dimensions based on how many heads we have for the query key and values
04:40:31.020 | For the query, we will split it into numHeads number of groups
04:40:35.100 | Each number or each group will have headDim number of dimensions and for the keys and values
04:40:41.420 | We will have numKeyValueHeads number of groups and each group will have headDim number of dimensions to manage
04:40:48.620 | Then we do this transposition so I can show you again. What does this transposition do? So let's do it
04:40:55.500 | Let's go back to our
04:41:00.980 | So the first part that we are doing here, up to the transposition, is this one
04:41:06.740 | So we are multiplying the input sequence with WQ, WK and WV and splitting these
04:41:13.220 | embeddings into heads
04:41:15.780 | So that each embedding is a list of groups, where each group is managing some dimensions
04:41:23.780 | So now what we end up is basically a sequence of what?
04:41:28.100 | Tokens where each token is made up of groups and each group is managing for example
04:41:33.020 | 128 dimensions
04:41:35.420 | Then we use this transposition because we want to have at the first dimension the heads dimension
04:41:42.660 | So that we have a structure like this
04:41:45.260 | So instead of having a sequence of tokens where each token has groups of dimensions
04:41:51.300 | We want a list of groups where each group is a head
04:41:55.420 | Each head has some tokens, as many as the sequence length, and each token is a mini token
04:42:03.840 | which is the dimensions dedicated to that specific head. So head number one will have
04:42:09.460 | 128 dimensions, head number two will have the next group of
04:42:14.420 | 128 dimensions, etc., until the last one, which will have the last group of 128 dimensions
04:42:20.580 | This allows us to compute the multi-head attention using this
04:42:26.180 | sequence, this sequence, this sequence and this sequence all in parallel
04:42:31.620 | Okay, and this is the meaning of this transposition
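A minimal, self-contained sketch of this split-into-heads plus transpose step, with illustrative shapes (not the real PaliGemma configuration):

```python
import torch
import torch.nn as nn

# Illustrative shapes.
bsz, q_len, hidden_size = 2, 16, 1024
num_heads, num_kv_heads = 8, 1
head_dim = hidden_size // num_heads

q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)

hidden_states = torch.randn(bsz, q_len, hidden_size)
query_states = q_proj(hidden_states)  # [bsz, q_len, num_heads * head_dim]
key_states = k_proj(hidden_states)    # [bsz, q_len, num_kv_heads * head_dim]

# Split the embedding into per-head groups, then move the head dimension before the
# sequence dimension so that every head can be processed in parallel:
# [bsz, q_len, num_heads, head_dim] -> [bsz, num_heads, q_len, head_dim]
query_states = query_states.view(bsz, q_len, num_heads, head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, num_kv_heads, head_dim).transpose(1, 2)
print(query_states.shape, key_states.shape)  # [2, 8, 16, 128] and [2, 1, 16, 128]
```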
04:42:37.520 | Transpose the next thing that we do is we apply the rotary positional encodings and now
04:42:44.020 | We didn't talk about the rotary positional encodings and we will talk about later
04:42:48.540 | But for now, you need to think that we are not changing the shape of these keys and queries and values
04:42:55.220 | We are just
04:42:57.100 | modifying them by adding some information that
04:43:00.540 | Encodes their position and it will be done by this method called apply rotary positional embedding
04:43:06.580 | We will see later how it works for now
04:43:10.060 | just think that in the query and the keys we have encoded some information which will be leveraged by the attention mechanism to
04:43:18.020 | Relate tokens to each other differently based on their position basically, but we will see that later. So
04:43:24.100 | Suppose that we have already encoded the positional
04:43:27.200 | information. So now, as you remember, when we work with the KV cache
04:43:32.460 | we pass only one single token as input to the layers of the
04:43:38.620 | Transformer and this single token is added to the KV cache in the keys and the values cache of this
04:43:47.020 | particular layer; then we retrieve the content of this KV cache, which includes the newly added token and all the previously saved
04:43:54.660 | tokens, and then we use this
04:43:56.940 | Output of this KV cache to calculate the attention. So let's implement this KV cache
04:44:02.980 | so it's very simple because it's only one method to implement which basically will just take the
04:44:08.500 | Single token that we are sending in which is this key states will add it to the key cache
04:44:13.940 | will take this value states which is one single token add it to the value cache and then retrieve all the content of the cache as
04:44:20.860 | Output so all the past token it has seen plus the current one
04:44:25.060 | So let's implement it and we go to the beginning of the file
04:44:33.500 | Class KV cache. Let's do it like this
04:44:38.740 | So we create a constructor; as you can see, it is a kind of buffer that includes one buffer for each layer of
04:44:44.940 | the model, one for the keys and one for the values
04:44:48.980 | We also have this helper method that tells us how many items the KV cache currently stores
04:44:56.780 | So if this KV cache does not contain any item we say zero if it contains something then we return
04:45:03.060 | the number of items it stores. As you remember, when we add something to the KV cache, we are adding
04:45:10.100 | this tensor here, which is the key states and value states, which are tensors of this shape
04:45:17.700 | so batch size, number of heads, sequence length and head dimension
04:45:21.540 | which means that the sequence length is the second-to-last dimension. So that's why
04:45:27.700 | we return the second-to-last dimension to retrieve the sequence length currently stored in the KV cache
04:45:33.060 | We then implement the update method which is also very simple and I added some
04:45:39.540 | comments to it to make it simple
04:45:41.900 | So basically this will add the content of the key states and value states to the KV cache of this layer
04:45:49.620 | And then it will return whatever is stored for this layer
04:45:53.820 | So if we have never added anything to the KV cache of this layer, then we create it: we basically append these tensors
04:46:00.900 | because we have nothing else to concatenate them with
04:46:04.660 | Otherwise, if we already have some tokens in the key cache and the value cache of this particular layer
04:46:11.540 | then we concatenate whatever is already present with the newly incoming token. Along which dimension? Along the sequence dimension, and the sequence dimension
04:46:19.620 | as we saw before is the dimension -2. That's why we concatenate them along the dimension -2
04:46:24.960 | so after concatenating them we retrieve all the content of the
04:46:29.340 | K and V cache and return it for the current layer and this is what is happening here
04:46:35.340 | Here we add the incoming key states and value states to the KV cache
04:46:43.420 | Then we retrieve them and we use them to compute the attention
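A minimal sketch of a KV cache class along the lines described here (names and structure follow the explanation, not necessarily the exact code written in the video):

```python
from typing import List, Tuple
import torch

class KVCache:
    """Minimal sketch of a per-layer key/value cache as described above."""

    def __init__(self) -> None:
        self.key_cache: List[torch.Tensor] = []    # one tensor per layer
        self.value_cache: List[torch.Tensor] = []  # one tensor per layer

    def num_items(self) -> int:
        # Cached tensors have shape [batch, num_kv_heads, seq_len, head_dim],
        # so the number of stored tokens is the second-to-last dimension.
        if len(self.key_cache) == 0:
            return 0
        return self.key_cache[0].shape[-2]

    def update(
        self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx: int
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        if len(self.key_cache) <= layer_idx:
            # First time we see this layer: just store the incoming tokens.
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:
            # Otherwise, concatenate along the sequence dimension (-2).
            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
        # Return everything cached so far for this layer.
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```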
04:46:46.900 | Now you need to remember that when we do use the KV cache
04:46:50.700 | There are two phases when working with the model with the KV cache
04:46:54.700 | There is one part called prefilling, in which we have the prompt. The prompt in our case will be the image tokens plus
04:47:00.640 | the user prompt, so what the user wants the model to do with this image
04:47:05.920 | It will be a list of tokens. So this key states and this value states will be a list of tokens
04:47:12.220 | So they will be all added to the cache for the first time because initially the cache will be empty and will be retrieved here
04:47:18.380 | When we do token generation, we use the last token output by the model and
04:47:23.020 | We add it one at a time to the KV cache
04:47:26.660 | But we always retrieve all the content of the KV cache to compute the attention because the each query needs to attend all the past
04:47:33.620 | keys and values
04:47:35.620 | It needs to attend all the past keys which are then used to compute the weighted sum using the values
04:47:42.760 | Um, okay, what is the next part of the computation of the attention? Well, well, well here
04:47:50.300 | we have this
04:47:53.180 | repeat
04:47:55.400 | Now we need this method called repeat_kv, which basically will repeat the
04:48:00.560 | heads of the
04:48:03.560 | keys and values that are missing to match the heads of the query
04:48:11.880 | Um, okay, let me explain it with the iPad because it's much easier to draw than to explain by words. So let's go here
04:48:19.880 | Let's go here
04:48:23.980 | Okay. So what happens with this repeat method is that we have the projection
04:48:30.160 | Through WK and WV of the token that results in a smaller token
04:48:36.680 | Which gives us some benefit from the KV cache point of view for example
04:48:40.500 | But to compute the attention, each query head needs to share
04:48:45.360 | a key/value head with other query heads when working with the keys
04:48:52.440 | so for example
04:48:53.760 | The first two heads of the query needs to share one head for the keys
04:48:57.920 | Then the second two heads for the query needs to share one head for the keys
04:49:01.920 | what we do is basically we
04:49:05.360 | repeat this, because we are working with the naive implementation of the attention, which does not really
04:49:12.040 | benefit from this optimization. So what we do is basically we just repeat the missing heads, as
04:49:18.940 | you can see here. So we take the heads that are missing and we just repeat them
04:49:26.360 | to match the heads of the query, so
04:49:31.580 | like this one, so that it's as if each query head has its own head also for the keys
04:49:37.480 | This is because actually we are not creating a custom CUDA kernel for the computation of the attention
04:49:43.240 | So we repeat it and we just pretend like the grouped query attention never happened
04:49:49.840 | but for example
04:49:50.760 | If you use a flash attention flash attention actually leverages the reduced number of heads of the keys and values to optimize the computation
04:49:58.680 | of the attention
04:50:00.560 | So basically we are kind of reversing the effect of grouped query attention when calculating the attention because we don't have this
04:50:07.440 | Custom CUDA kernel that can leverage this by not copying the missing heads
04:50:11.720 | The repeatKV function is very simple
04:50:16.680 | So we can implement that as well because it will just repeat the heads that are missing for the keys and values
04:50:23.000 | So let's implement it here
04:50:27.360 | As you can see if we have a tensor and we know that this tensor has the following shape
04:50:32.920 | So the batch the number of heads the sequence length and the head dimension
04:50:36.840 | If we only need to repeat it once then we just return it because we don't have to repeat anything
04:50:41.720 | otherwise, we introduce a new dimension, which is how many times we want to repeat this number of heads and then we
04:50:49.180 | we do this reshaping, which will basically repeat this number of heads that many times
04:50:57.040 | Actually, the repetition is done by the expand method here. So we introduce a new dimension here
04:51:02.640 | Which is the number of repetitions and then we expand it. This expansion basically repeats whatever content is
04:51:09.440 | This content here for each of the heads in the nrep heads
04:51:15.540 | So basically we are repeating whatever comes after these two dimensions this number of times
04:51:22.680 | and then we remove this helper dimension that we have created the nrep dimension that we only created to repeat the number of heads and
04:51:30.680 | How do we do it? We must multiply the number of repetitions that we need with the number of key value heads
04:51:37.320 | So at the output of this method the number of heads that you will have is the same as the number of heads of the query
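For reference, a sketch of such a repeat_kv helper, following the shape convention described above (batch, num_kv_heads, seq_len, head_dim); treat it as an illustration rather than the exact code written in the video:

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each key/value head n_rep times so that the naive attention code can
    treat grouped-query attention as if it were plain multi-head attention."""
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Insert a helper dimension, broadcast it n_rep times, then fold it into the head dimension.
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)
```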
04:51:43.600 | So let's go back here
04:51:45.920 | So now it will this key states and value states will have the same number of heads as the query
04:51:51.400 | So now we can just compute the attention like we have always been doing so by doing the query
04:51:55.640 | multiplied by the transpose of the keys, divided by the square root of the head dimension, etc., etc.
04:51:59.760 | So let's do it
04:52:02.440 | We also add the attention mask
04:52:05.880 | so we compute the attention weights with the standard formula: query multiplied by the transpose of the keys, divided by the square root of
04:52:12.520 | the head dimension, which is the number of dimensions
04:52:15.200 | managed by each head
04:52:18.120 | We then add the attention mask right before
04:52:21.800 | using the softmax. So the attention mask
04:52:25.360 | in our case will always be made of zeros, because we don't have any padding
04:52:30.440 | so we don't need to mask anything, and also during the prefilling we don't mask anything, because
04:52:34.360 | we always let the prompt, so the user text prompt, also attend to future tokens. Why? Because the PaliGemma
04:52:42.500 | authors made this decision, and
04:52:46.800 | They decided that the prompt the user prompt or the task prompt does not need to be causal because anyway
04:52:53.480 | It will never be generated by the model. It will always be
04:52:55.840 | set by the user
04:52:58.680 | So we apply the softmax and then the dropout but the dropout we never have so this stuff here is very simple
04:53:06.560 | So we apply the softmax
04:53:08.480 | Row by row then we apply the dropout but the dropout is always zero and we as you know
04:53:13.380 | The dropout is only applied during training but just ignore it like it's not there
04:53:17.960 | Then the output of the softmax is multiplied by the value states
04:53:24.160 | so these attention weights are multiplied by the value matrix, which will result in that
04:53:31.700 | weighted sum we saw before so each token is
04:53:35.920 | an aggregation of previous tokens based on the
04:53:40.440 | Score defined in the attention matrix. So if you want to visualize it again, I can show it to you again. So let's go here
04:53:47.640 | When we do the multiplication with the V which is here
04:53:52.860 | Basically this output token
04:53:56.240 | Let's say this one here is a contextualized token and that will include information about three tokens. I love pepperoni and
04:54:03.640 | It will be a weighted sum of these three tokens
04:54:08.240 | So I love pepperoni based on the following weights
04:54:11.440 | So basically the token I will contribute to 20% of information the token love will contribute to 40% of information
04:54:18.640 | The token pepperoni will contribute 40% of information and the last token will not contribute any information because it has been masked out
04:54:25.880 | So this is what happens when you multiply the V that you are doing a weighted sum using the attention weights as weights
04:54:35.840 | Then what else we need to do we need to check okay the output shape and that's fine I can do that so
04:54:43.060 | we do this one and
04:54:45.680 | Then we transpose back
04:54:48.160 | Like we did before
04:54:51.360 | So we transpose back to have again the sequence length as the second dimension then the num heads as the third dimension
04:54:58.200 | then we
04:55:01.000 | concatenate all the heads together just like we saw before, so now each token is back to the hidden size
04:55:08.400 | dimension, where this hidden size is the concatenation of the outputs of each head
04:55:13.740 | But if you just concatenate the output of these heads, then each embedding will just be an
04:55:21.800 | independent calculation of each head concatenated together
04:55:25.640 | So we need some kind of mixing mechanism and this mixing mechanism is given by WO which will mix all these
04:55:32.880 | Dimensions with each other so that the result of each head is kind of mixed with each other through this WO projection
04:55:40.240 | So that this output
04:55:42.400 | Token from this multi head attention is not just a concatenation of multiple independent heads
04:55:49.520 | But it's something that is also mixing the results of this independent heads
04:55:54.600 | And then we will return the result of this multi-head attention
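Putting these last steps together, here is a hedged, self-contained sketch of the tail of the attention computation (all shapes and names are illustrative assumptions, not the exact repository code):

```python
import math
import torch
import torch.nn as nn

# Illustrative shapes: query/key/value states after repeat_kv, plus a mask of zeros.
bsz, num_heads, q_len, head_dim, hidden_size = 2, 8, 16, 128, 1024
query_states = torch.randn(bsz, num_heads, q_len, head_dim)
key_states = torch.randn(bsz, num_heads, q_len, head_dim)
value_states = torch.randn(bsz, num_heads, q_len, head_dim)
attention_mask = torch.zeros(bsz, 1, q_len, q_len)   # 0 where attention is allowed
o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(head_dim)
attn_weights = attn_weights + attention_mask              # mask is added before the softmax
attn_weights = torch.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_output = torch.matmul(attn_weights, value_states)    # weighted sum of the values
attn_output = attn_output.transpose(1, 2).contiguous()    # [bsz, q_len, num_heads, head_dim]
attn_output = attn_output.view(bsz, q_len, -1)            # concatenate the heads
attn_output = o_proj(attn_output)                         # mix the heads together with WO
```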
04:55:59.240 | Now one thing that we have considered as a black box so far is the rotary positional encoding
04:56:05.280 | So we have said okay
04:56:06.800 | we are encoding somehow the positional encodings in these queries and keys and then the
04:56:13.040 | Multi head attention will leverage it now. It's time to expand on that and understand how it works. So let's do it
04:56:20.320 | All right. So let's talk about positional encoding guys
04:56:23.800 | so traditionally we are used to work with the
04:56:27.720 | Positional encodings applied directly at the entrance of the transformer, which means that we take some embeddings
04:56:34.400 | So we transform we have our tokens which indicates the position of the token in the vocabulary
04:56:40.180 | We convert them into embeddings using the embedding layer, which is this stuff here
04:56:45.040 | And then we add some other
04:56:50.020 | Vectors to these embeddings that encode the position information of each token because otherwise the model has no
04:56:56.200 | notion of position. As you saw before, each head just does a dot product of two tokens, and
04:57:04.320 | if the position information is not encoded in these two tokens, the dot product can only access the embeddings
04:57:10.840 | So it does not have any notion of which token comes first and which comes later
04:57:16.180 | So to encode this information, we basically traditionally we are used to add a positional encoding here to the embeddings of each
04:57:24.080 | Token and so that the embeddings basically encode the information of the position in the original transformer paper. They proposed this
04:57:31.540 | sinusoidal positional encodings which are also known as absolute positional encodings because they encode the absolute position in the
04:57:39.240 | Inside each token. So the token number one will have some dimensions some vector that will encode the position number one
04:57:45.980 | The token number five in the sentence will have the position number five added to it, etc, etc
04:57:51.060 | What we use in most language models nowadays is the rotary positional encodings
04:57:57.580 | Which are in the family of the relative positional encodings and they work as follows. So let's open the paper
04:58:03.420 | They were introduced in this paper called RoFormer: Enhanced Transformer with Rotary Position Embedding
04:58:13.740 | Basically the idea with the this
04:58:15.820 | Positional encodings is that we do not add them directly to the embedding of each token
04:58:22.420 | so that each token encodes the information of its position, but they
04:58:26.060 | modify the attention mechanism in such a way that the attention mechanism takes into
04:58:32.100 | Consideration the position of the tokens to relate them differently based on their position. Let's see how they did
04:58:38.580 | So basically in the paper they say okay
04:58:42.580 | We have this multi-head attention mechanism that uses the dot product to relate tokens to each other
04:58:49.720 | so they said okay, can we find an
04:58:52.780 | encoding of the embedding vectors of tokens such that
04:58:58.080 | When we do the dot product, which is an inner product. So this sign here means the inner product
04:59:03.980 | So can we find an encoding for the token called FQ for the query and FK for the keys?
04:59:11.380 | that encodes the position information inside the embedding XM for the query and
04:59:17.940 | XN for the keys such that when we do the dot product
04:59:22.900 | So this function G
04:59:24.580 | this dot product, the output of this dot product
04:59:27.140 | Only depends on the embedding of the first token the embedding of the second token and the relative distance between them
04:59:35.120 | So that's why they are called relative positional encodings because they depend the dot product is modified
04:59:40.660 | so the attention mechanism is modified such that the dot product should depend only on the
04:59:46.660 | Embedding of the first token on the embedding of the second token and the relative distance between them
04:59:52.940 | So we need to find a way to encode
04:59:56.120 | information inside of our embedding such that this dot product will depend only on the embedding of the first
05:00:03.740 | embedding of the second and the relative distance
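In the notation of the RoFormer paper, this property can be written as:

```latex
\langle f_q(x_m, m),\; f_k(x_n, n) \rangle = g(x_m, x_n, m - n)
```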
05:00:06.900 | so how to encode this information inside the
05:00:10.060 | Embeddings. Well, they
05:00:12.740 | Proposed the following case for the 2D case. So imagine we have an embedding vector made up of only two dimensions
05:00:20.740 | How to encode the information of the position in this two-dimensional vector as follows
05:00:29.420 | basically, we create a matrix that is a
05:00:34.260 | Rotation matrix. So if you have ever worked with the rotation matrix like when you do rotation of a vector in 2D space
05:00:41.720 | you basically multiply the vector by this matrix here where the
05:00:45.640 | Argument of the cosine and the sine is a multiple of an angle that defines by how much you want to rotate this vector
05:00:55.380 | So if we basically
05:00:58.380 | Multiply the two dimensions of this vector by this matrix here
05:01:03.180 | Which is we will see what is it and then this matrix here, which is a rotation matrix
05:01:08.700 | Then basically we are rotating this vector by some angle defined by this
05:01:14.940 | M theta angle
05:01:18.100 | This will encode the information so the output of this operation
05:01:23.660 | So the output of this operation will be a 2D vector which will encode the information of the position
05:01:30.260 | based on this position M
05:01:32.460 | such that when we do the dot product of two vectors encoded like this, this dot product is guaranteed
05:01:40.740 | to be a function of the embedding of the first vector, the embedding of the second vector, and the relative distance
05:01:50.620 | between the positions
05:01:52.980 | that were encoded into them
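For the 2D case, the encoding function from the paper is the familiar rotation matrix applied after the usual query/key projection:

```latex
f_{\{q,k\}}(x_m, m) =
\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}
W_{\{q,k\}} \, x_m
```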
05:01:56.160 | Basically, but we usually when we have an embedding we do not have a 2D vector
05:02:04.540 | We have a multi-dimensional vector, maybe 1000 dimensions or 2000 dimensions
05:02:09.940 | So they take the 2D case to the general case and the general case basically they say okay instead of
05:02:17.380 | multiplying the
05:02:19.500 | token by
05:02:21.820 | So instead of using this 2D rotation matrix, we need to have this big rotation matrix here for an
05:02:27.900 | D-dimensional vector. So here is the d-dimensional vector
05:02:31.980 | If you look at this vector this matrix here as you can see it is a sparse matrix
05:02:38.820 | Which means that it is mostly made up of zeros and only some elements are non zeros
05:02:44.580 | So if we encode the information using this transformation here by using this matrix here
05:02:50.860 | We will be doing a computation that will result in the following property being verified
05:02:56.380 | which is that the when we do the dot product this dot product will only depend on the
05:03:01.140 | Embedding of the first token the embedding of the second token and the relative distance of the two positions that were that was encoded into
05:03:08.580 | these tokens
05:03:09.980 | But we will be doing a lot of unnecessary computations, because a lot of zeros will be
05:03:14.780 | multiplied by other elements, which will result in zero. So we are doing a lot of
05:03:20.180 | computation
05:03:21.940 | uselessly, because in a sparse matrix
05:03:23.940 | most of the elements are zeros and only some of them are non-zero
05:03:29.780 | That means that you are doing a lot of computations uselessly
05:03:32.620 | because you already know in advance that they are zeros
05:03:37.220 | So is there a better way to compute this encoding mechanism to reduce this unnecessary?
05:03:44.860 | Computations knowing already that most of them are zeros and we also know where they should be zeros
05:03:50.660 | Well, yes, there is it is possible and they propose another
05:03:54.500 | more computationally efficient
05:03:57.180 | realization of this matrix
05:03:59.540 | Which basically says that if you want to encode the position information inside your tensor inside your embedding
05:04:06.140 | You need to take the embedding
05:04:08.900 | Here this so a d-dimensional vector because we know it's a d-dimensional vector. So where d can be 1000, 2000
05:04:14.940 | Whatever it is. Suppose in our case, it's 1024
05:04:18.180 | You multiply it element wise. So this is element wise multiplication by another matrix constructed as follows
05:04:26.580 | Where the first element is a cosine of m theta 1 and the second element is cosine of m theta 1 etc
05:04:33.460 | Where m is the position that you want to encode in this vector and the theta 1 theta 2 are
05:04:40.540 | Computed using the following formula here. So they show it
05:04:47.500 | theta_i is equal to 10,000 to the power of minus 2(i-1)
05:04:52.020 | divided by d, where i goes, if I remember correctly
05:04:57.620 | they show it here, yeah, i goes from 1 to d divided by 2
05:05:05.220 | Let's go back
05:05:07.500 | So basically what we are doing is we are multiplying each dimension of this vector by a cosine
05:05:13.140 | Where where the argument of the cosine is a multiple of a base theta
05:05:19.380 | Multiplied by the position of the token that we want to encode into this token plus
05:05:26.340 | The dimensions of this vector but rotated and with changed signs
05:05:32.940 | Multiplied element wise with the sign of the same arguments that we use for the cosine
05:05:38.780 | And if you encode your vector like this
05:05:42.540 | And when you do the dot product of two vectors encoded like this
05:05:47.060 | What will happen is that the dot product is guaranteed to be
05:05:50.980 | The number that comes out of this dot product
05:05:54.460 | Will be depending on the embedding of the first vector
05:05:58.340 | So the information that was encoded before adding the positional encoding the embedding of the second vector
05:06:04.360 | So the information that was encoded in the vector before adding the positional encoding and the relative distance plus
05:06:10.580 | they also say that
05:06:12.980 | Basically the rotary positional encoding also have a
05:06:17.260 | decaying effect based on the distance between two tokens
05:06:21.260 | which means that the dot product as we know the dot product is converted into a score by the
05:06:27.980 | Softmax, so it tells us how intense is the relationship between two tokens
05:06:33.020 | So the bigger the dot products the more that that token will contribute to the output
05:06:38.940 | Contextualized embedding as we saw before
05:06:41.140 | So each of the attention scores tells us how much information that token will contribute to the output contextualized embedding
05:06:48.900 | So with the rotary positional encodings, what happens is that this dot product will be modified in such a way
05:06:56.500 | That the dot product will be high when two tokens are close and as they move apart
05:07:03.740 | So the distance between the two tokens for which we are doing the dot products grows
05:07:08.940 | The dot product will decay will decrease in magnitude
05:07:13.820 | So the output number will be smaller and smaller and smaller based on the relative distance between the two tokens
05:07:19.740 | And they give a relative upper bound based on the relative distance between two tokens
05:07:26.380 | So, to rehearse: to encode the positional information of a token using
05:07:32.500 | Rotary positional encoding we need to do the following computation where we take the vector of the token
05:07:39.380 | We multiply it by a special matrix constructed like this
05:07:42.820 | plus again the
05:07:45.380 | vector of the token itself, but with its dimensions changed in position
05:07:51.260 | So first we create a special vector where we put first the second dimension of the vector, but with the change sign
05:07:58.300 | then the first
05:08:00.820 | Dimension then the fourth dimension with its sign change then the third dimension, etc, etc
05:08:06.860 | and then multiplied by this sine matrix constructed as follows, using the theta values
05:08:13.940 | calculated according to this formula here this one here and
05:08:19.820 | Each of these sines and cosines is basically
05:08:24.180 | Working with an argument that is a multiple of this base theta multiplied by the position that we want to encode into this token
05:08:33.460 | And if you want to visualize in the rotary positional encoding paper
05:08:38.940 | They also say what is the meaning of this rotary positional encoding?
05:08:42.860 | So basically each two dimension as you can see from this matrix here
05:08:46.580 | Each two dimension are being rotated by the same angle
05:08:50.300 | So basically it's we are have a token that is made up of many dimensions
05:08:55.780 | So each pair of dimensions is getting rotated like a 2d vector
05:09:00.500 | So each two dimensions are considered like a two dimensional vector
05:09:05.340 | Which is getting rotated by an angle that is a multiple of the base angle
05:09:10.460 | Multiple with respect to the position that you want to encode
05:09:15.020 | And this is the the meaning of the rotary positional encoding. So the rotary positional encoding to rehearse again
05:09:21.820 | modify the attention mechanism in such a way that the attention score that is generated is dependent on the
05:09:29.580 | Relative distance between two tokens and they also prove in the paper that this attention score
05:09:34.940 | Decays as the distance between the token grows
05:09:37.940 | Okay, now that we have seen how it works. Let's code it
05:09:43.940 | And actually in the code that we are going to write you will see that
05:09:46.660 | I am going to use the HuggingFace implementation of the rotary positional encodings
05:09:51.240 | And we will see that the rotary positional encoding that is implemented in the HuggingFace library is slightly different from
05:09:58.500 | the formula that you see
05:10:01.540 | here, this one here
05:10:04.580 | But it according to the authors it results in the same computation. So
05:10:10.020 | They they do it this way
05:10:12.020 | They I will also share the blog post in which they they explain why they do it this way
05:10:17.060 | So it's a slightly difference, but the idea is the same
05:10:20.340 | So it will result in a slightly different calculation, but the effect is the same. So let's do it
05:10:24.980 | All right, let's implement this rotary positional encoding
05:10:28.180 | So the first thing we need to create is this Gemma rotary positional encoding class
05:10:33.060 | So for that we can do it. I think here it's same no problem
05:10:39.460 | Let's do it here
05:10:41.460 | Okay, so then we are giving some parameters: dim is the head dimension, because the
05:10:48.820 | rotary positional encodings modify the attention mechanism
05:10:51.720 | The attention mechanism is performed independently for each attention heads
05:10:56.420 | So each head will have its own positional encoding applied to the tokens
05:11:01.540 | So this dim is the set to the head dimension. So the number of dimensions managed by each head in the multi-head attention
05:11:09.060 | Then we have the max positional embeddings, which tells us
05:11:11.700 | What is the maximum number of positions we can encode?
05:11:15.460 | this is
05:11:17.540 | set to 8000 actually in the Gemma configuration; here it's initialized to 2000, but actually it will be overwritten
05:11:23.480 | And then we have the base parameter theta which is set to 10000 also in the original paper
05:11:29.940 | So let me show you from the paper
05:11:32.100 | Let's go here
05:11:39.220 | I think I can find it
05:11:41.220 | Here, as you can see, it's 10000 to the power of minus 2i divided by d. So this stuff here
05:11:49.540 | Then we have this inverse frequency. So this inverse frequency is just the formula you can see here
05:11:55.380 | so 10000 to the power of minus 2i divided by d, where i goes, as written here
05:12:01.700 | from 1
05:12:04.100 | to d divided by 2
05:12:06.580 | so d divided by 2
05:12:08.580 | And so the formula we are using is actually I think this one here to calculate it
05:12:13.940 | So 10000 to the power of minus 2 i divided by d
05:12:18.260 | So it's 10000 divided
05:12:20.900 | It's 10000 to the power of minus
05:12:24.660 | Minus something but when you have the negative power, it means one over the same thing with the positive power
05:12:32.740 | So that's why we have one over
05:12:36.080 | 10000 to the power of the positive power. So
05:12:38.880 | Let me write it. Actually when you have x to the power of minus 3
05:12:45.440 | It means 1 over x to the power of 3. So that's why you have 1 over
05:12:51.200 | 10000 to the power of something
05:12:54.160 | And what is this something that we are raising to the power 10000 to?
05:12:57.840 | It's a list of numbers that goes from 0 to dimension divided by 2 which is the i
05:13:05.280 | Divide by d where d is the number of dimensions
05:13:09.760 | So of the vector to which we will apply the rotary positional encoding which is according to this formula here
05:13:16.320 | so i goes from 0 to
05:13:18.560 | d divided by
05:13:20.880 | d divided by 2 and d is the number of dimensions of the vector to which we apply the rotary positional encodings in our case
05:13:26.800 | It's equal to the head dimension, because each head will have its positional encodings applied to it
05:13:32.480 | We use this arange to generate a list of
05:13:37.200 | d divided by 2 numbers. So basically it's 0 to dim, skipping every 2
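A small sketch of this inverse-frequency computation (head_dim and base are the illustrative values from the discussion):

```python
import torch

head_dim = 128      # illustrative: the number of dimensions managed by each head
base = 10000.0      # the base theta from the paper

# The exponent 2i/d is built with arange(0, head_dim, 2) / head_dim (head_dim/2 values),
# and the negative power is written as 1 / base^(...).
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
print(inv_freq.shape)  # torch.Size([64])
```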
05:13:43.040 | What else we need to do here I believe we need to go let me check
05:13:51.120 | Okay, so now we can implement the forward method of this
05:13:58.880 | So to calculate the rotary positional encodings, let me go back to the paper and then explain the
05:14:06.720 | forward method. So to apply the rotary positional encodings, we need the vector itself
05:14:14.800 | the vector itself, and then we need to multiply each dimension by some cosine, plus the dimensions
05:14:21.220 | rotated and with their signs changed, multiplied by some sines
05:14:26.300 | computed as follows so given some positions we can for each position m compute the cosine and the sine that will be
05:14:34.300 | Needed to multiply by these vectors and this is what we do in the forward method here
05:14:39.420 | We actually extract the cosines and the sines that will be applied to each tokens
05:14:44.140 | Depending on the positions of these tokens. So for each token, we will have a different position
05:14:50.700 | So this m parameter indicates the position of the token
05:14:54.860 | So for each m we can compute the cosines and the sines and this is what we do in the forward method here
05:15:00.220 | So we take the inverse frequency, we add
05:15:03.820 | another dimension, which I believe is for the batch dimension
05:15:08.880 | And then we
05:15:12.060 | Disable the auto cast so the auto cast in torch is for mixed precision
05:15:16.700 | so I don't want to go too much into the detail of this stuff, but
05:15:21.500 | Mixed precision is basically when you train a when you train a model
05:15:25.500 | You don't have to work with the floating point 32 numbers always because the most modern gpus
05:15:31.580 | They also support working with the 16 bit numbers
05:15:34.940 | which makes computations faster and also reduces the memory of these computations. Of course, you lose a little bit of precision
05:15:42.400 | But for some operations you don't need that much precision. So
05:15:51.240 | automatic
05:15:53.240 | mixed
05:15:54.680 | precision, I think it's called
05:15:56.680 | handles this automatically for you
05:16:00.280 | So it will use the smaller precision for the numbers when computing certain operations and higher precision
05:16:06.600 | So 32-bit when computing other operations such that we are kind of we never lose much
05:16:12.760 | quality in the model
05:16:16.280 | Probably here, for the rotary positional encodings, we want to retain the full
05:16:22.600 | precision, so we disable this automatic
05:16:25.720 | autocast
05:16:29.400 | Okay, so
05:16:31.960 | We are basically multiplying each frequency by each position that we want to encode because as you can see from the paper
05:16:38.520 | So let's go here
05:16:40.760 | We need to multiply this m by the base frequencies. We already have the base frequencies in this
05:16:46.200 | inv_freq_expanded
05:16:48.820 | So we are multiplying it by each m. So we are computing the arguments of these cosines and sines
05:16:58.120 | We concatenate these
05:17:00.120 | arguments. Why? Because we have them for dim divided by two
05:17:05.640 | So for half the vector, but we need it for the entire vector
05:17:10.520 | And we are concatenating here. Now. This is actually different from what we do in the paper
05:17:15.960 | because in the paper
05:17:18.600 | We need to repeat each argument twice for each successive dimension
05:17:25.240 | So for each two dimension, we need the same argument
05:17:28.200 | What we are doing here with the concatenation is actually taking theta 1, then theta 2, then
05:17:35.400 | theta 3, then theta 4, and then again repeating theta 1, theta 2, theta 3, theta 4, instead of doing theta 1, theta 1,
05:17:42.200 | theta 2, theta 2, theta 3, theta 3. So the overall number of
05:17:46.200 | arguments that we produce is the same,
05:17:51.480 | But instead of being like in the paper theta 1 theta 1 theta 2 theta 2 theta 3 theta 3 theta 4 theta 4 blah blah
05:17:59.000 | We are actually doing theta 1 theta 2 theta 3 and then we are repeating them
05:18:05.080 | Theta 1 theta 2 theta 3
05:18:07.720 | Why are we doing this? Now, it's a very long story, but basically it looks like when HuggingFace converted the
05:18:16.200 | weights of a model, for example Llama, from the original pre-trained checkpoint into the HuggingFace format,
05:18:25.100 | they permuted the
05:18:27.800 | query and the key projections, which act on the embedding of the token:
05:18:35.320 | each dimension was permuted.
05:18:37.580 | And then, to accommodate for these permuted dimensions,
05:18:42.940 | They are doing again a different computation for the rotary positional encodings
05:18:48.840 | So the overall effect that will result from this computation is the same as the original paper
05:18:54.600 | but they are doing this double permutation because one permutation was already done when doing the
05:19:00.740 | conversion from the original pre-trained checkpoint to the HuggingFace
05:19:04.260 | format
05:19:07.140 | And this issue is discussed
05:19:09.940 | in the HuggingFace transformers
05:19:13.300 | repository, by a user who asked why the positional encodings are computed differently than in the paper, and the
05:19:20.180 | HuggingFace authors explained that,
05:19:24.020 | When they converted the weights from the original model to the HuggingFace model
05:19:30.020 | they permuted the dimensions of wq and wk, and wq and wk are the projection matrices used to compute
05:19:37.940 | the queries and the keys. We apply the rotary positional encodings to the queries and the keys, so we need to
05:19:43.140 | do another permutation to counter the effect of the first permutation. That's why
05:19:49.700 | the computation we are doing does not exactly reflect the paper.
05:19:53.380 | Let's go forward so we have created the argument of the cosine and the sine
05:19:59.620 | so now we
05:20:00.820 | compute the cosine and the sine
05:20:02.820 | using this argument. So when you call the cosine function on a
05:20:07.700 | tensor, it will calculate the cosine element-wise, using the
05:20:11.140 | values of this tensor as arguments for the cosine, and the same we do for the sine.
05:20:16.340 | So the output of this forward method here
05:20:19.860 | in the paper is basically these two things here that we need for
05:20:25.860 | Applying the rotary positional encoding to each vector and we have computed the cosine and the sine for each
05:20:31.700 | Position that we have in our sequence. So for each m that we have in our sequence
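As a minimal sketch of what this forward method computes, following the HuggingFace-style convention described above (class and argument names are illustrative, not necessarily the exact ones used in the video):

    import torch
    from torch import nn

    class RotaryEmbeddingSketch(nn.Module):
        def __init__(self, dim: int, base: float = 10000.0):
            super().__init__()
            # one frequency per pair of dimensions (dim / 2 values)
            inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
            self.register_buffer("inv_freq", inv_freq, persistent=False)

        @torch.no_grad()
        def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
            # x is only used for its device/dtype; position_ids: (batch, seq_len), the m of each token
            inv_freq_expanded = self.inv_freq[None, :, None].expand(position_ids.shape[0], -1, 1)
            position_ids_expanded = position_ids[:, None, :].float()
            device_type = x.device.type if x.device.type != "mps" else "cpu"
            # disable autocast so the cos/sin tables are computed in full float32 precision
            with torch.autocast(device_type=device_type, enabled=False):
                # (batch, dim/2, 1) @ (batch, 1, seq_len) -> transpose -> (batch, seq_len, dim/2)
                freqs = (inv_freq_expanded.float() @ position_ids_expanded).transpose(1, 2)
                # repeat the half-dim block instead of interleaving (the HF convention discussed above)
                emb = torch.cat((freqs, freqs), dim=-1)
                cos, sin = emb.cos(), emb.sin()
            return cos.to(x.dtype), sin.to(x.dtype)

The returned cos and sin have shape (batch, seq_len, dim) and are later broadcast over the attention heads.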
05:20:37.620 | So let me delete this stuff. Otherwise it remains in my notes forever
05:20:41.700 | Let's go forward now. We need to implement another method called apply rotary positional embedding
05:20:48.680 | Which we include here and which I also copied from HuggingFace
05:20:54.580 | What we'll do basically, okay, this will add another dimension, which is the head dimension to these cosines and and sines that we pre-computed
05:21:01.880 | Where did we pre-compute them? Well, we computed them here
05:21:05.540 | So as you can see, we extract the cosines and the sines using the rotary positional encoding class that we have created before
05:21:11.060 | passing the value states (which are not really used, they just provide the data type of the resulting tensor)
05:21:17.380 | And the position ids that we want to encode
05:21:20.340 | So the m parameter of each of the arguments of the cosine and the sine
05:21:24.420 | So we compute the cosines and the sine and then we use them to apply the rotary positional encoding to the query and the keys
05:21:29.540 | Which will result in the output query and the keys with the rotary positional encoding applied. So now we are implementing this method here
05:21:36.420 | Which will encode the queries
05:21:40.100 | by multiplying the dimensions of the query vector with the cosines, which is this part of the formula.
05:21:47.380 | So as you can see, the vector is multiplied by the cosine,
05:21:51.140 | and then the rotated vector, so with its dimensions rearranged and the signs changed, is multiplied by the sine,
05:21:58.260 | which is this part of the formula here. We need to implement this method here, rotate_half,
05:22:03.700 | which is again not equal to what is in the paper, because we need to permute the dimensions, since the
05:22:11.620 | original vectors, so the q and k, are permuted by
05:22:15.700 | this query projection and this key projection.
05:22:20.980 | This rotate half method basically will take the first part of the
05:22:24.740 | Embedding and then it will take the second part of the embedding with its sign changed. I believe here
05:22:32.820 | and it will concatenate them. It's different than the paper, because in the paper we need to create
05:22:39.540 | minus x2, then x1, minus x4, x3, and so on. But here what we are doing is
05:22:47.540 | the following:
05:22:48.900 | imagine the token is made up of 1024 dimensions. We are doing minus
05:22:54.520 | x513, minus x514, minus x515, and so on up to minus x1024,
05:23:02.740 | and then we have x1, x2, x3, up to x512.
05:23:06.440 | So it's different than this one here,
05:23:09.780 | but it works because of the permutation that was done to the wq and wk projections.
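Here is a minimal sketch of these two helpers, following the HuggingFace-style convention just described (the unsqueeze adds the head dimension so the cos/sin tables broadcast over the heads):

    import torch

    def rotate_half(x: torch.Tensor) -> torch.Tensor:
        # second half of the last dimension with flipped sign, followed by the first half:
        # [x1 .. x_{d/2}, x_{d/2+1} .. x_d] -> [-x_{d/2+1} .. -x_d, x1 .. x_{d/2}]
        x1 = x[..., : x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2 :]
        return torch.cat((-x2, x1), dim=-1)

    def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim: int = 1):
        # cos/sin: (batch, seq_len, head_dim); add the head dimension for broadcasting
        cos = cos.unsqueeze(unsqueeze_dim)
        sin = sin.unsqueeze(unsqueeze_dim)
        # the rotary formula: each vector times cos, plus its "rotated half" times sin
        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        return q_embed, k_embed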
05:23:18.840 | Okay, now we have also implemented the rotary positional encodings which encode the position information
05:23:24.940 | Right before the attention so that the attention mechanism will reflect this encoded information inside of each token
05:23:32.840 | This encoded information is then captured
05:23:34.840 | by the dot product.
05:23:36.840 | What else do we need to build here? I believe we have everything. So let me do a very simple
05:23:45.560 | Check I think we have everything
05:23:47.720 | Guys, I think now we can proceed to the inference code. So we need to use
05:23:53.240 | these classes that we have built to actually run inference on something. Let's do it.
05:23:57.320 | All right, guys, let's go to the inference code. So let's create a new file called inference
05:24:06.520 | I also have prepared the test image that I will be using to
05:24:11.160 | Inference the language model. I will ask the language model
05:24:13.640 | what is this building, and the language model should tell me the name of this building.
05:24:17.640 | You can prepare any any image that you like
05:24:20.360 | So I also have this inference.py. So
05:24:24.280 | Let's start by writing some code. I will copy a large amount of code
05:24:29.640 | because there's not much machine learning here.
05:24:34.120 | So basically i'm using a library called fire. So let's import stuff first
05:24:40.040 | Uh, oops this one
05:24:43.720 | Let's import some stuff so i'm importing a pill for the image loading torch fire fire is a library that allows you to
05:24:51.640 | pass command line arguments to a script as parameters of a function,
05:24:59.080 | so it will automatically do the parsing of the command line parameters.
05:25:02.780 | And what I need to pass on the command line is the model path,
05:25:07.720 | So what are the weights of the model the prompt that we will be using to inference the model
05:25:12.520 | The image that we'll be using as condition for this prompt
05:25:15.880 | and the max number of tokens to generate, the temperature that we want to apply (later
05:25:20.520 | we will see what it is), the top-p
05:25:22.520 | (also explained later), the do_sample flag if we don't want to use the greedy strategy,
05:25:26.520 | and a flag for when we don't want to use CUDA or MPS, in case you are on a MacBook,
05:25:31.180 | so that we force the CPU as the device for the computation in the neural network.
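A minimal sketch of this entry point, with fire doing the argument parsing (parameter names and defaults are illustrative; the commented calls refer to the pieces built in the video):

    import fire
    import torch

    def main(
        model_path: str = None,
        prompt: str = None,
        image_file_path: str = None,
        max_tokens_to_generate: int = 100,
        temperature: float = 0.8,
        top_p: float = 0.9,
        do_sample: bool = False,
        only_cpu: bool = False,
    ):
        # pick the device: CUDA if available, MPS on Apple Silicon, otherwise CPU
        device = "cpu"
        if not only_cpu:
            if torch.cuda.is_available():
                device = "cuda"
            elif torch.backends.mps.is_available():
                device = "mps"
        print("Device in use:", device)
        # model, tokenizer = load_hf_model(model_path, device)
        # processor = PaliGemmaProcessor(tokenizer, ...)
        # test_inference(model, processor, device, prompt, image_file_path,
        #                max_tokens_to_generate, temperature, top_p, do_sample)

    if __name__ == "__main__":
        fire.Fire(main)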
05:25:37.880 | The first thing that this method will do is okay, we'll print which device we use
05:25:41.640 | Then it will load the model using this
05:25:44.600 | method that we will implement later, the load hugging face model method: given the path and the device, it will load the model with the
05:25:52.040 | HuggingFace weights,
05:25:54.440 | by copying each tensor into the right position. Because we kept the names the same as the HuggingFace model,
05:26:00.680 | we don't need to do any name conversion.
05:26:04.120 | We basically take the input and we process it using this PaliGemma processor, which takes as input the tokenizer,
05:26:12.220 | the prompt and the image, and it will
05:26:18.520 | transform them into the input for our Gemma model, which will then decode it.
05:26:23.480 | And we will do that in the test inference method.
05:26:27.720 | So for now, we are just creating the PaliGemma processor and the model itself using this load hugging face model function, which we will create later.
05:26:34.760 | Actually, no, let's do it now. So let's create a new file called utils
05:26:39.100 | And this utils file needs to have the following code
05:26:44.440 | So it's importing some stuff and then it's loading the HuggingFace model. It's loading the tokenizer which, as I said, will be the
05:26:54.920 | Hugging face one. So we will not be coding the tokenizer
05:26:57.500 | But the weights of the model we can load them and if you look at the hugging face model
05:27:03.560 | If you go to the repository of the model, you will see that each model is a list of
05:27:09.160 | Safe tensor files each of these safe tensor files is actually a dictionary that contains the weights of the model
05:27:21.800 | So you can actually click on this icon here and it will show you what each of them contains.
05:27:21.800 | As you can see this one contains the multi-modal projector weight and bias
05:27:25.640 | This one contains the vision tower embeddings, encoder layers one, layer two, layer three, etc, etc for all the layers
05:27:32.440 | And for each layer it contains
05:27:34.920 | The wq projection, wk projection, wv projection, the weights and the bias
05:27:40.200 | The weights, the bias of the layer normalization, the weight of the layer normalization, etc, etc
05:27:46.600 | and each file contains a dictionary that contains some part of the
05:27:51.880 | Weights of this model
05:27:53.240 | So what I'm doing here is: I find all the safetensors files and then I load each of them into a dictionary,
05:27:59.020 | And then I use this dictionary to load the state dict of our neural network
05:28:03.960 | I also create the model using the config.json file that is present in the repository of the hugging face
05:28:12.200 | Model, so every hugging face model has this config.json
05:28:17.160 | So we create the configuration that is used to create our model using this configuration file
05:28:22.840 | and then I call tie weights which will copy the weights of the
05:28:27.000 | Embedding layer to the language modeling head which is the linear layer that projects the embeddings into logits
05:28:33.800 | And then we return the model and the tokenizer. So here there is no machine learning. I'm just loading the
05:28:39.720 | The weights of the model from the safe tensor files creating the
05:28:46.920 | Model using the configuration saved in config.json
05:28:49.900 | And then loading this state dict which means that I am loading the weights into our class
05:28:56.120 | into this model class here, and then I'm tying the weights and returning the model and the tokenizer.
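A minimal sketch of this loader (PaliGemmaConfig and PaliGemmaForConditionalGeneration stand for the classes built earlier in the video; their exact names and constructor signatures are assumptions):

    import glob
    import json
    import os

    from safetensors import safe_open
    from transformers import AutoTokenizer

    def load_hf_model(model_path: str, device: str):
        # the tokenizer comes from HuggingFace; only the model itself is re-implemented
        tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="right")

        # each *.safetensors file is a dictionary holding part of the weights
        tensors = {}
        for safetensors_file in glob.glob(os.path.join(model_path, "*.safetensors")):
            with safe_open(safetensors_file, framework="pt", device="cpu") as f:
                for key in f.keys():
                    tensors[key] = f.get_tensor(key)

        # build the model from config.json, then load the collected state dict
        with open(os.path.join(model_path, "config.json"), "r") as f:
            config = PaliGemmaConfig(**json.load(f))  # class built earlier in the video
        model = PaliGemmaForConditionalGeneration(config).to(device)
        model.load_state_dict(tensors, strict=False)

        # tie the weights: the LM head shares its matrix with the token embeddings
        model.tie_weights()
        return model, tokenizer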
05:29:02.920 | So now we can launch the inference. So we have the model and the tokenizer. We have created the processor
05:29:08.200 | So we have initialized it then we need to launch the inference
05:29:11.160 | Let's see how the inference works. So let's go back to here
05:29:16.360 | This test inference is also not so hard, but we need to do some
05:29:22.280 | Explanation on some parts
05:29:25.460 | So what we are doing is first of all, we take this
05:29:28.760 | Inputs so the image and the prompt
05:29:32.280 | Which is a text and we pass it to the processor and the processor will give us
05:29:37.560 | As you can see from the processing polygamma, it will return us the pixel values
05:29:42.520 | And the input ids and the attention mask
05:29:45.320 | So we get this
05:29:48.200 | these values from the processor. So we need to create this function, which is also a simple helper function that allows us to
05:29:55.720 | Get the output from the processor
05:29:59.180 | So we load the image and we create the prompt. The processor expects as input the text
05:30:05.640 | as a list and the image as a list, even if we only work with one of each, so a list of size one.
05:30:11.800 | it takes the output of the processor which is the input ids the attention mask and the
05:30:16.120 | Pixel values of the image then it moves to the right device each of them
05:30:20.680 | So move to the right device is also a simple function that moves each tensor to the device specified by this function
05:30:27.640 | this parameter device
05:30:29.960 | And then returns it so now we have the input ids we have the attention mask we have the pixel values
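A minimal sketch of these two helpers (the processor is the PaliGemma processor built earlier in the video; its call signature here is an assumption):

    from PIL import Image

    def move_inputs_to_device(model_inputs: dict, device: str) -> dict:
        # move every tensor returned by the processor onto the chosen device
        return {k: v.to(device) for k, v in model_inputs.items()}

    def get_model_inputs(processor, prompt: str, image_file_path: str, device: str) -> dict:
        # the processor expects lists, even for a single prompt/image pair
        image = Image.open(image_file_path)
        model_inputs = processor(text=[prompt], images=[image])
        # model_inputs contains input_ids, attention_mask and pixel_values
        return move_inputs_to_device(model_inputs, device)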
05:30:36.440 | We create a KV cache, which is empty
05:30:40.040 | And then, based on how many tokens we need to generate... oh, I already removed the label,
05:30:45.400 | so let me remove it.
05:30:46.920 | Based on how many tokens we want to generate, we launch the inference.
05:30:51.640 | At the beginning this input ids only includes the prompt
05:30:56.040 | So it includes the image tokens and the text tokens without of course any output tokens because we need to generate the output
05:31:03.240 | So what we are doing at the first iteration of this for loop is the prefilling
05:31:08.440 | So the KV cache is empty the input ids contain the
05:31:11.560 | image tokens
05:31:13.880 | placeholders and the text tokens; the pixel values contain the image
05:31:19.080 | Loaded as a numpy array and then the attention mask, which is just a list of ones because we are never working with padding now
05:31:26.520 | the model itself
05:31:29.240 | So the PaliGemma model, which is this one here, will merge the image features that we are passing:
05:31:36.520 | it will run these pixel values through the image encoder, which will return some image features.
05:31:42.200 | Then
05:31:46.040 | we replace the image placeholder tokens with the image features extracted from the image encoder. So now we have a list of embeddings
05:31:53.660 | Where the first embeddings are the image embeddings and then the text embeddings
05:31:58.060 | And then we send it to the language model for decoding. So let's go back to the inference
05:32:03.480 | So the first iteration of this for loop is the prefilling
05:32:06.380 | Which means that the query key and values are the same sequence length and they contain the tokens of the prompt
05:32:13.080 | The output of the prefilling is a list of embeddings
05:32:17.820 | Which we project into logits
05:32:20.780 | But we take only the last logit to predict the next token
05:32:25.000 | So that's why we take out the logits and we take only the last logit here
05:32:29.160 | So this is the sequence dimension and we take the last item in this sequence dimension
05:32:33.480 | Now, what do we do with this logits?
05:32:36.040 | So now let's go to the iPad actually because I want to explain how top p works
05:32:42.040 | So let's go. Let me check if this is working. Yeah still working
05:32:47.080 | So now we can do this one
05:32:52.120 | This one. Okay
05:32:55.160 | Let's open a new page. So when you generate logits
05:32:59.320 | basically, it corresponds to a kind of a distribution after you apply the softmax
05:33:05.260 | So the logit is a vector. So let me draw here is a vector
05:33:11.240 | Where the number of dimensions is equal to the vocabulary size
05:33:21.400 | So you have one number for each
05:33:23.480 | token in the vocabulary, and it's an indication by the model of what it thinks the next token should be.
05:33:32.120 | What can we do to understand what the next token is?
05:33:36.680 | We need to apply the softmax, which will convert each of these numbers into a
05:33:42.920 | probability score, so something that sums up to one and is always non-negative,
05:33:49.000 | and we could take, for example, the highest one to predict the next token.
05:33:54.280 | another way is to use the
05:33:56.520 | Sampling method. So this is a list of numbers, right one for each position in the vocabulary
05:34:02.700 | So for example for the token hello, the model could say some score the token pizza
05:34:07.560 | It should give another score for the token
05:34:09.880 | I don't know car it will give another score, etc, etc
05:34:15.240 | We can also
05:34:16.760 | do sampling, which means that we take all of these numbers that we get
05:34:21.720 | and we sort them in decreasing order,
05:34:25.720 | and then we
05:34:30.040 | take the
05:34:31.640 | top ones,
05:34:33.640 | such that they sum up to a probability
05:34:37.180 | score. So what are we doing with top-p? With a top-p of 0.9,
05:34:44.060 | suppose that to the token "hello" the model has assigned a probability,
05:34:48.940 | Let's say of 0.2. This one is a 0.5 and this one is 0.1
05:34:53.580 | Then we have some other token. Let's say 0.05 and then another token. That is 0.1
05:34:59.660 | Again, I don't know if this sum up to one but okay and then some other token and some other token
05:35:05.260 | We sort them in decreasing order which means that we sort them like this. So we take hello
05:35:11.500 | zero, oh no, the first one should be pizza
05:35:14.060 | Pizza
05:35:18.060 | 0.5 then we have a hello
05:35:24.700 | And then we have a car
05:35:26.940 | 0.1 then we have something else that 0.1. Then we have something else that is 0.05, etc, etc
05:35:33.660 | With a top-p of, let's say... no, 0.9 is a little bit
05:35:39.740 | too much here, so let's say a top-p
05:35:43.180 | of 0.7,
05:35:47.020 | We will basically sample
05:35:49.180 | from this distribution by only considering
05:35:53.200 | the tokens such that their cumulative score reaches this value.
05:35:58.780 | So we basically take all the tokens whose
05:36:03.820 | probability scores, summed up, reach this amount, and then we sample from them.
05:36:09.420 | It's
05:36:11.500 | kind of a weighted sampling in which we
05:36:13.740 | consider, for example with 0.7, only these two tokens,
05:36:18.620 | and then we
05:36:20.780 | sample from them. We then rearrange these numbers such that, again, they sum up to one.
05:36:26.540 | So after normalizing them again, these values change:
05:36:32.700 | So this will become let's say 0.75 and this will become 0.25
05:36:37.040 | Then we sample again from this distribution
05:36:41.600 | So basically what will happen is that 75% of the time we will choose this token and 25% of the time
05:36:48.700 | we will choose this token. This is the meaning of top-p. So among all the tokens, we sort them,
05:36:54.300 | we take only the ones
05:36:56.780 | whose cumulative probability score reaches this top-p,
05:37:02.060 | and then we sample from them just like they are a distribution by themselves.
05:37:06.960 | Before sampling them, because they need to be a distribution, we need to normalize them again.
05:37:13.580 | So this is what we do with the top p instead what we do with greedy is that we just take the highest one
05:37:18.860 | And that's it. But with the top p we are actually
05:37:21.740 | sampling from this distribution
05:37:24.940 | But we are not considering everything
05:37:26.940 | to sample from, because for some of them the model is basically saying: don't use this token, because the probability score assigned to it
05:37:34.060 | is very, very low. So why should we even consider it? That's why we use top-p:
05:37:37.740 | we only consider the most likely ones chosen by the model,
05:37:41.340 | So we don't introduce any noise in the generation process
05:37:45.200 | What else I think nothing so let's go
05:37:50.540 | So what we are doing here we are sampling with the top p if we decided to sample
05:37:54.540 | Otherwise, we just take the one with the highest probability score, which is the greedy strategy if we don't want to sample
05:37:59.740 | There is also this thing called temperature. So what is temperature? Temperature basically means that we want to divide...
05:38:06.160 | as you can see here, we divide the logits by the temperature before applying the softmax,
05:38:13.280 | So that the
05:38:17.020 | Basically what happens is that before we apply the softmax these numbers are not
05:38:21.020 | probability scores, so they do not sum up to one.
05:38:24.620 | So for example, this may be 10. This may be 7. This may be 5. This may be 2. This may be 1
05:38:31.420 | This may be 0.1, etc, etc
05:38:33.980 | When we apply the... sorry, when we apply the temperature, we are basically
05:38:41.820 | Making the difference between them a little smaller
05:38:45.580 | So we basically if the model is giving us the following distribution
05:38:49.980 | So it's telling us that this token is likely but this is very
05:38:53.100 | Much more likely and this is less likely and this is less likely etc, etc
05:38:58.460 | What we are trying to do with the temperature is basically we are reducing the gap between these peaks
05:39:05.340 | So that the when we do the sampling here
05:39:08.860 | We are more likely to choose more diverse tokens
05:39:12.300 | Because with the temperature, what will happen is that "hello", instead of being chosen 25% of the time,
05:39:18.700 | will be chosen, let's say, 33% of the time, and the other one will become
05:39:22.140 | 0.66. So basically we are introducing some noise in the choice that we make,
05:39:28.780 | but only restricted to the tokens within the top-p of 0.7
05:39:32.460 | chosen by the model.
05:39:36.380 | I know it's a little difficult to visualize but basically with the temperature
05:39:40.060 | We are trying to make it more likely to choose more diverse tokens because we are reducing the gaps between the probability scores
05:39:47.500 | of the tokens
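As a small sketch of the idea (using the example scores mentioned above): dividing the logits by the temperature rescales them before the softmax, and the larger the temperature, the smaller the gaps between the resulting probabilities:

    import torch

    logits = torch.tensor([10.0, 7.0, 5.0, 2.0, 1.0, 0.1])  # the example scores from above
    for temperature in (1.0, 2.0):
        # dividing by the temperature shrinks the gaps between the logits
        probs = torch.softmax(logits / temperature, dim=-1)
        print(temperature, probs)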
05:39:49.580 | And then we do sampling with top-p, which I will
05:39:52.700 | add in a moment; it is a simple method that
05:39:55.900 | Does what we saw before so we sort by descending order and then we sample from the distribution
05:40:01.980 | So actually let's do it. I think it's let's do it one by one
05:40:05.260 | So sample top p
05:40:07.260 | Sample top p we can put it here. So as you can see we are sorting in descending order
05:40:14.300 | We are calculating the cumulative sum, and we are only keeping the tokens whose cumulative sum stays within the p parameter.
05:40:20.880 | We do it here
05:40:22.700 | so we mask out all the others and then we
05:40:25.340 | Normalize again so that they sum up to one because we have removed some tokens from this distribution
05:40:32.860 | And then we sample from this distribution using this multinomial and then we take the token
05:40:37.580 | chosen by this sampling operation
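A minimal sketch consistent with this description (probs is the softmax output of shape (batch_size, vocab_size); the exact details are an assumption):

    import torch

    def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
        # sort the probabilities in descending order and compute the running sum
        probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
        probs_sum = torch.cumsum(probs_sort, dim=-1)
        # zero out every token that falls outside the top-p nucleus
        mask = probs_sum - probs_sort > p
        probs_sort[mask] = 0.0
        # re-normalize so the surviving tokens form a distribution again, then sample
        probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
        next_token = torch.multinomial(probs_sort, num_samples=1)
        # map the sampled position back to the original vocabulary index
        return torch.gather(probs_idx, -1, next_token)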
05:40:40.300 | So we have applied the top p so now we know what is the next token we
05:40:46.300 | Take this token and we add it to this generated tokens array
05:40:51.340 | If the next token corresponds to the stop token, which is the end of sentence token, then we stop the generation
05:40:57.440 | Otherwise we keep generating
05:41:00.300 | And then, as you can see, we take this token as the input IDs for the next iteration,
05:41:06.160 | because we are using the KV-Cache:
05:41:08.860 | at each inference step we use as query only the last predicted token.
05:41:16.300 | So this is what we are doing here. So at the second iteration of this for loop
05:41:19.900 | Our input IDs will only become one single token
05:41:23.660 | And so the first iteration we are doing the prefilling. So the input IDs is all the tokens of the prompt
05:41:30.140 | So the image tokens and the text tokens of what we want to do with this image
05:41:34.300 | At the second iteration these input IDs will only be one token
05:41:39.260 | So how can the model work with only one token? Because the model always has access to all the previous
05:41:45.500 | keys and values, since they have been saved in the KV-Cache. So when we calculate the attention, the model will add this
05:41:52.540 | Single token to the KVCache retrieve whatever is inside the KVCache and use it to calculate the attention
05:42:00.060 | This way we generate tokens
05:42:02.060 | We keep increasing the attention mask by adding one because we want to attend to all the past token in the KVCache
05:42:08.940 | Because we don't have any padding
05:42:11.020 | Usually you are
05:42:13.580 | used to thinking of the padding as something that is present on the right,
05:42:16.860 | but actually padding can also be done on the left.
05:42:19.020 | Because on the left we don't have any padding tokens,
05:42:22.540 | the attention mask is always made up of ones, and in my implementation I am never working with padding anyway.
05:42:29.580 | We generate these tokens and we concatenate them together, because we save them into an array,
05:42:33.980 | so we need to build a tensor which is then sent to the tokenizer for decoding, and then we
05:42:39.420 | print the output of the model.
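Putting the pieces together, here is a minimal sketch of the decoding loop described above (all names are illustrative, and the model's keyword arguments and output keys are assumptions based on the classes built earlier in the video; batch size 1, no padding):

    import torch

    def greedy_or_top_p_decode(model, processor, inputs, kv_cache,
                               max_tokens_to_generate, temperature, top_p, do_sample):
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]
        pixel_values = inputs["pixel_values"]
        stop_token = processor.tokenizer.eos_token_id
        generated_tokens = []

        for _ in range(max_tokens_to_generate):
            # first iteration = pre-filling (the whole prompt); afterwards only one token
            outputs = model(input_ids=input_ids, pixel_values=pixel_values,
                            attention_mask=attention_mask, kv_cache=kv_cache)
            kv_cache = outputs["kv_cache"]
            next_token_logits = outputs["logits"][:, -1, :]   # only the last position
            if do_sample:
                probs = torch.softmax(next_token_logits / temperature, dim=-1)
                next_token = sample_top_p(probs, top_p)
            else:
                next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
            generated_tokens.append(next_token)
            if next_token.item() == stop_token:
                break
            # thanks to the KV-cache, the next query is just the token we generated
            input_ids = next_token
            attention_mask = torch.cat(
                [attention_mask,
                 torch.ones((1, 1), device=attention_mask.device, dtype=attention_mask.dtype)],
                dim=-1,
            )

        generated_tokens = torch.cat(generated_tokens, dim=-1)
        return processor.tokenizer.decode(generated_tokens.squeeze(0), skip_special_tokens=True)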
05:42:42.140 | And now we can finally run the generation. So the inference so I will copy the script that I have already prepared
05:42:48.960 | And this
05:42:52.140 | I have already saved the weights of the model
05:42:54.620 | So if you want to run this code, you need to download the repository of this model clone it locally
05:43:00.060 | and then you
05:43:03.100 | set the
05:43:05.900 | path to where you saved it, you give the prompt (my prompt is "this building is") and the model should tell me
05:43:11.580 | What is this building and the image file is this building here. It's a building in Xi'an, China
05:43:17.260 | And then we use this temperature the top p and we do not sample
05:43:23.020 | I want the greedy strategy and I also want to use CUDA. We run the script like this. So now let's run it
05:43:29.100 | I hope there are no problems
05:43:31.740 | I think yeah should be no problem. So launch inference. Let's see
05:43:38.780 | All right guys, so after I have launched the inference actually my computer went a little crazy
05:43:45.580 | So I had to switch back to using the cpu
05:43:48.780 | And then it worked because I don't know why my CUDA sometimes doesn't work and it blocks all my computer
05:43:54.460 | So if you run the inference using the code that we have made it should give this output
05:44:00.220 | So this building is the oldest clock tower in the world
05:44:03.180 | which, actually, I don't know if it's the oldest clock tower in the world, but this is called the Zhonglou.
05:44:08.220 | So it's a clock tower in Xi'an. So it's a very famous building and looks like the output is correct
05:44:13.900 | So thank you guys for watching this video. I know it has been a very very long journey
05:44:18.860 | And I had to do a lot of explanations. I had to kind of improvise sometimes to do this explanation
05:44:24.780 | So it is possible that there may be some
05:44:27.660 | Imprecisions in my way of explaining because I don't have a transcript that i'm reading
05:44:32.060 | For all of the things that I have talked about
05:44:34.460 | So sometimes, you know
05:44:36.220 | I just look at the code to try to come up with the right words to how to explain it
05:44:41.420 | And of course you cannot find always the right words immediately
05:44:45.420 | Maybe you need to watch it at least for one minute to get the right words
05:44:50.220 | Hopefully at least 90% of the content is super
05:44:52.780 | Correct and the other 10% maybe will have some noises
05:44:56.220 | So I will try to clarify the things that I have not explained correctly in the comments or in the description of the video.
05:45:02.060 | Thank you guys for watching this video. So please share it with your friends and
05:45:07.020 | Like it if you like it and subscribe to my channel
05:45:11.260 | A lot of people have asked me. What is the best way to contribute economically?
05:45:15.600 | to me, to support me, but thankfully, thank god, I don't need any economic support for now.
05:45:22.780 | If I would ever need it, I would be the first one to ask
05:45:25.740 | So if you want to help someone economically, there are many people in the world that you can help
05:45:29.740 | So there are people in war areas in palestine in ukraine. You can help them economically
05:45:34.880 | But for me, I just need you guys to follow me and to share my video. This is the best way to help me out
05:45:40.620 | Also, I work at a company known as writer and my team is currently hiring
05:45:44.620 | We are looking for amazing researchers and you can find more
05:45:48.060 | information in the description of the video
05:45:50.940 | We train our own models. We have plenty of gpus
05:45:54.060 | So if you are a researcher dealing with language models, or any other area of machine learning, feel free to
05:46:00.220 | Send your resume. So thank you guys and have a nice day