Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation
Chapters
0:00 Introduction
5:52 Contrastive Learning and CLIP
16:50 Numerical stability of the Softmax
23:00 SigLip
26:30 Why a Contrastive Vision Encoder?
29:13 Vision Transformer
35:38 Coding SigLip
54:25 Batch Normalization, Layer Normalization
65:28 Coding SigLip (Encoder)
76:12 Coding SigLip (FFN)
80:45 Multi-Head Attention (Coding + Explanation)
135:40 Coding SigLip
138:30 PaliGemma Architecture review
141:19 PaliGemma input processor
160:56 Coding Gemma
163:44 Weight tying
166:20 Coding Gemma
188:54 KV-Cache (Explanation)
213:35 Coding Gemma
232:05 Image features projection
233:17 Coding Gemma
242:45 RMS Normalization
249:50 Gemma Decoder Layer
252:44 Gemma FFN (MLP)
256:2 Multi-Head Attention (Coding)
258:30 Grouped Query Attention
278:35 Multi-Head Attention (Coding)
283:26 KV-Cache (Coding)
287:44 Multi-Head Attention (Coding)
296:00 Rotary Positional Embedding
323:40 Inference code
332:50 Top-P Sampling
340:40 Inference code
343:40 Conclusion
00:00:00.000 |
Hello guys, welcome back to my channel. Today we are going to code a visual language model from scratch 00:00:04.720 |
Now, first of all, what do I mean by a visual language model? 00:00:08.000 |
And what do I mean by coding it from scratch? 00:00:10.240 |
The visual language model that we will be coding is called PaliGemma, and it's a visual language model whose weights came out 00:00:18.960 |
some time ago, while the paper came out around two weeks ago 00:00:22.960 |
So we will be coding it from scratch, meaning that we will be coding from scratch the vision encoder 00:00:29.360 |
(you can see it here), the linear projection, which is just a linear layer, 00:00:35.680 |
the language model, which is a transformer, how to combine the embeddings of the image tokens with the text tokens 00:00:42.160 |
And of course how to generate the output using this conditioning. So first of all, what is a visual language model? 00:00:48.160 |
First of all, well visual language model is a language model that can extract information from an image 00:00:52.960 |
So if we have an image like this, for example and a prompt like this, for example, where is the photographer resting? 00:00:59.120 |
The visual language model can understand where this photographer is resting by looking at the image 00:01:04.640 |
And generating a response in this case. The response is in a hammock under a tree on a tropical beach 00:01:10.080 |
The topics of today basically are first of all, we will be talking about the vision transformer 00:01:15.760 |
Which is the vision encoder that we'll be using to extract information from this image 00:01:19.760 |
But this vision transformer has been trained in a particular way called contrastive learning 00:01:25.280 |
So we will be talking a lot about contrastive learning, because I want to review not only what contrastive learning is but also the models built on it: 00:01:32.400 |
the first well-known model is CLIP, which was then turned into SigLIP by Google 00:01:40.480 |
Then we will be coding the language model itself 00:01:43.040 |
So the Gemma language model, and how to combine the embeddings of the vision model and the language model 00:01:52.640 |
And we will be talking about the KV-Cache, because we want to run inference 00:01:58.320 |
in an optimized way, and the best way of course is to use the KV-Cache 00:02:05.200 |
Not only will we be coding it, I will also explain step by step how it works 00:02:08.800 |
The rotary positional encodings because we need them for the language model and the normalization layers because we have them in the vision model 00:02:15.760 |
And also the language model. We will be seeing what is the batch normalization, the layer normalization and the RMS normalization 00:02:21.520 |
I will be explaining all the math behind them 00:02:23.520 |
In this video i'm also using a slightly different approach at teaching let's say 00:02:28.640 |
Which is by drawing: I will be drawing every single tensor operation that we'll be doing, especially in the attention 00:02:34.800 |
Mechanism because I want people to not only look at the code and hope they get something 00:02:41.920 |
But actually I want to show each single tensor how it's changing by drawing it from scratch 00:02:48.640 |
I think this helps better visualize what happens in the transformer model, especially during the attention mechanism 00:02:54.340 |
So we know what each view operation each reshape operation that we are doing to each tensor and also the matrix 00:03:01.360 |
Multiplications that we are doing so we can visualize what happens to the tensors itself 00:03:05.680 |
What are the prerequisites for watching this video? 00:03:09.120 |
Well, you have a basic knowledge about the transformer. You don't have to be a master about it 00:03:14.880 |
It's better if you have watched my previous video on it 00:03:16.960 |
Which will give you the background knowledge to understand this video and you have a basic knowledge of neural networks 00:03:22.320 |
So at least you know, what is a loss function, you know, what is a linear layer? 00:03:25.440 |
And at least you know, what is backpropagation you don't need to know how it works or the mathematics behind it 00:03:32.560 |
But at least you know that we train models using backpropagation 00:03:35.460 |
Having said that guys, let's jump to work. So the first part I will be explaining is the vision transformer 00:03:44.160 |
So for this vision encoder we will be seeing what the contrastive part is about 00:03:47.840 |
and we will be coding it and then we will move on to how to combine the 00:03:52.960 |
Embeddings of the image tokens and the text tokens. The only part that we will not be coding is the tokenizer 00:03:59.700 |
Because I believe it's a separate topic that deserves its own video. So hopefully I will make another video about it 00:04:08.160 |
All right guys, before we go deep into each of these topics, let me give you a little 00:04:14.800 |
speech, actually. We will be exploring a lot of topics, like a lot of topics 00:04:20.800 |
We will be reviewing for example each of the single 00:04:23.600 |
Operations that we do in the attention mechanism and we will be looking at it from the code point of view 00:04:28.880 |
But also from the concept point of view and from the tensor operations point of view 00:04:34.640 |
There may be some topics that you are already familiar with and that's perfectly fine 00:04:39.120 |
There are some others that you are not familiar with and that's also perfectly fine because I will be explaining each topic multiple times 00:04:48.320 |
For example, we will be implementing the attention mechanism at least twice 00:04:50.960 |
So if you don't understand it the first time along with the code, then you will have another time to 00:04:56.080 |
Understand it and with a different explanation 00:04:59.520 |
And the same more or less goes for all the other topics. For example, we will be first introducing the 00:05:04.880 |
Normalization in one part and then I will review again the normalization 00:05:09.140 |
The positional encoding done in one way and then we will see another type of positional encoding 00:05:13.760 |
So don't worry if you don't understand everything at the beginning because I will be reviewing anyway each topic multiple times 00:05:23.600 |
So if there is some topic that I couldn't explain because of lack of time, 00:05:27.200 |
(for example, I will not be explaining how convolutions work because there are plenty of videos on how convolutions work) 00:05:32.480 |
you can pause the video, watch a five-minute video on how a convolution works, and then come back to this video 00:05:40.560 |
The second thing is: always write down all the code that I will be showing you, so write it 00:05:46.400 |
Line by line character by character because that's the best way to learn. So now let's get started 00:05:52.880 |
Let's start with the first part. So the first part we will be talking about is this contrastive vision encoder 00:05:58.400 |
Which is something that takes as input an image and converts it into an embedding, 00:06:03.700 |
actually a series of embeddings, as we will see, one for each 00:06:07.360 |
block of pixels of this image. So basically our image will be 00:06:12.320 |
split into blocks of pixels like this, into a grid, and each cell of this grid will be converted into an embedding, as you can see here 00:06:29.040 |
These image embeddings will then be combined with the text tokens embeddings; as you know, each token is converted into what is known as an embedding, 00:06:35.040 |
which is a vector of a fixed size. They will be concatenated and sent to the transformer, which will basically attend to these 00:06:41.520 |
Image tokens as a condition to generate the text. So this is called conditional generation 00:06:48.800 |
But okay, we will explore all this stuff here 00:06:51.760 |
Let's talk about this vision encoder now the vision encoder 00:06:55.200 |
First we need to understand why it's called a contrastive vision encoder, and to understand why it's contrastive 00:07:02.160 |
We need to understand what is contrastive learning 00:07:04.240 |
So let's go back to another slide, which is this one 00:07:13.600 |
Imagine for now, we will consider the image encoder as a black box and later 00:07:17.840 |
We will transform this black box into something more concrete 00:07:23.600 |
You go to the internet and when you go on wikipedia 00:07:26.260 |
You see an image and when you see an image there is always a description of what is inside that image 00:07:31.680 |
If you use a crawler you can crawl all of these images with the corresponding descriptions 00:07:37.460 |
And this will produce a dataset of images along with their descriptions 00:07:42.560 |
Now, for now, imagine we have a text encoder, which usually is a transformer model 00:07:50.400 |
And then we have an image encoder which most of the cases it's a vision transformer 00:07:58.560 |
So it's something that takes as input an image 00:08:01.940 |
and produces an embedding representation of this image 00:08:07.040 |
And if you feed a list of images, it produces a list of embeddings one corresponding to each image. What is this embedding? 00:08:13.920 |
It's a vector that captures most of the information of this image 00:08:17.600 |
And we do the same with this text encoder. So the text encoder is a transformer model that produces a series of embeddings. We will 00:08:27.120 |
But imagine you have this text encoder that given a text produces a single embedding of a single text 00:08:33.040 |
But if you feed it a list of text it will produce a series of embeddings each corresponding to one single text 00:08:42.240 |
The data set that we were talking about before which is the data set of images along with the corresponding descriptions 00:08:48.420 |
So imagine we feed this data set of images along with the corresponding description to the image encoder and respectively to the text encoder 00:08:57.520 |
It will produce a list of image embeddings and a list of text embeddings 00:09:02.580 |
Now, what do we want these embeddings to be? Of course, we want the embedding 00:09:08.980 |
Of the first image to be representative of that image 00:09:12.740 |
So we want this embedding to capture most of the information of that image 00:09:16.500 |
and of course, we want the embedding of the text number one to be 00:09:20.180 |
A vector that captures most of the information about that text 00:09:26.560 |
Moreover with contrastive learning we don't want only to capture information about the image or the text 00:09:33.200 |
But we also want some properties, and the property that we want from these embeddings is this: when you do the dot product of the embedding of an image with the 00:09:45.520 |
embedding of the corresponding text, it should give a high value for this dot product 00:09:51.840 |
And when you do the dot product of an image with a text that is not the corresponding one 00:09:56.880 |
It should produce a low number for this dot product 00:09:59.520 |
So basically with contrastive learning what we do we take a list of images 00:10:04.320 |
We take a list of text which is the corresponding text one for each of these images 00:10:08.880 |
So imagine that the image number one correspond to the text number one the image number two correspond to the text number two, etc 00:10:16.400 |
We encode them into a list of embeddings and then we want to train 00:10:20.800 |
This model so this text encoder and this image encoder to produce embeddings in such a way 00:10:26.880 |
That when the dot product of the image with its corresponding text is done 00:10:31.600 |
It should produce a high value and when you do the dot product of an image with a not corresponding text 00:10:36.960 |
For example i2 with text3 it should produce a low value 00:10:42.640 |
What we can do is basically we take this text embeddings, which is a list of embeddings 00:10:47.520 |
We take this image embeddings, which is a list of vectors 00:10:50.660 |
We do all the possible combinations of dot products 00:10:53.680 |
So the image number one with the text number one, image number one with the text number two, image number one with the text number three, etc. 00:11:00.480 |
Then we do all of them also for the text number one: 00:11:03.520 |
so the text number one with the image number one, text number one with the image number two, text number one with the image number three, etc. 00:11:10.240 |
And then we want to find a loss function that forces 00:11:13.520 |
These dot products to be high so that each text with its corresponding image to be high 00:11:18.880 |
While all the other possible combinations to be low in value 00:11:22.560 |
And we do that basically by using what is known as a cross entropy loss. So 00:11:29.120 |
To understand why we use cross entropy loss. We need to explore how language models are trained and we will do that very briefly 00:11:38.160 |
To not get us confused. So when we train a language model, we do so using what is known as the next token prediction task 00:11:45.680 |
Imagine we want to train a language model on the following sentence. So I 00:11:58.480 |
How do we train such a language model? Well, we give a prompt to this language model for now 00:12:15.760 |
The language model will produce a series of embeddings 00:12:18.580 |
Which are then converted into logits. So what are the logits? The logits are a vector, where each number indicates 00:12:27.200 |
what score the language model has assigned to each token being the next token, 00:12:32.560 |
among all the tokens in the vocabulary. So for example, imagine this first number here corresponds to the token "hello", 00:12:43.120 |
the second number here corresponds to the token, let's say, "pizza", 00:12:46.640 |
the third corresponds to the token "car", the fourth to some other token, etc. 00:12:54.800 |
Which one we want to be the next token? Of course, we know that the next token is a pizza 00:12:59.680 |
So we want the token number pizza to be high and all the other tokens to be low in value 00:13:04.480 |
So we use the cross entropy loss basically to make sure that the next token is pizza. So how do we do that? Basically we 00:13:13.040 |
Language model will output a list of numbers and we force the language model 00:13:17.200 |
To produce the following output. So pizza should be one and all the others should be zero 00:13:28.880 |
So basically what the cross entropy loss does is it takes a vector and converts it into a distribution 00:13:34.900 |
with the softmax function, and then we compare it with a label and we force the output to be equal to the label, 00:13:45.200 |
so that after the training the model learns to generate a distribution in which pizza is given a high number and all the others 00:13:52.320 |
Are given a low number and this is exactly the same that we do here for contrastive learning 00:13:59.680 |
To force for example in this column here only this number to have a high value and all the others to have a low value 00:14:08.480 |
Only this number to have a high value and all the other number in this 00:14:11.920 |
Row to have a low value and for example for this row 00:14:14.560 |
We want the second item to have a high value and all the others to have a low value, etc, etc 00:14:22.480 |
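To make the next-token idea above concrete, here is a minimal sketch (not the video's code; the vocabulary and logit values are made up just for illustration) of how cross entropy is applied to a model's output scores:

```python
# A minimal sketch of next-token prediction with cross-entropy (toy values, not the video's code).
import torch
import torch.nn.functional as F

vocab = ["hello", "pizza", "car", "dog"]          # toy vocabulary
logits = torch.tensor([[1.2, 3.5, -0.7, 0.1]])    # scores the model assigned to each token (batch of 1)
label = torch.tensor([vocab.index("pizza")])      # we want "pizza" to be the next token

# cross_entropy applies the softmax internally and compares the result with the label index
loss = F.cross_entropy(logits, label)
print(loss)  # the lower the loss, the higher the probability assigned to "pizza"
```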
Now here is the pseudocode that they show in the 00:14:27.520 |
CLIP paper on how to implement the CLIP training with the contrastive loss 00:14:31.840 |
So basically we have a list of images and a list of text 00:14:35.360 |
We encode them and they will become a list of vectors called image vectors and text vectors here 00:14:44.720 |
We normalize them; later we will see why we normalize stuff. 00:14:49.040 |
But okay, it's to make sure that we reduce the internal covariate shift; for now ignore it 00:14:53.680 |
Anyway, we normalize them later. We will talk about normalization 00:14:56.900 |
We calculate all the possible dot products between these embeddings 00:15:01.520 |
So the text embeddings and the image embeddings, so we basically generate this grid here 00:15:08.720 |
We generate the labels the labels are what well for the first row 00:15:13.280 |
We want the label the first item to be maximum for the second row the second item for the third row the third item 00:15:22.800 |
This is done with the function arange, which basically generates the numbers between zero and, in this case, n minus one 00:15:29.680 |
So for the row number zero, we want the item number zero to be maximum for the row number one 00:15:35.600 |
We want the item number one, etc, etc until the row number n minus one 00:15:38.880 |
We want the n minus one item to be the maximum one 00:15:42.480 |
Then we calculate the cross entropy loss between what is the output of the model 00:15:45.920 |
So what are the numbers assigned by the model to each of these dot products and what we want? 00:15:50.560 |
The maximum to be among these numbers. This is the labels 00:15:54.240 |
And we do it by rows and by columns, as you can see here. This gives us two 00:16:03.200 |
losses, and we compute the average, so we compute the average loss between all the rows and all the columns 00:16:10.480 |
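Here is a rough PyTorch sketch of the CLIP-style contrastive loss just described; it is an assumption-laden simplification (for instance, it omits the learned temperature scaling that CLIP uses), not the paper's exact pseudocode:

```python
# Rough sketch of a CLIP-style contrastive loss (simplified; no learned temperature).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # image_emb, text_emb: [n, d], one embedding per image / per text, produced by the encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # all possible dot products: logits[i, j] = image i . text j
    logits = image_emb @ text_emb.t()                 # [n, n]

    # the i-th image corresponds to the i-th text, so the "correct class" for row i is i
    labels = torch.arange(logits.size(0))

    loss_images = F.cross_entropy(logits, labels)     # softmax over each row (image vs all texts)
    loss_texts = F.cross_entropy(logits.t(), labels)  # softmax over each column (text vs all images)
    return (loss_images + loss_texts) / 2
```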
And this is how we do contrastive learning. Now, let's explore. What is the problem with CLIP? 00:16:20.560 |
Well, the problem with CLIP is very simple is that we are using the cross entropy loss 00:16:25.760 |
And the cross entropy loss basically needs to do a comparison between two distributions 00:16:32.160 |
So in language model we compare the output logits which are transformed into distribution 00:16:38.080 |
With the label so which item of this distribution we want to be the maximum one and we do the same here 00:16:45.600 |
We convert it into a distribution and we do it through a function called the softmax function 00:16:50.960 |
So the softmax function basically it is a function that takes as input a vector and converts it into a distribution 00:16:57.860 |
What does it mean? It means that when you have a vector like this, for example, it will be a list of numbers 00:17:04.960 |
To be a distribution each of these numbers needs to be non-negative. So it needs to be 00:17:09.760 |
Greater than or equal to zero and plus all of these numbers needs to sum up to one 00:17:18.320 |
The model will predict some arbitrary numbers: we cannot force the sum of these numbers to be one, and we cannot force the numbers to be non-negative. 00:17:18.320 |
So we apply to the output of the model this function called the softmax 00:17:31.440 |
Which transforms them into a distribution and then we can compare it with the labels 00:17:35.040 |
So our label in the case for example for the first 00:17:40.160 |
So we want the first item to be zero the second item to be one and this one to be zero this one to be zero 00:17:45.200 |
This one to be zero this one to be zero, but we need to apply the softmax to the output of the model 00:17:57.920 |
This is the expression of the softmax: basically we take the output of the model and we 00:18:03.120 |
exponentiate each item in the output vector, which could be a row or a column 00:18:08.240 |
And after exponentiating we also divide them with the sum of all the other items 00:18:17.760 |
So which means that we need to calculate first of all for each row the exponential of the item 00:18:23.840 |
And then we need to divide by the sum of all the exponentials of all the other items including itself 00:18:28.800 |
The the problem is that we are using this exponential. The exponential is basically a function that grows very fast 00:18:41.680 |
And this is a problem for computers because in computers we store numbers using a fixed representation 00:18:48.480 |
Which could be 16 bit or 32 bit which means that we cannot represent up to infinity 00:18:53.520 |
But we can represent each number up to 2 to the power of n minus 1 basically if you don't have negative numbers 00:18:59.520 |
So if the exponential is too big then our numbers will grow too much and it may not be represented by 32 bit 00:19:07.440 |
And that's a problem. So we need to make this softmax function numerically stable 00:19:13.520 |
So whenever you heard the term numerical stability in terms of computer science 00:19:17.360 |
It means that we want to make sure that the number can be represented within 32 bits or 16 bits or whatever 00:19:28.640 |
Well, the trick is this. In the softmax, each item is exponentiated and divided by 00:19:41.680 |
this denominator, which is known as the normalization constant, and which is the sum of the 00:19:47.360 |
exponentials of all the items in the vector 00:19:52.320 |
So in a fraction you can multiply the numerator and the denominator by the same number without changing the fraction 00:19:59.840 |
Each number can be written as the exponentials of the logarithm of the number 00:20:06.160 |
And this is because the exponential and the log are inverse functions 00:20:10.400 |
So we can write c as follows. So the exponential of the log of c 00:20:14.480 |
By using the properties of the exponential, which say that 00:20:21.280 |
the product of two exponentials is equal to the exponential of the sum of the arguments, 00:20:28.400 |
we can then bring this exponential inside the summation because of the distributive property of the product with respect to the sum, and apply again the 00:20:37.920 |
rule above, which is that the product of two exponentials is equal to the exponential of the sum of the arguments 00:20:43.620 |
Now what we notice is that if we subtract something from this exponential 00:20:52.480 |
We can make the argument of the exponential smaller which may make it numerically stable 00:20:58.320 |
So what we choose as this log of c, basically we choose the 00:21:02.700 |
Negative maximum number in the array that we are normalizing using the softmax 00:21:07.440 |
This way basically the argument of the exponential will decrease and it will be less likely that this exponential will overflow 00:21:22.460 |
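To summarize the derivation above in one place, here is the standard safe-softmax rewriting, with x_i denoting the elements of the vector being normalized (this is just a restatement of the steps described in the transcript):

```latex
\mathrm{softmax}(x)_i
  = \frac{e^{x_i}}{\sum_j e^{x_j}}
  = \frac{c\, e^{x_i}}{c \sum_j e^{x_j}}
  = \frac{e^{x_i + \log c}}{\sum_j e^{x_j + \log c}},
\qquad \text{choosing } \log c = -\max_k x_k
\;\Rightarrow\;
\mathrm{softmax}(x)_i = \frac{e^{x_i - \max_k x_k}}{\sum_j e^{x_j - \max_k x_k}}
```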
Now this basically means that to calculate the cross entropy loss over each of these rows and columns, 00:21:32.940 |
first of all the model needs to output a list of 00:21:36.460 |
Text embeddings and a list of image embeddings as you can see then we do all the possible dot products 00:21:45.260 |
We need to find the maximum value in this column so that we can subtract it before calculating the softmax 00:21:51.120 |
Then we need to apply the exponential to each of these items 00:21:54.780 |
then we sum up all of this exponential to calculate the 00:21:59.160 |
Normalization constant then we divide each of these numbers by this normalization constant 00:22:03.800 |
So as you can see, applying the cross entropy loss involves a lot of computations, and 00:22:09.960 |
also it forces you to always have full rows or columns in memory. Imagine you want to parallelize this operation across devices: 00:22:21.640 |
this device here needs to have the whole row in its memory, because it needs to calculate this normalization constant, 00:22:27.960 |
so it needs to have access to all of this row; and if you want to parallelize by columns, each device needs to have a whole column 00:22:35.800 |
in its memory, because you need to calculate first of all the maximum item, then you need to calculate this normalization constant, 00:22:41.960 |
then you need to normalize them, so dividing by this normalization constant 00:22:47.400 |
But also it makes it difficult to parallelize because at any moment each device needs to have at least one full row or one full 00:22:53.960 |
Column, which does not allow us to go to very big batch size 00:22:57.880 |
And this is a problem. So if you look at the SigLIP paper, they note that, 00:23:04.360 |
due to the asymmetry of the softmax loss, the normalization is also independently performed two times 00:23:10.600 |
So first of all to make the softmax numerically stable, we need to go through each single vector calculate the maximum 00:23:19.160 |
but then we also need to calculate the softmax by rows and then by columns why because this 00:23:25.800 |
Matrix here is not symmetric. So as you can see 00:23:28.920 |
This is image number one with all the text and this is 00:23:32.840 |
Text number one with all the images and this item here is not equal to this item here 00:23:37.480 |
Because this is image number one with the text number two, and this is image number two with the text number one 00:23:43.640 |
Because it's not symmetric means that you need to calculate the softmax for each single rows 00:23:48.040 |
And then you need to calculate it for each single column and then you can calculate the loss 00:23:52.840 |
So the problem with CLIP is that it's very computationally expensive to calculate this contrastive loss; 00:24:00.680 |
that's why in the SigLIP paper they propose to replace the cross entropy loss with the sigmoid loss. 00:24:13.160 |
Again, we have an image encoder that converts a list of images into a list of embeddings, one for each image 00:24:19.880 |
Then we have list of text which convert each text into a list of embedding one for each text 00:24:29.320 |
We calculate this all the possible dot products 00:24:31.880 |
So the image number one with the text number one image number two with text number two and also image number one with text 00:24:37.160 |
Number two text number three text four text five blah blah. So all the possible dot products between all these embeddings 00:24:43.100 |
Then, instead of treating the loss as a distribution over a row or a column, 00:24:55.160 |
where we say in this column I want this item to be maximum, or in this row I want this item to be maximum, 00:25:06.040 |
we treat it as a binary classification task using the sigmoid loss, 00:25:09.720 |
In which each of these dot products is treated independently from each other 00:25:15.400 |
So this is considered a single binary classification task in which we say okay this item here should be one 00:25:21.880 |
This item here should be zero. This item here should be zero. This item here should be zero independently of what are the other items 00:25:29.400 |
This one here should be zero. This one should be here zero, etc, etc, and we can do that with the sigmoid function 00:25:35.480 |
So as you can see, this is the mathematical expression of the sigmoid function 00:25:38.920 |
It takes as input this value called z which will be the dot product of our vectors 00:25:45.000 |
And the output of the sigmoid is this stuff here, which is a number between zero and one 00:25:51.160 |
So what we can do is we take each of these dot products. We run it through a sigmoid 00:25:55.900 |
And then we force the label to be one for corresponding 00:26:01.240 |
text and images and zero for non-corresponding ones. So each of these dot products now becomes an independent binary classification task. This allows us to 00:26:13.240 |
grow the batch size to millions of items and also to parallelize, because we can put this block here into one device 00:26:20.600 |
And it can calculate it independently from this other device because they do not need to calculate any normalization 00:26:27.400 |
Constant for each item or the maximum item in each row or column because each of them is independent from the others 00:26:34.360 |
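Here is a rough sketch of this pairwise sigmoid loss idea; it is a simplification (for example, it ignores the learnable temperature and bias terms that the SigLIP paper uses), not the paper's exact formulation:

```python
# Rough sketch of a SigLIP-style pairwise sigmoid loss (simplified: no learnable temperature/bias).
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # image_emb, text_emb: [n, d]
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t()                      # [n, n] dot products

    # label is 1 on the diagonal (matching pairs), 0 elsewhere; every entry is an
    # independent binary classification, so no row/column normalization is needed
    labels = torch.eye(logits.size(0), device=logits.device)
    return F.binary_cross_entropy_with_logits(logits, labels)
```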
Now you may be wondering why we are even using a contrastive vision encoder. 00:26:41.960 |
Why can't we just use an ordinary vision encoder that just takes an image and extracts some kind of embeddings that capture the information? 00:26:54.280 |
Well, we want these embeddings to not only capture information about the image, but we want these embeddings to be a 00:27:01.720 |
Good representation that can be then contrasted or can be used along with text embeddings 00:27:09.100 |
And this is exactly what we do in a vision language model. We extract some 00:27:16.020 |
embeddings, which are vectors representing, as we will see later, a patch of the image. 00:27:21.480 |
So you need to think of this image as being divided into a grid, and this first cell corresponds to the first embedding. 00:27:28.920 |
So we produce in this case, for example, nine embeddings, which are nine vectors. 00:27:33.800 |
Each of them represents information about a patch of the image, 00:27:42.200 |
and we want these embeddings not only to represent the information of these patches, but also to be able to be contrasted with the text, 00:27:48.520 |
Which is what we do in a visual language model 00:27:50.360 |
So we have some prompt and we kind of contrast it with the image embeddings to produce an output 00:27:57.560 |
It is not really a contrastive learning in this case because we are using it as a condition 00:28:02.600 |
We will see later how these things are merged 00:28:04.920 |
But we want a vision encoder that is already trained to be used with text, because it has a better 00:28:11.880 |
representation of the image for being used along with the text. That's why we use a contrastive vision encoder. 00:28:18.360 |
Also, we use them because they are cheaper to train: 00:28:21.880 |
basically, to train a contrastive vision encoder 00:28:26.600 |
You just need to crawl billions of images from the internet 00:28:30.360 |
Each of them already has a kind of a description because you can for example in wikipedia 00:28:35.480 |
You always have the description of each image, but also the internet when you have an image you always have the html alt text 00:28:44.040 |
Which is the alternative text that is displayed when the image is not shown 00:28:47.320 |
So you always have access to some kind of description 00:28:49.980 |
Now, of course, this data may be noisy because we crawl stuff from the internet, 00:28:55.400 |
Which means that this stuff may not always be correct 00:28:58.280 |
So sometimes you see a picture but the description displayed is not correct or maybe the crawler didn't get the correct information 00:29:04.920 |
But because we train it on billions and billions and billions of images eventually it learns a good representation of this image 00:29:13.880 |
So this vision encoder that we will be using is basically a vision transformer. So now let's talk about the vision transformer 00:29:24.600 |
So the vision transformer is basically a transformer that was introduced in the paper "An Image is Worth 16x16 Words", 00:29:32.680 |
in which basically they train a transformer as follows. We will explore later the 00:29:45.640 |
attention mechanism, but for now I just need you to remember that the transformer model is a sequence-to-sequence model, 00:29:52.520 |
which means that you feed it a sequence of embeddings and it outputs a sequence of embeddings. 00:30:00.180 |
What we do to encode an image with the vision transformer is we take an image and we 00:30:07.240 |
Split it into patches and in this case, for example, we can split into 16 patches 00:30:13.000 |
So this is the first group of pixels. This is the second group of pixels 00:30:17.160 |
This is the group of pixels on the bottom right of the image, this one is on the top right, etc., etc. Then we extract 00:30:26.280 |
information about each patch using a convolution. 00:30:29.020 |
So when you run a convolution you can extract information about a group of pixels from the image 00:30:36.120 |
And then for example, this one will produce this output 00:30:39.640 |
This one the convolution of this patch will produce this output. The convolution of this patch will produce this output, etc, etc 00:30:46.520 |
And then we flatten them, so we lose the positional information: 00:30:50.300 |
we don't care anymore if this patch number four was on the top right or the bottom left. 00:31:00.200 |
We lose the two-dimensionality in this case, basically, so we transform it into a sequence of patches. 00:31:08.760 |
Then we add this position information so we say that okay, this is the patch number one 00:31:16.680 |
This patch basically the embedding of this patch that will be the result of this convolution will be a vector 00:31:22.600 |
We add to this vector another vector that tells the model 00:31:27.800 |
Hey, this is the patch number one and this is the patch number two, and this is the patch number three, etc, etc 00:31:32.920 |
So we do that by adding, so this plus operation you can see here. Unlike in the 00:31:38.920 |
vanilla transformer, or the transformer model that we see for language models, 00:31:42.600 |
these positional encodings are not calculated using sinusoidal functions; they are learned. 00:31:48.040 |
So they are vectors that get added always so the positional encoding number one always gets added to the top left 00:31:55.720 |
Patch the positional number two always gets added to the second patch from the top left, etc, etc 00:32:02.040 |
The positional encoding number 16 always gets added to the bottom right patch, so that the model 00:32:09.560 |
kind of has access to the 2D structure of the image. 00:32:13.800 |
So the model will basically learn that the patch number 16 is always on the bottom right and this one is always on the top left, 00:32:22.200 |
So this is a series of embeddings, because the sum of two embeddings is still an embedding. 00:32:30.760 |
Let's consider it as a black box and later when we code it, we will explore each layer of this transformer 00:32:35.500 |
The transformer what it does it does the contextualization of these embeddings 00:32:40.680 |
So at input we have this each series of embeddings each of them representing one single patch 00:32:47.640 |
The output of the transformer through the attention mechanism will be a series of embeddings again 00:32:52.920 |
But each of these embeddings is not only capturing information about itself, but also about other patches 00:33:02.440 |
In language models, in the attention mechanism, we use what is known as the causal mask. So this first 00:33:08.280 |
embedding should be capturing information only about itself, the second one only about itself and the previous one, the third one 00:33:17.240 |
about itself and the two previous ones, the fourth one about itself and the three previous ones, etc. 00:33:23.000 |
This is what we do with the language models. With visual language models, in the... 00:33:27.880 |
sorry, not with visual language models, but with the vision transformers, 00:33:35.720 |
we do not make the model autoregressive: we don't want these patches to only encode information about the previous patches, because in an image 00:33:43.240 |
there is no autoregressiveness. So it's not like the patch number 16 of an image 00:33:48.920 |
It depends only on the previous patches and the patch number one does not depend on any others 00:33:53.960 |
Because imagine you have an image in which the sun is here or the light source is here 00:34:00.360 |
then this part here will be illuminated, but also this other part here. 00:34:05.320 |
So the illumination here depends on what is coming after in the image. 00:34:10.680 |
So in the image, we don't have this autoregressive relationship, 00:34:15.400 |
while in text we do, because we write the text from left to right or from right to left; 00:34:21.080 |
But anyway, each word that we write depends on what we have written previously 00:34:25.400 |
But this doesn't happen with images. So basically these contextualized embeddings 00:34:30.460 |
capture information about themselves, but also about all the other embeddings. 00:34:37.800 |
We use these contextualized embeddings to capture information about each patch, 00:34:43.080 |
but also about how it is present in the image. That's why we want them to be contextualized. 00:34:47.740 |
So we want each patch to include information about its position, which is given by the positional encoding, and about the rest of the image, which is given 00:34:58.600 |
by contextualizing them. So when we code it, this will be more clear. For now, I just want you to get an 00:35:05.400 |
Idea of what we are going to code. So we are going to code a model that will take an image will apply a convolution 00:35:13.020 |
To extract a series of embeddings. You can see here. We will add a positional encoding to these ones 00:35:19.560 |
Which are learned we will apply the attention mechanism 00:35:23.480 |
Which is will be a series of layer actually of the transferable model that will contextualize these embeddings 00:35:29.080 |
And then we will use this contextualized embedding as input to the language model for decoding the output of the language model 00:35:40.920 |
I will be using a slightly different approach, which is: I will not be typing the code live; 00:35:45.560 |
I will be copying each line and explaining it step by step, because I want this video to be more about explanation than just 00:35:52.040 |
Coding because I want to use the code for explaining what happens under the code under the hood 00:35:58.280 |
So let's create our first file, which is the modeling 00:36:07.560 |
And let's start by importing stuff which we need I don't need copilot 00:36:14.060 |
And then we create our first class, which is the SigLip config 00:36:19.100 |
So, what is this? Basically we will be using this vision encoder, and this vision encoder will have some 00:36:27.700 |
configuration. Why do we need a configuration class? Because PaliGemma comes in different sizes, 00:36:39.540 |
which means that each of these PaliGemma models has a different configuration for its vision encoder. 00:36:48.420 |
The hidden size basically is the size of the embedding vector of this vision transformer that we are going to code. 00:36:57.700 |
The intermediate size is the size of the linear layer that we use in the feed-forward network. 00:37:02.340 |
The number of hidden layers is the number of layers of this vision transformer. 00:37:06.820 |
The number of attention heads is the number of attention heads in the multi-head attention. 00:37:10.500 |
The number of channels is how many channels each image has, which is three for RGB. 00:37:15.080 |
The image size: PaliGemma comes in, if I remember correctly, three sizes, so 224, 448 and 896. 00:37:26.180 |
The default configuration that we put here is for PaliGemma 224, 00:37:29.960 |
which of course supports images of size 224. So if you provide any image, it first gets resized into that size. 00:37:39.840 |
The patch size is the size of each patch. So what is this number? 00:37:42.980 |
Each image will be divided into patches, and each patch will be 16 by 16 pixels. 00:37:52.260 |
Then we have a parameter for the layer normalization; we will see it later. 00:37:54.420 |
The attention dropout is another parameter that we will not be using in the attention calculation; 00:37:58.900 |
basically, it's a dropout that we use in the attention, but we will not be using it. 00:38:02.660 |
And the number of image tokens indicates how many output embeddings this vision transformer will output. 00:38:17.460 |
Now, we saw before that an image encoder is something that converts an image into one single embedding 00:38:24.340 |
So that represents all the information about that image 00:38:27.140 |
but in the case of the vision transformer, we can use all the outputs of the vision transformer, because as we saw before 00:38:33.940 |
Vision transformer is a transformer model. So which takes as input 00:38:38.180 |
A list of embeddings and it outputs a contextualized embedding 00:38:42.820 |
So each of these contextualized embedding will be the tokens of our image 00:38:46.740 |
so it will not be one single embedding that represents the whole image, but 00:38:49.940 |
a list of embeddings, each representing a patch of the image, but also containing information about the other patches through the attention mechanism. 00:38:57.460 |
But we will see this later. So for now this class is very basic: it's just the configuration of our SigLip model. 00:39:03.380 |
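As a reference, here is a minimal sketch of what such a configuration class might look like; the field names mirror the values just described, but treat the exact names and defaults as assumptions rather than the video's verbatim code.

```python
# Minimal sketch of a SigLip vision configuration class (field names/defaults are assumptions).
class SiglipVisionConfig:
    def __init__(
        self,
        hidden_size=768,            # size of the embedding vector of each patch
        intermediate_size=3072,     # size of the linear layer in the feed-forward network
        num_hidden_layers=12,       # number of transformer layers
        num_attention_heads=12,     # heads in the multi-head attention
        num_channels=3,             # RGB
        image_size=224,             # input images are resized to image_size x image_size
        patch_size=16,              # each patch is patch_size x patch_size pixels
        layer_norm_eps=1e-6,        # epsilon parameter for layer normalization
        attention_dropout=0.0,      # dropout in attention (unused here)
        num_image_tokens=None,      # how many image embeddings the vision tower outputs
        **kwargs,
    ):
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.image_size = image_size
        self.patch_size = patch_size
        self.layer_norm_eps = layer_norm_eps
        self.attention_dropout = attention_dropout
        self.num_image_tokens = num_image_tokens
```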
Now let's start by coding the structure of this vision transformer. So let me copy this stuff here 00:39:16.260 |
I am copying the code because I have already written it before, and I want to explain it instead of 00:39:21.780 |
typing it live, because this also allows me to copy the comments and to avoid any mistakes while coding. 00:39:29.220 |
But I recommend that you code it from scratch. So you take this video and you just type whatever I am pasting here 00:39:37.460 |
This is the best way to learn because it's like when you study a mathematical proof 00:39:42.500 |
You should not just watch the proof on the piece of paper 00:39:45.860 |
Because even if it you think it makes sense to you 00:39:49.460 |
It doesn't actually because when you write it by hand, so when you code each of these lines by hand 00:39:55.300 |
Your mind will think why am I typing this? Why am I writing this? Why am I multiplying this number by this number? Why am I? 00:40:03.380 |
Calling this function so you question yourself when typing 00:40:08.180 |
That's why I recommend that you type this code while I am pasting it 00:40:12.420 |
I do it by pasting otherwise this video will be 20 hours 00:40:17.140 |
The first thing that we do is we create this vision 00:40:19.140 |
Model, this vision model is made up of a transformer and it has a configuration 00:40:23.380 |
So basically what we are doing is we take the pixel values of our image, which will be loaded with NumPy. 00:40:29.300 |
So when you load an image with NumPy, it gets converted into an array that is channels by height by width. 00:40:35.540 |
But we can have a batch of images. That's why we have a batch size here. So the batch dimension 00:40:41.940 |
And our vision transformer will convert this into a tensor of shape batch size by num patches, 00:40:47.140 |
which is the num image tokens we have here, by embed dim, so each 00:40:51.300 |
vector will be of a fixed dimension called embed dim here. 00:40:56.340 |
So basically our vision model will take an image as you can see a batch of images and it will give us a batch of 00:41:04.100 |
List of embeddings one list of embeddings for each image where each embedding is a vector of size embeddim 00:41:11.480 |
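Here is a rough sketch of what this top-level wrapper might look like; it assumes the SiglipVisionConfig sketched earlier and the SiglipVisionTransformer module that we are about to build (shapes in the comments follow the description above):

```python
import torch
from torch import nn

# Rough sketch of the top-level vision model wrapper (assumes SiglipVisionConfig and
# SiglipVisionTransformer defined elsewhere in this walkthrough).
class SiglipVisionModel(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.vision_model = SiglipVisionTransformer(config)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: [batch_size, channels, height, width]
        # returns:      [batch_size, num_patches, embed_dim]
        return self.vision_model(pixel_values=pixel_values)
```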
Okay. Now let's code the vision transformer, which is very simple also 00:41:27.400 |
Where we pass the configuration we save this embeddim, which is the hidden size 00:41:31.560 |
We saw before which is the size of this embedding vector 00:41:34.360 |
We first need to extract the embeddings from this 00:41:40.180 |
We need to extract the patches from this image, which will be done with this layer that we will call SiglipVisionEmbeddings. 00:41:46.680 |
Then we will run it through a list of layers of the transformer 00:41:51.060 |
Which is this SigLip encoder because it reminds the encoder of the transformer 00:41:55.380 |
Which is a series of layers of transformer and then we will have a layer normalization and we will see later how layer normalization works 00:42:07.060 |
So the forward method is basically we take these 00:42:09.700 |
pixel values, which is a batch of images, and we convert them into embeddings, which 00:42:16.100 |
basically means that we are extracting the patches from these images. So let's visualize it here. 00:42:25.540 |
Image embeddings we are taking these images. We will run a convolution here to extract patches 00:42:32.260 |
Then we will flatten these patches and add the positional encodings 00:42:35.960 |
And this stuff here will be done by this SiglipVisionEmbeddings layer. Then we take these 00:42:44.420 |
patches plus the positional encodings and we run them through this encoder, which is a list of layers of the transformer. 00:42:51.300 |
So this stuff here is our encoder. What is the encoder? 00:42:54.340 |
Well, the encoder is a list of layers of the transformer 00:42:57.860 |
So you can think of it as being a list of these layers here. Actually these layers here 00:43:02.820 |
one after another which includes a multi-head attention, a 00:43:07.300 |
normalization, a feed-forward network and the normalization 00:43:10.440 |
In the case of the vision transformer, the normalization is done before the feed-forward and before the multi-head attention, but that's the only difference. 00:43:17.940 |
So this part here, this series of layers, is what here 00:43:24.100 |
we call the encoder, because it resembles the encoder side of the transformer. 00:43:28.200 |
And then we have a layer normalization. So now let's go to code this vision embeddings 00:43:34.500 |
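Before moving on to the embeddings, here is a sketch of how the pieces just listed (embeddings, encoder, final layer norm) might fit together in this wrapper; SiglipVisionEmbeddings and SiglipEncoder are assumed to be the modules built in the rest of the walkthrough:

```python
import torch
from torch import nn

# Sketch of the vision transformer wrapper: patch embeddings -> encoder layers -> final layer norm.
class SiglipVisionTransformer(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        embed_dim = config.hidden_size
        self.embeddings = SiglipVisionEmbeddings(config)   # image -> patch embeddings + positions
        self.encoder = SiglipEncoder(config)                # stack of transformer layers
        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: [batch_size, channels, height, width]
        hidden_states = self.embeddings(pixel_values)       # [batch_size, num_patches, embed_dim]
        hidden_states = self.encoder(inputs_embeds=hidden_states)
        return self.post_layernorm(hidden_states)           # [batch_size, num_patches, embed_dim]
```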
So we want to extract information about these patches 00:43:37.880 |
Let's do it. Where are the vision embeddings? Here. Okay 00:43:56.100 |
Taking again the configuration because each of these models needs to have access to the configuration because they need to extract different 00:44:01.860 |
Information from this configuration. So we have the embedding size, which is the size of the embedding vector, which is the hidden size 00:44:10.980 |
And the patch size is how big the patch that we want to get from this image is. So basically we are talking about 00:44:20.900 |
blocks of pixels; in this case the patch size, I remember, is 16, 00:44:23.940 |
which means that each patch we are going to take here is going to be 16 by 16 pixels. 00:44:32.000 |
How do we extract these patches? We do that through a 2D convolution, which takes as input 00:44:38.740 |
the number of channels of the image, so three channels, RGB, and it produces output channels equal to the embedding size. 00:44:49.620 |
Then the kernel size: as you remember, the convolution works like this, so let's use the iPad actually to draw. 00:44:56.020 |
The convolution works like this. So we have an image 00:44:58.900 |
Which is made up of let's say pixels. So suppose this is the grid of pixels 00:45:09.780 |
Basically the convolution works like this imagine the kernel size is three by three 00:45:16.020 |
So we take a three by three group of pixels. We apply this convolution kernel 00:45:21.220 |
So if you are not familiar with how convolutions work, I will not be reviewing that here 00:45:26.100 |
But basically it means that we have a matrix here 00:45:28.260 |
You multiply each number of this matrix by the value of the pixel on which it is applied, and summing them up it will produce one output feature. 00:45:39.700 |
And then you slide this kernel to the next group of pixel then you slide it again 00:45:44.900 |
Slide it again, etc, etc, and it will produce many features in the output features 00:45:49.700 |
However at as input we have three channels which you can think of it as three 00:45:55.700 |
Parallel images one that is only red one that is only green and one that is only blue 00:46:01.460 |
We run this kernel on all of these channels, and it will produce a number of output features 00:46:09.920 |
depending on how many output channels we want. So for each output channel 00:46:15.440 |
we actually have three kernels, one for each of these input channels. 00:46:27.440 |
The stride is how much we move the kernel from one group of pixels to the next, and we are using a stride that is equal to the patch size, 00:46:34.240 |
which is equal to the kernel size. So which means that we take the first, oops, 00:46:40.400 |
we take the first group of, let's say, three by three pixels, 00:46:43.440 |
then we skip three pixels, so we slide it to the next group of three by three, so there is no overlap. 00:46:54.400 |
Then we slide it to this group of pixels here so that there is no overlap. So basically what we are taking is a 00:46:59.280 |
list of features, each extracted from an independent patch of this image that we run the kernel on. 00:47:07.840 |
And the padding "valid" means that there is no padding added. 00:47:11.200 |
So basically this patch embedding is extracting information from our image patch by patch 00:47:18.000 |
Where there is no overlap between these patches. How many patches do we have? 00:47:21.920 |
Well, it's the size of the image, which is 224 in the base version of PaliGemma. 00:47:31.200 |
So it is the image size, which is the number of pixels per side, divided by how big each patch is, and then to the power of two, because we have 00:47:38.000 |
this image along two dimensions. So we run the patch over both dimensions; the patch 00:47:41.840 |
is a square, so it's 16 by 16 or 3 by 3 or whatever the patch size is. 00:47:55.360 |
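As a concrete check of this formula with the default values above:

```python
# num_patches = (image_size // patch_size) ** 2
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2   # (224 // 16) ** 2 = 14 ** 2 = 196
print(num_patches)  # 196 patches, i.e. 196 image embeddings
```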
The number of positions is equal to the number of patches that we have, because we need to encode information about where each patch came from. 00:48:01.280 |
So how many positional encodings we need equal to the number of patches that we have 00:48:06.080 |
And what is each of this positional encoding? It's a vector. It's a vector of the same size of the patch 00:48:11.920 |
So it's equal to embeddings. You can see here 00:48:14.480 |
And it's a learned embedding. So it's a positional encoding that is a learned 00:48:20.160 |
embedding. How many do we have? We have num positions of them, each of them with this size here. 00:48:26.320 |
And we will see later that each of them is added to the information extracted from the convolution 00:48:32.160 |
So that each convolution output encodes information about where it came from in the image 00:48:40.800 |
We also save the position ids in the module, which are just a list of numbers, and we will use them later. 00:48:47.440 |
So this is just a range of numbers between zero and num positions minus one. 00:49:00.320 |
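Collecting these pieces, a sketch of the embeddings module's constructor might look like this (its forward pass is discussed right after); the exact attribute names are assumptions based on the description above:

```python
import torch
from torch import nn

# Sketch of the patch + position embeddings module described above
# (assumes the SiglipVisionConfig sketched earlier).
class SiglipVisionEmbeddings(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        # Extracts one embed_dim-dimensional feature per non-overlapping patch
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,     # stride == kernel size -> no overlap between patches
            padding="valid",            # no padding added
        )

        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches
        # One learned positional vector per patch
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        # [0, 1, ..., num_positions - 1], saved in the module so we can reuse it later
        self.register_buffer(
            "position_ids",
            torch.arange(self.num_positions).expand((1, -1)),
            persistent=False,
        )
```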
Again, I copy and paste the code because I can copy all the comments without typing them one by one; otherwise, it would take me forever. 00:49:06.000 |
So what we do now is okay. We had our image which is a pixel values here 00:49:10.640 |
The pixel values come from NumPy; we will see later how we load the image, 00:49:15.760 |
but basically you have to think that you load the image with NumPy, and NumPy gives us a 00:49:20.880 |
Batch of images, which is a channel height and width. It's a tensor with three channels and with the height of the image and the width of the image 00:49:31.840 |
Height and width are the same, because we resize each image to the input size expected by the model. 00:49:38.320 |
So, in the case where we are using the smallest PaliGemma, we will resize each image to 224 by 224. 00:49:47.040 |
We extract the patch embeddings through this convolution, as you can see here. 00:49:51.520 |
So this will basically take our image, which is a batch of images, and convert it into patch embeddings, 00:50:00.400 |
so each image will become a list of embeddings of size embed dim. 00:50:06.420 |
How many patches do we have? Well, the number of patches 00:50:10.400 |
along the height times the number of patches along the width. 00:50:14.720 |
In this case, it will always be the same so you can think of it as a number of patches a total number of patches 00:50:20.720 |
Each of patches with the dimension embedding dimension 00:50:26.900 |
And as we saw before we flatten these ones, so we extract them here. Let me delete it 00:50:38.960 |
So we run the convolution and then we flatten them here 00:50:43.440 |
So basically the convolution will give us 1 2 3 4 5 6 up to 16 or whatever the number of patches is 00:50:49.920 |
and then we convert it into a tensor where the 00:50:55.120 |
So the first patch is here and the last patch is the last element of this tensor and this is what we do here 00:51:00.880 |
Here, because the output of the convolution is a 2D grid, but we don't want a 2D grid: 00:51:07.520 |
we only want one long one-dimensional list of patches, and this is done by this flatten method here. 00:51:13.520 |
Then we transpose because we want the number of patches to come before the embedding dimension 00:51:19.300 |
because as input to the transformer we need to give a sequence of embeddings. 00:51:24.480 |
So that's why we want this num_patches dimension to come before, so that it becomes a batch 00:51:29.600 |
of sequences of embeddings, where each embedding is a vector of size embed dim. Then to 00:51:37.360 |
each of these embeddings we add the positional encodings. Which positional encodings? Well, the ones from the position embedding layer. 00:51:46.140 |
But which embeddings do we want to extract? All of them, so from 0 to 15 in our drawing. 00:51:53.440 |
Where is this information, 0 to 15? It is in this self.position_ids, which is a range. 00:52:00.080 |
So as you remember, arange just generates a list of numbers between 0 and the argument minus 1. 00:52:06.960 |
So we extract all the positional encodings from this position embedding 00:52:12.240 |
Layer, which is this embedding layer here. We add it to the embeddings 00:52:16.880 |
So what we are doing basically is we flatten this embedding 00:52:20.320 |
We did that before then we add a positional encoding vector extracted from the positional encoding layer 00:52:25.600 |
And these positional encodings are learned. Why learned? Because this embedding layer here is a list of learnable embeddings, 00:52:34.800 |
so that when the model is trained these embeddings will change according to the needs of the model. 00:52:42.640 |
So it's not like we are explicitly telling the model "this is position number one, this is position number two": 00:52:48.000 |
we just add another embedding to each patch embedding, 00:52:54.480 |
And then the model will learn to modify this positional embedding vector in such a way that they should encode the position 00:53:01.820 |
Information because each of this position embedding is always added to the same patch 00:53:07.020 |
So the first patch always receives the position number zero the second patch always the position number one 00:53:11.580 |
We hope that the model actually tries to change this position embedding in such a way that they encode the positional information 00:53:17.580 |
And actually it does, because the model actually learns to relate the 00:53:23.580 |
patches with each other by using their positional information. 00:53:27.660 |
And the only way for the model to do that is to change this position embedding in such a way that they encode the position information 00:53:33.840 |
If you remember from the vanilla transformer, we use the sinusoidal functions 00:53:38.300 |
So if you want to look at the original transformer if you remember 00:53:45.740 |
Where is it here? So we create this position encoding using sinusoidal functions 00:53:52.780 |
So instead of learning them we actually pre-compute them and then we force the model to learn the pattern 00:53:58.780 |
Encoded by these sinusoidal functions in this case. We are not forcing the model to learn any pattern 00:54:04.060 |
We want the model to create the pattern that is most useful for the model itself 00:54:08.220 |
So we hope that the model will train this embedding layer in such a way that it creates some pattern that is useful for encoding the positional information. 00:54:24.540 |
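To tie the last few steps together (convolution, flatten, transpose, add learned positions), here is a sketch of the forward method of the embeddings module sketched earlier; the shapes in the comments follow the walkthrough above:

```python
import torch

# Sketch of the forward pass of SiglipVisionEmbeddings (continuing the constructor sketched above).
def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: [batch_size, channels, height, width]
    patch_embeds = self.patch_embedding(pixel_values)   # [batch_size, embed_dim, num_patches_h, num_patches_w]
    embeddings = patch_embeds.flatten(2)                # [batch_size, embed_dim, num_patches]
    embeddings = embeddings.transpose(1, 2)             # [batch_size, num_patches, embed_dim]
    # add the learned positional embedding of each patch (position_ids = [0 .. num_positions - 1])
    embeddings = embeddings + self.position_embedding(self.position_ids)
    return embeddings
```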
Now we skipped before the normalization layer. So let's go actually to 00:54:29.020 |
Understand what is normalization and how it works so that we always don't leave anything behind that is not explained 00:54:36.620 |
All right. Let's talk about normalization. So imagine we have a list of linear layers 00:54:42.460 |
Now a linear layer is defined by two parameters 00:54:46.700 |
One is called the input features and one is called the output features 00:54:50.220 |
Imagine we have input feature is equal to four and output feature is equal to four 00:54:54.300 |
Actually, there is another parameter called bias 00:54:56.860 |
So it indicates if the linear layer also has a bias term and suppose that it's true 00:55:02.540 |
To the input of the linear layer usually we have a batch of items and each item is made up of features 00:55:11.260 |
Suppose that for now as input there is only one item and it's made up of four features 00:55:15.820 |
And as you can see the input features are four 00:55:18.380 |
What will happen with four output features is this: you can think of the linear layer 00:55:24.220 |
as a number of neurons, where the number of neurons is equal to the number of output features of this linear layer. 00:55:41.100 |
How many weights does it have? Well equal to the number of input features that this layer accepts 00:55:49.980 |
What each neuron will do it will do the dot product of the incoming vector 00:55:55.100 |
So the input vector x multiply dot product with the weight vector of this neuron plus the bias term 00:56:05.740 |
And this basically dot product plus this bias will produce one output feature 00:56:10.540 |
Because we have four neurons. We will have four output features 00:56:14.380 |
So each neuron will do the same job, but each neuron will have its own weight vector and its own bias number 00:56:20.540 |
So this one here will have its own weight vector different from the other ones and its own bias term here 00:56:28.860 |
Now imagine instead a linear layer that takes as input four features and produces two output features 00:56:34.140 |
So you can think of it as a linear layer with the two neurons 00:56:38.140 |
where the first neuron has a weight vector made up of four numbers because 00:56:43.740 |
The incoming vector has four features and then one bias term here 00:56:47.740 |
It will produce an output vector of two items 00:56:51.420 |
The first item will be this number here and the second item 00:56:54.860 |
The second dimension will be the dot product of the weight vector of this second neuron with the input vector 00:57:06.460 |
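As a quick sanity check of that picture, here is a tiny sketch (with made-up numbers) showing that each output feature of an nn.Linear is the dot product of the input with one neuron's weight vector plus its bias:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(in_features=4, out_features=2, bias=True)  # 2 "neurons", 4 weights each
x = torch.tensor([1.0, 2.0, 1.5, 0.5])                        # one item with 4 features

out = linear(x)  # shape: [2]

# Each output feature = dot product of x with one neuron's weight vector, plus its bias.
manual_out_0 = x @ linear.weight[0] + linear.bias[0]
manual_out_1 = x @ linear.weight[1] + linear.bias[1]
print(torch.allclose(out, torch.stack([manual_out_0, manual_out_1])))  # True
```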
Now, what is the problem with the linear layers, but actually with all layers in general? 00:57:12.140 |
The problem is called the covariate shift. The problem is that 00:57:18.860 |
if the input vector changes from one batch to another in magnitude, 00:57:24.240 |
then the output of the layer will also change in magnitude a lot, depending on what is the incoming vector 00:57:32.860 |
So for example, imagine this the first input vector is all the numbers are more or less around one and two 00:57:45.980 |
Then if the next vector that is coming to this layer is 00:57:49.660 |
Much different in magnitude from the first one then the output will also be much different in magnitude 00:57:58.220 |
So the problem is that if the input of a layer changes, then the output of this layer will also change a lot 00:58:04.140 |
So if the input changes drastically the output will also change a lot drastically 00:58:10.940 |
And because the loss of a model during training depends on the output, the loss will also change a lot, and the loss 00:58:17.820 |
then determines the gradient during backpropagation 00:58:21.200 |
It means that if the loss changes a lot then also the gradient will change a lot and if the gradient changes a lot 00:58:27.020 |
Then because the gradient determines how we update the weights of the model during training then also the update of these weights will also change a lot 00:58:36.300 |
basically what happens is that if the distribution of the 00:58:41.340 |
dimensions of the vector that is coming to the input of a layer 00:58:45.660 |
changes drastically from one batch to the next, 00:58:49.260 |
Then the output of the model will also change and then the loss will change then the gradient will change then the update of the weights 00:58:55.500 |
Will change so what we will see that the loss will oscillate a lot 00:58:59.020 |
And also the weights will try to keep up with this changing input distribution 00:59:03.840 |
Which basically will result in a model that trains slowly. So here I have made a simple recap 00:59:14.700 |
So a big change in the input of a layer will result in a big change in the output of a layer which will result 00:59:20.540 |
In a big change in the loss of the model, which will result in a big change in the gradient 00:59:25.840 |
during backpropagation, which will result in a big change in the weights of the network 00:59:31.580 |
And the result of this is that the network will learn very slowly, because the network will spend most of its 00:59:37.020 |
effort trying to keep up with this distribution change in the input 00:59:50.300 |
So the the first solution to this problem was batch normalization, which was introduced in this paper 00:59:55.660 |
And with batch normalization what we do basically is that we have usually not a single item as input 01:00:01.740 |
We have a batch of items: suppose that we are training an image classification model 01:00:10.460 |
For example the image of a cat the image of a dog of a zebra of a tree of a stone etc, etc 01:00:16.220 |
So you can think these are the dimensions of the vector that represent the cat 01:00:20.220 |
These are the dimensions of the vector that represent the dog. These are the dimensions of the vector that represent the zebra etc, etc 01:00:25.820 |
So what we do with batch normalization is that we calculate a statistic. 01:00:35.100 |
Which statistic do we calculate? The mean and the variance. And then we 01:00:42.680 |
normalize each item by subtracting the mean and dividing by the standard deviation, so that each dimension is distributed 01:00:54.380 |
according to a Gaussian with mean zero and variance of one 01:01:05.420 |
Because the image of a cat is much different from the image of the zebra 01:01:10.380 |
Because the color distribution is different. The rgb distribution is different. So the pixel intensity is much different from each other 01:01:16.780 |
What will happen is that the model will not see this change in magnitude 01:01:23.100 |
And also will not see a change in distribution because all of these items will be distributed according to a mean of zero and the variance 01:01:31.420 |
So what will happen is that the model will oscillate less in the output. So it will oscillate less in the loss 01:01:44.300 |
So the training will be more stable, and it will converge faster basically this way. So 01:01:54.860 |
Why do we need normalization? Because the input of the model, imagine you are training 01:02:00.860 |
an image classification model, depends on the image, and the images can be much different from each other 01:02:07.580 |
If the image changes a lot, we don't want the model to feel this change in magnitude of the input 01:02:13.500 |
We want the distribution of the inputs to remain constant, let's say, 01:02:17.340 |
so that the model doesn't oscillate, so that this doesn't force the model to just keep up with 01:02:24.560 |
this change in distribution. How do we do that? We try to keep the distributions 01:02:29.520 |
constant, so we always try to have the input features distributed according to a fixed distribution, 01:02:35.100 |
which has mean 0 and variance 1, and we do that with this formula here, which comes from probability and statistics: basically each 01:02:42.060 |
distribution, if you subtract its mean and divide by the standard deviation, will result in a Gaussian distribution of mean 0 and variance 1 01:02:49.980 |
Of course, this is valid also only for Gaussian distributions 01:02:58.220 |
And this will basically result in a more stable training 01:03:02.060 |
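A small sketch of the idea behind batch normalization (the real nn.BatchNorm1d also keeps running statistics and has learnable scale and shift parameters, which are omitted here):

```python
import torch

torch.manual_seed(0)
batch = torch.randn(8, 4) * 50 + 10   # 8 items, 4 features, arbitrary scale and shift

# Batch normalization: mean/variance are computed per feature ACROSS the batch dimension,
# so feature i of every item is mixed with feature i of all the other items in the batch.
mean = batch.mean(dim=0, keepdim=True)                      # [1, 4]
std = batch.std(dim=0, unbiased=False, keepdim=True)        # [1, 4]
normalized = (batch - mean) / (std + 1e-5)

print(normalized.mean(dim=0))                  # ~0 for each feature
print(normalized.std(dim=0, unbiased=False))   # ~1 for each feature
```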
Now, batch normalization actually worked fine. However, it has a problem. The problem is that 01:03:07.580 |
with batch normalization each of these statistics, so the mu and the sigma, are calculated 01:03:13.840 |
Along the batch dimension. So we calculate the mu and the sigma for the dimension number one of each of these vectors 01:03:21.820 |
Along the batch dimension. So basically to calculate this mean we are summing up the first dimension of each of these vectors 01:03:29.420 |
And divided by the number of items that we have 01:03:31.740 |
So we are mixing the features of different items 01:03:35.820 |
So we are mixing the dimension number one of the cat with the dimension number one of the dog 01:03:42.940 |
so basically, to have good results, we need to use a big batch, because 01:03:47.660 |
If we use for example a cat and the dog it will result in one mean 01:03:52.780 |
But imagine in the next batch, we have the cat and the zebra it will result in a completely different mean 01:03:58.620 |
And then the next supposing the next batch we have a cat and the tree maybe it results in another different mean 01:04:04.700 |
So we will still have this problem of covariate shift, because the mean is changing a lot between each iteration 01:04:11.120 |
So the only solution to this actually is to use a very big batch size 01:04:15.340 |
So we are forced to use a big batch size in order to alleviate this problem 01:04:19.660 |
Of kind of mixing the dimensions along the batch dimension 01:04:25.980 |
To solve this, we introduce layer normalization. With layer normalization, 01:04:28.860 |
what we do is, instead of calculating the statistics along the batch dimension, we calculate them along the feature dimension of each item. 01:04:36.220 |
So the mu and the sigma that will be used to standardize the cat will only be 01:04:41.900 |
dependent on the dimensions of the cat, not on whatever the cat comes with 01:04:48.300 |
So we are still doing each item minus its mean divided by the standard deviation, 01:04:55.580 |
but instead of this standard deviation and this mean coming from the first dimension of each item along the batch, they are computed using 01:05:03.180 |
all the dimensions of each item, independently from the others 01:05:07.420 |
So it doesn't matter which other item the cat comes with, it will always result in more or less the same mu and sigma 01:05:17.660 |
And this makes the training even more stable because we are not forced to use a big batch size 01:05:27.120 |
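And the layer normalization counterpart, where the statistics come only from the dimensions of each item itself, so the result does not depend on which other items are in the batch (again a sketch; nn.LayerNorm also has a learnable scale and bias, disabled here for the comparison):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = torch.randn(8, 4) * 50 + 10   # 8 items, 4 features each

# Layer normalization: mean/variance computed per item, along its own feature dimension.
mean = batch.mean(dim=-1, keepdim=True)                     # [8, 1]
var = batch.var(dim=-1, unbiased=False, keepdim=True)       # [8, 1]
manual = (batch - mean) / torch.sqrt(var + 1e-5)

layer_norm = nn.LayerNorm(4, elementwise_affine=False)      # same formula, no learnable params
print(torch.allclose(manual, layer_norm(batch), atol=1e-5))  # True
```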
Okay, we have seen what is normalization, now we should implement this thing called the encoder, so this SigLIP encoder 01:05:36.700 |
Now the encoder is made up of multiple layers of the transformer model 01:05:41.980 |
And the architecture more or less if you look at the vision transformer paper, it is like this 01:05:47.580 |
So I changed it a little bit because I wanted to use the exact names that we will be using 01:05:53.660 |
So what we have so far is this thing called the SigLIP vision embeddings, which is 01:06:00.540 |
taking some patches of this image using a convolution; each 01:06:05.740 |
output of this convolution is used as an embedding, it's a vector 01:06:10.380 |
And this embedding vector is added to another 01:06:14.300 |
Vector called the positional encoding which is learned and then we feed this stuff to this thing called the encoder 01:06:21.260 |
So we convert it into embeddings at the positional encoding then we feed it to the encoder 01:06:25.340 |
And at the input of the encoder you need to think that we have 01:06:28.620 |
These layers repeated n times here. It's written l times 01:06:33.340 |
One after another such that the output of one becomes the input of the next layer 01:06:38.780 |
the thing that you need to understand about the transformer is 01:06:42.460 |
I repeat it is that the transformer is a sequence-to-sequence model that converts a sequence of embeddings into contextualized embeddings 01:06:51.280 |
What does it mean? It means that at the input you have a list of 01:06:54.560 |
Here embeddings each representing a patch of the image as an independent patch 01:07:01.520 |
So this embedding here only captures information about the first group of pixels 01:07:06.000 |
This embedding here captures all information about the second group of pixels, etc, etc, etc 01:07:13.760 |
Through the attention mechanism, these embeddings become contextualized at the output of the transformer, and we will see this in detail, 01:07:23.600 |
such that this embedding here at the output of the transformer, the first embedding, 01:07:28.240 |
includes information not only about the first patch but also about the other patches 01:07:36.080 |
And so is the second the third the fourth and the last one 01:07:40.320 |
So they become contextualized in the sense that they capture information about the context in which they appear 01:07:46.400 |
Which is different from language models in which each token captures information about the previous tokens in the case of the vision transformer 01:07:54.560 |
Each patch includes information about all the other patches 01:08:01.440 |
Now let's see what each layer of this encoder is made up of. So we have, let's say, the input of the encoder 01:08:07.360 |
And we will have the first layer of this encoder 01:08:10.480 |
The first thing that we do is we apply a layer normalization and we saw how it works and why we use it 01:08:15.840 |
First of all, the input of this layer normalization 01:08:18.800 |
is saved for a skip connection that we use later 01:08:23.680 |
Then the output of this layer normalization is sent to the self-attention mechanism 01:08:28.260 |
It's this one here and this self-attention mechanism takes the output of the layer normalization as a query key and values 01:08:37.520 |
It calculates the attention just like the usual formula 01:08:40.000 |
So softmax of the query multiplied by the transpose of the keys, divided by the square root of d_k (the head dimension), multiplied by V, etc. etc. 01:08:46.000 |
The output of this self-attention is then summed up with this skip connection here 01:08:51.920 |
Then the output of this summation is sent to this layer normalization along with the skip connection that is used later 01:08:58.480 |
Then the output of the normalization is sent to this multi-layer perceptron, which is a list of linear layers 01:09:03.840 |
We will see later and then we do another summation here with the skip connection plus the output of the multi-layer perceptron 01:09:10.180 |
And then we do another layer like this, and another and another, and the output of the last layer is the output of our vision encoder 01:09:20.380 |
So, to recap: the vision transformer takes as input an image converted into patches. The patches are then fed to this 01:09:28.160 |
encoder, which is a list of layers, and the output is a contextualized sequence of patch embeddings 01:09:33.860 |
So let's code this encoder, which is basically this structure here 01:09:39.120 |
And we will code each part of this structure and while coding each part we will go inside on how it works 01:09:46.880 |
So the normalization we already know how it works, but we still have to explore what is this stuff here called the self-attention 01:09:52.580 |
What is this stuff here called multi-layer perceptron? 01:09:56.240 |
I believe it's convenient for us to go first through multi-layer perceptron and then we go to the self-attention 01:10:02.080 |
I think because the self-attention is a little longer to do. So let me do the simple part first 01:10:17.520 |
So the encoder layer: the constructor takes the configuration, 01:10:22.240 |
we save some stuff, which is the hidden size, and then we have a block called the self-attention block. 01:10:31.200 |
A note about the naming I'm using: I am using the same names as the HuggingFace implementation, 01:10:38.560 |
for one simple reason, which is that I want to be able to load the pre-trained weights from HuggingFace 01:10:44.240 |
So the pre-trained weights for PaliGemma are available on the HuggingFace hub, 01:10:51.680 |
but each of these pre-trained models has this dictionary of weights, 01:10:57.040 |
where the dictionary tells you where to load each of these weights 01:11:01.520 |
And if the names do not match you need to create some conversion script 01:11:04.720 |
So I didn't want to do that, and also it would just complicate the code uselessly. This way we can 01:11:12.240 |
load basically the pre-trained weights from HuggingFace 01:11:17.440 |
Also because my code is based on the HuggingFace implementation 01:11:20.480 |
So to create my code I use the HuggingFace implementation, but simplified a lot a lot a lot 01:11:25.680 |
For example, I remade my own KVCache. I did a lot of 01:11:29.040 |
Modifications to simplify it but it's based on the HuggingFace implementation 01:11:36.080 |
So we have this thing called the self-attention then we have a layer normalization. So we saw it's 01:11:40.400 |
Where is it? And we have this layer normalization here 01:11:43.360 |
Then we have this multi-layer perceptron, which is this stuff here. And then we have another layer normalization, which is this stuff here 01:11:49.920 |
So we have two layer normalization. So now let's implement the forward method 01:11:54.480 |
And the forward method I will copy it line by line so we can understand 01:11:58.960 |
Okay this forward method. Now. The first thing we do is we save a residual connection, which is 01:12:05.680 |
We basically save the input that we feed to this 01:12:09.260 |
Encoder because we need to reuse it later. So we are saving this skip connection because we will need to use it here later 01:12:14.860 |
Then we run it through the layer normalization the input 01:12:19.500 |
And it's done here. So the layer normalization does not change the shape of the input 01:12:25.020 |
It's just normalizing each of these dimensions such that they all come out 01:12:30.700 |
as if they came from a Gaussian of mean zero and variance of one 01:12:36.860 |
Then we apply this magic thing that we will explore later called the self-attention, and the self-attention returns a 01:12:44.700 |
tensor, but as we saw before, the attention mechanism is something that takes as input 01:12:50.140 |
Embeddings and gives you contextualized embeddings. So it does not change the shape of these embeddings 01:12:55.600 |
But we will implement it later. So for now just think of it as a black box that you feed in 01:13:00.700 |
Embeddings and it gives you contextualized embeddings 01:13:03.980 |
Then we have a residual connection and we can see that here. So this residual connection 01:13:14.060 |
So we are taking what we saved before with the output of the self-attention 01:13:18.300 |
So what we saved before is this residual stuff here plus the output of the self-attention, which is this hidden states here 01:13:23.740 |
This the result of the summation is saved again because there is another skip connection 01:13:31.580 |
I don't know why my alt tab is not working. So 01:13:36.380 |
This stuff here. So we save it because later we need to use it here for the skip connection 01:13:40.860 |
Then we do another layer normalization, which also does not change the shape of the input 01:13:52.060 |
And then we have this thing called the multilayer perceptron. Now the multilayer perceptron is something that 01:13:57.820 |
It's not easy to explain what it is used for, but basically 01:14:01.100 |
the multilayer perceptron, as we will see later, is a series of linear layers that 01:14:13.500 |
transforms each token independently from the others 01:14:17.820 |
So while in the self-attention there is kind of a mixing of the patches incoming so that you get contextualized 01:14:24.380 |
In the multilayer perceptron, there is no mixing between these let's call them tokens or patches 01:14:32.560 |
And the multilayer perceptron, first of all, adds parameters to the model, so the model has more 01:14:40.060 |
degrees of freedom to learn whatever it's trying to learn 01:14:46.380 |
Another objective of the multilayer perceptron is that it allows to prepare, 01:14:50.220 |
let's say, the sequence of patches for the next layer. So if the next layer expects these patches to be somehow 01:14:57.980 |
different, the multilayer perceptron allows to transform them 01:15:02.300 |
Also, it adds a non-linearity. So the multilayer perceptron also includes a non-linearity which adds 01:15:08.060 |
Which basically allow as you know non-linearities allow you to model more complex transformations 01:15:15.900 |
So if you just create a list of linear layers without any non-linearities, you cannot model complex functions, you cannot, for example, 01:15:24.300 |
map non-linearly separable data; but by adding 01:15:29.900 |
non-linear transformations you add complexity to the model, so the model is able to model complex transformations 01:15:38.400 |
So the multilayer perceptron just adds parameters and this non-linearity, which is helpful 01:15:45.420 |
to allow the model to learn whatever complexity it needs 01:15:52.620 |
After the multilayer perceptron, I guess we have a 01:15:57.740 |
Yeah, we have another skip connection and then we return the output of this skip connection here 01:16:04.140 |
and also the skip connection does not change the shape of the tensor 01:16:10.880 |
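Putting the walkthrough above together, this is roughly what the encoder layer's forward pass looks like. This is a condensed sketch using the HuggingFace-style sub-module names mentioned earlier (self_attn, layer_norm1, mlp, layer_norm2); the real method also handles the attention weights that self_attn returns.

```python
# Sketch of one encoder layer: pre-norm, attention, residual, pre-norm, MLP, residual.
# Shapes never change: [batch_size, num_patches, embed_dim] in and out.
def forward(self, hidden_states):
    residual = hidden_states                          # save for the first skip connection
    hidden_states = self.layer_norm1(hidden_states)
    hidden_states, _ = self.self_attn(hidden_states)  # contextualizes the patches
    hidden_states = residual + hidden_states          # first skip connection

    residual = hidden_states                          # save for the second skip connection
    hidden_states = self.layer_norm2(hidden_states)
    hidden_states = self.mlp(hidden_states)           # per-patch transformation, no mixing
    hidden_states = residual + hidden_states          # second skip connection
    return hidden_states
```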
Now, let's code first this multilayer perceptron. It's the easiest stuff to do 01:16:18.300 |
Let's go here. I I will also always copy first the 01:16:21.980 |
Constructor and then the forward method so we can explore a little bit the structure and then we explore the logic 01:16:27.660 |
So this multilayer perceptron just like in the vanilla transformer is made up of two layers 01:16:36.780 |
So the first layer takes each of the embeddings which are we we can also call them tokens or patches 01:16:43.820 |
Because most of the time we are dealing with language models and expands them 01:16:47.980 |
So each of these vectors which is of size hidden size is expanded into this thing called intermediate size 01:16:55.180 |
Usually it's chosen as three times the hidden size or four times the hidden size 01:17:00.380 |
I remember in the vanilla transformer it was four times the hidden size 01:17:03.260 |
Then we apply a non-linearity to this expanded tensor and then we compress it back to the hidden size dimension 01:17:17.420 |
So the first thing we do is we convert each of these embedded dimensions into intermediate sizes 01:17:26.060 |
Each image is made up of num_patches number of patches each of this patch is represented by a vector of size embedding dimension 01:17:33.420 |
With the first fully connected layer, we are expanding each of these patches into the intermediate size and then we apply 01:17:42.460 |
A non-linear transformation in this case. It's the gelu function now 01:17:46.380 |
You may be wondering why we are using the GELU function, or the SwiGLU function, or whatever non-linearity there is 01:17:58.540 |
There is no rule of thumb for choosing the non-linearity to use for a specific case 01:18:07.820 |
The heuristic is that initially, when the transformer was introduced, it used the ReLU function as non-linearity, 01:18:16.540 |
but then people explored other non-linearities and they saw that they work better 01:18:21.500 |
Now non-linearity is actually there is also some logic behind the choice of a non-linearity 01:18:25.980 |
So because the non-linearity define also the flow of the gradient 01:18:29.820 |
So for example, take the ReLU function; if you look at the graph of the ReLU function, let me draw it actually 01:18:36.940 |
The graph of the ReLU function is something like this. So 01:18:43.020 |
basically anything that is negative is zero. Let me use another color 01:18:49.100 |
Anything that is negative becomes zero, basically, and everything else is forwarded without any scaling 01:18:56.880 |
So this means that if the input of the ReLU function is negative the output will be zero, and actually for any 01:19:06.220 |
negative input there will be no gradient, because the gradient will be multiplied by zero. So it will not flow 01:19:10.860 |
That's why for example, we introduced the leaky relu and other like 01:19:18.060 |
Functions that allow also a little bit of gradient flow from the negative side 01:19:27.020 |
So the non-linearity defines how the gradient will flow during backpropagation. So having a non-linearity 01:19:27.020 |
that allows the gradient to flow back even when the input is negative 01:19:37.980 |
means that the model is not forced to always have positive activations in order to get some 01:19:46.860 |
feedback from the loss function to optimize its weights 01:19:49.900 |
And why are we using the GELU here? Because people have tried it and it probably works better 01:19:56.780 |
compared to the ReLU function for the same class of 01:20:00.140 |
applications. So in the vision transformer you see the GELU function, but 01:20:05.020 |
in LLaMA, for example, they use the SwiGLU function; in other scenarios 01:20:08.300 |
They use other functions and it's mostly based on heuristics on how they work in practice 01:20:13.980 |
also, because a model is usually made up of billions and billions 01:20:19.180 |
of parameters, it's not easy to find a regularity to understand why a 01:20:24.860 |
specific non-linearity works better than another one 01:20:30.380 |
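As a tiny illustration of the difference discussed above: ReLU hard-zeroes negative inputs, while GELU lets a small, smooth amount of signal (and therefore gradient) through:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 1.0, 3.0])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 3.0000])
print(F.gelu(x))  # roughly tensor([-0.0040, -0.1587, -0.0460, 0.0000, 0.8413, 2.9960])
```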
Now, okay, then we apply the second linear layer 01:20:33.980 |
Which is basically recompressing back this intermediate state into the embedding size and then we return it 01:20:47.340 |
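In code, that feed-forward block (expand, non-linearity, compress back) looks roughly like this; the sizes here are illustrative, and the real class reads hidden_size and intermediate_size from the configuration:

```python
import torch.nn as nn

# Sketch of the MLP block: each patch/token is transformed independently of the others.
class SiglipMLP(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)   # expand
        self.fc2 = nn.Linear(intermediate_size, hidden_size)   # compress back

    def forward(self, hidden_states):                          # [B, num_patches, hidden_size]
        hidden_states = self.fc1(hidden_states)                # [B, num_patches, intermediate_size]
        hidden_states = nn.functional.gelu(hidden_states, approximate="tanh")
        return self.fc2(hidden_states)                         # [B, num_patches, hidden_size]
```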
Next, we are going to code the attention mechanism for the vision transformer, and we will see that it's 01:20:53.340 |
different from the one of language models, because we don't have any causal mask or attention mask 01:20:59.980 |
All right guys, so we have seen the multilayer perceptron now 01:21:04.460 |
Let's go to the multi-head attention and for that 01:21:07.180 |
I want to use the slides because I believe it's a little faster to explain on the slides and then we proceed with the code 01:21:13.420 |
So what is the multi-head attention? The multi-head attention is a way of contextualizing stuff 01:21:19.420 |
Which means that you start with a sequence of for example patches and you can think we have for example 01:21:26.140 |
Four patches each of this patch is represented by a single vector of 1024 dimensions 01:21:32.620 |
So you need to think of this as a vector of 1024 dimensions, so you need to think there are 1024 numbers here 01:21:40.700 |
Then we have the patch number two the patch number three and the patch number four 01:21:44.700 |
Each of this patch was extracted from a group of pixels from the initial image and it's only representing information about the patch 01:21:51.980 |
It was extracted from so the part of the image it came from 01:21:56.300 |
With the multi-head attention mechanism, what we are doing is contextualizing these patches 01:22:03.820 |
Which means that the output of the multi-head attention is a tensor of the same size 01:22:08.300 |
As the input so this is a tensor of size 4 by 1024 01:22:12.480 |
the output will be a tensor of size 4 by 1024, but where each of these 01:22:19.260 |
Embeddings now does not capture information only about itself, but also about the other patches 01:22:27.820 |
This is for vision transformer for the language models we want something slightly different 01:22:34.220 |
So for language models, we do have an input sequence, which is a sequence of tokens each token representing one single 01:22:41.020 |
I don't want to use the term word because it's wrong but 01:22:44.780 |
In my videos, I always make the simplification that each token is a word and each word is a token 01:22:49.740 |
But this is not actually the case in tokenizers. Usually a token can be just any sequence of characters; 01:22:56.320 |
it does not necessarily need to be a word 01:23:01.660 |
But for us, let's treat them as words. It just simplifies the explanation 01:23:07.340 |
We have a list of tokens. Each token is represented as an embedding. Let's say of 1024 dimensions 01:23:17.400 |
1024 numbers for this one 1024 numbers for this one, etc, etc 01:23:21.720 |
The multi-head attention in the case of language models 01:23:25.480 |
What we want is we want to contextualize each token with the all the tokens that come before it 01:23:31.640 |
So the output of the multi-head attention in the case of language models 01:23:35.560 |
And this is this would be known as the self-attention mechanism with causal mask 01:23:43.160 |
Is a sequence with the same shape as the input sequence 01:23:47.320 |
So this vector this matrix here is a 4 by 1024. So the output will be 4 by 1024 01:23:53.180 |
And each of these tokens is not capturing information only about itself 01:24:00.120 |
But also about all the past tokens now the word I does not have any past token 01:24:04.920 |
So it will only capture information about itself 01:24:07.720 |
But the word love will capture information also about the token I because it comes before it and the word 01:24:13.160 |
Pepperoni will capture information about I and love because they come before it etc, etc until the last token which capture information about all the sentence 01:24:21.080 |
Why do we want to do this in language models? 01:24:25.160 |
Let me give you a little understanding of why we do it in this way with language models and why the transformer is made this way 01:24:35.480 |
This is going a little off topic with respect to the vision transformer 01:24:38.600 |
But I think if you understand this then you will understand the big part of the transformer and why it even exists 01:24:48.600 |
Now, with language models, you need to think that a language model is 01:24:53.640 |
something that we train on what is known as the next token prediction task 01:24:59.480 |
Which means that given a prompt the language model try to understand what is the next token that completes this prompt 01:25:05.560 |
How do we generate text with the language model? We start with some tokens, which are the prompt we generate the next token 01:25:11.480 |
We put it back into the prompt and we ask again the language model 01:25:14.120 |
What is the next token the language model gives us the next token? 01:25:16.680 |
Then we put it back into the prompt and then we ask again. What is the next token etc, etc 01:25:20.280 |
So we need to train a language model to train a language model 01:25:24.600 |
We need to train a model to predict the next token given the past tokens 01:25:29.320 |
And the transformer allow us to do that in parallel when training 01:25:35.000 |
Which means that we start with an input that is a series of embeddings 01:25:39.340 |
Which are uncontextualized so we start with this one and each of these actually is one single token. So this is only I this is only love 01:25:54.600 |
The output of the transformer, of the self-attention mechanism, will be a series of embeddings that are 01:26:04.420 |
contextualized in such a way that each token captures information not only about itself, but also about all the past tokens 01:26:11.240 |
How do we train and the transformer can do it in parallel? 01:26:14.840 |
So the self-attention mechanism will take this as input and generate this output in parallel 01:26:19.800 |
So it's not will generate one token at a time, but it will generate all of them in the in parallel using this multi-head attention 01:26:31.340 |
As we saw before the language model is something that given a prompt needs to predict the output. So what we want is that 01:26:43.020 |
This sentence here. We feed it to the transformer the transformer will transform it into a sequence of embeddings 01:26:49.340 |
Contextualized embedding and then we need some labels to train this language model 01:26:54.060 |
So what will the labels be? Well, we want that whenever the 01:27:12.620 |
language model sees the sequence "I love" it should predict the word "pepperoni" 01:27:16.640 |
Whenever it sees the sequence "I love pepperoni" it should predict "pizza" 01:27:26.620 |
Whenever it sees the sequence I love pepperoni pizza 01:27:29.820 |
It should predict the token end of sentence, which is a special token telling hey, I'm done with the generation 01:27:36.000 |
Because the transformer can generate all of these contextualized embeddings in parallel 01:27:41.820 |
we can also calculate the loss for each of these predictions in parallel and 01:27:46.300 |
with backpropagation update the weights of the model to teach it, in parallel, 01:27:53.020 |
how it should predict each of these tokens given 01:27:56.780 |
the previous tokens. So when we are given a sentence and we train the language model, the language model can learn, 01:28:05.820 |
with only one forward pass, how to predict the next token inside of this sentence given the previous tokens as context 01:28:13.180 |
In only one single pass of the transformer. That's why the transformer is so powerful because this contextualization happens in parallel 01:28:19.900 |
So we can calculate the output in parallel for each position 01:28:22.540 |
And because we know already know what is the label because the label is just the next token given the previous tokens 01:28:28.220 |
we can calculate the loss in parallel for each positions and the model will learn in parallel how to 01:28:33.180 |
Generate exactly this sentence in in one pass only 01:28:37.820 |
so the model will not learn to generate one token at a time given the previous but 01:28:43.100 |
All the sentence in one pass and that's why it's so powerful 01:28:49.740 |
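A sketch of how this parallel training objective is typically computed: one forward pass gives logits for every position, and the label for position t is simply the token at position t+1. Shapes and sizes below are made up for illustration.

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 5, 100
logits = torch.randn(batch_size, seq_len, vocab_size)      # one forward pass: a prediction per position
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Shift: position t must predict token t+1, so drop the last logit and the first label.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)   # [B*(T-1), vocab]
shift_labels = input_ids[:, 1:].reshape(-1)                # [B*(T-1)]

loss = F.cross_entropy(shift_logits, shift_labels)         # all positions trained in parallel
print(loss)
```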
Okay, so we have seen what is the difference between the vision transformer and the language model 01:28:54.220 |
So in the vision transformer, we want to contextualize tokens or patches 01:28:57.980 |
In such a way that they capture information about all the other patches 01:29:02.220 |
But in the language model, we want each token to only capture information about itself and the previous tokens 01:29:10.300 |
We start with of course an input sequence. Our goal is to create an output sequence that is contextualized 01:29:16.380 |
And there are many intermediate steps. So now we will see what are these intermediate steps one at a time 01:29:23.340 |
Let's start by creating the class of this this attention mechanism and we will create it. Let's create it here 01:29:29.900 |
Okay, so in the input we have the configuration of the model we save some stuff that we will need later 01:29:37.580 |
So the hidden size the number of attention heads because we are dealing with multi-head attention 01:29:43.660 |
Head dimension we will see later what is it and why it's used 01:29:47.020 |
The scale: if you remember, the formula for the attention is 01:29:51.260 |
the queries multiplied by the transpose of the keys, divided by the square root of d_k, the head dimension 01:29:57.340 |
And this scale is one over the square root of the head dimension, 01:29:59.740 |
so the stuff that we need to divide the query multiplied by the keys with 01:30:05.100 |
Then we have this dropout which is zero. I never saw it used in 01:30:10.780 |
In PaliGemma, but I believe there are other SigLIP models that use it, so they put it here 01:30:15.580 |
But it you can think of it like non-existent for now 01:30:19.180 |
and then we have these three linear layers called Wk, Wq and Wv, which are 01:30:25.580 |
Parameter matrices that are also present in the vanilla transformer 01:30:31.260 |
And then we have this output projection which in the paper of the transformer is called the wo matrix and we will see later 01:30:39.580 |
Let's start by implementing the forward. So the forward method is this one 01:30:46.060 |
Well, the input of the forward method of this attention mechanism is basically what 01:30:50.540 |
Is the output of the layer normalization in this encoder layer class 01:30:55.580 |
So the output of the layer normalization is fed to this self-attention mechanism 01:31:00.000 |
So it is something of this shape: batch size by num_patches by embedding dimension 01:31:08.380 |
What does it mean? It means that we have a batch of images, each made up of num_patches patches, 01:31:18.780 |
and each of these patches is represented by a vector of size embed_dim 01:31:24.700 |
You can think of it as a vector of 1024 dimensions. I don't remember the exact number of dimensions right now 01:31:30.940 |
You can also think of this num_patches as a sequence length 01:31:35.740 |
So before we saw that a language model is made up of a sequence of tokens; here you can think of it as a sequence of 01:31:40.940 |
patches, where the sequence length is this num_patches here 01:31:45.020 |
The first thing that we do in the self-attention mechanism is we take the input and we run it through three 01:31:52.060 |
Transformations one is called wq one is called wk and one is called wv and after we run it through these 01:31:58.140 |
Transformations the output will become query key and values 01:32:07.900 |
So we take the input sequence, which is this hidden states and we run it through wq here. It's called the qproj 01:32:14.620 |
Wk here is called the kproj w here is called vproj 01:32:19.020 |
The shape of the tensor does not change. Basically. These are parameter matrices 01:32:24.960 |
So they just add parameters to our self-attention that transform the input sequence so that they become query key and value 01:32:33.100 |
So it's the query key and value is just a transformation of the input sequence. However 01:32:37.740 |
In this case each token still is independent from the other 01:32:42.140 |
So there has been no contextualization happening with the linear layers. So linear layers always treat each token 01:32:47.500 |
Independently from the others just like the multi-layer perceptron each token in the multi-layer perceptron is expanded and then reduced 01:32:54.300 |
Here, it's not even not expanded nor reduced. It's just transformed because the size is from embedding dimension to embedding dimension 01:33:01.980 |
So it's just a transformation of the single token 01:33:04.780 |
Why we want to do it? Because the self-attention mechanism needs to see the same sequence in three different ways as query key and value 01:33:14.620 |
Later, we will see why they are called query key and values 01:33:17.820 |
The second thing we do is basically we split this each of these tokens into smaller tokens 01:33:28.540 |
How many smaller tokens based on how many heads we have and now we see why so let me do something strange 01:33:35.420 |
Which is i'm not copying the entire line. I'm copying a part of it 01:33:42.140 |
Which is a tensor of batch size numpatches embedding dimension and we are splitting the embeddim dimension into smaller 01:33:51.100 |
Called head dimension. How many of this head dimension we have? We have numheads 01:33:56.560 |
Okay, let me copy it all otherwise, I think it's going to be confusing. Sorry 01:34:02.080 |
We also have this transposition later. We will see how it works. We will visualize the tensor operations 01:34:09.040 |
We do it for the query the key and value, let's do it and then we see what is it about 01:34:24.000 |
So at the input of this vision transformer, we have a sequence of patches; you can think of it as a 01:34:37.600 |
sequence of tokens in case we are working with a language model, and each token is represented by a 1024-dimensional vector 01:34:44.720 |
The first thing that we do is we convert this input sequence, 01:34:48.640 |
which we will call x, into query, key and value, and we do it through three transformations, called Wq, Wk and Wv 01:34:59.380 |
Now if you look at the shape of the input sequence here, it's 4 by 1024 01:35:04.820 |
So here you can see the input sequence is 4 by 1024 01:35:08.260 |
Where 4 is representing the sequence dimension 01:35:12.320 |
So how many tokens or how many patches you have and the hidden size represents how many what is the size of this embedding vector? 01:35:19.760 |
We multiply it each of these with wq wk and wv 01:35:25.040 |
Now if you look at the dimensions here wq wk wv they are 01:35:29.360 |
The size is embedding dimension to embedding dimension. However here I have represented it as 01:35:35.040 |
embedding dimension to 8 multiplied by 128 so 01:35:40.800 |
The overall size is the same. So it's 1024 by 1024 01:35:44.340 |
However, i'm splitting this second 1024 into eight groups and later we will see why 01:35:54.640 |
So we are doing a matrix multiplication between this tensor here, 4 by 01:36:02.560 |
1024, and this other tensor, which is also 1024 by 1024 01:36:08.880 |
However in which the second dimension is split into sub 01:36:12.080 |
Groups, how many eight groups because eight is the number of heads we are going to work with 01:36:23.760 |
It will result in this output here. So basically it's a 4 by 01:36:27.680 |
1024 multiplied by a 1024 by 8 by 128; this inner 1024 dimension here cancels out, as you can see, 01:36:34.480 |
and then we have the second dimension that remains, so in the matrix multiplication the inner dimensions cancel out and the outer dimensions remain 01:36:44.880 |
If you are confused by this, you can think of it like this: it's still like a 1024 by 01:36:53.360 |
1024, nothing has changed. I'm just grouping the dimensions, so that's why it's possible 01:37:01.920 |
But it this grouping is helpful. And now we will see why 01:37:05.840 |
Let's visualize this tensor operation at the max matrix level 01:37:10.000 |
So when we do query this x multiplied by wq we have nx which is a 4 by 1024 01:37:22.480 |
And we are multiplying by a very big matrix, which is 1024 by 8 by 128. How to visualize this matrix? 01:37:30.080 |
Well, this is a wq. So it's a parameter matrix 01:37:33.360 |
It's also wq and wv. So they all have the same dimensions 01:37:37.780 |
You can visualize this like this. You can think of it as a matrix made up of 01:37:49.600 |
How many smaller vectors? 8 of them and each of these smaller vectors is made up of 128 dimensions 01:37:58.180 |
The overall size of this matrix is still 1024 by 1024 01:38:02.660 |
But each of these let's say these vectors are split into 8 groups 01:38:08.020 |
So that the output is also a matrix in which each of the 01:38:14.740 |
tokens is split into multiple subgroups. So it's a matrix that has 4 rows, 01:38:21.060 |
so as you can see, 4 is the number of rows, 01:38:24.900 |
and each row contains 8 groups of smaller embeddings, and each of these smaller embeddings is made up of 128 dimensions 01:38:36.260 |
With multi-head attention, basically, what do we want to do? 01:38:40.500 |
The multi-head attention is a way to relate tokens with each other 01:38:44.980 |
We don't want to relate tokens to each other by watching the full embedding of each token; we want to split this work among multiple heads, 01:38:56.420 |
such that each head works with a smaller part of the embedding of each token 01:39:02.020 |
So the head number 1 will only watch the first 128 dimensions of each token in the entire sequence 01:39:11.300 |
The head number 2 will watch the next group of 128 dimensions of each token, and so on 01:39:23.060 |
So this head will learn to relate all these tokens by only watching this part of the embedding of this each token 01:39:28.340 |
This head will learn to relate tokens by only watching this part of the embedding of each token 01:39:34.020 |
And this last head will learn to relate tokens by only watching the 01:39:34.020 |
last 128 dimensions of the embedding of each token. Why? 01:39:49.620 |
Because in many languages a word may have different meanings depending on the context in which it appears 01:39:56.180 |
If we don't have multi-head attention, because the multi-head attention, we will see it later, is based on what is known as the dot product between 01:40:11.300 |
tokens, then there is only one way of calculating the dot product between two tokens, 01:40:16.180 |
which is the full embedding of the first token with the full embedding of the second 01:40:21.060 |
So there is only one way of relating two tokens with each other 01:40:28.420 |
But by splitting each embedding into groups, each dedicated to one head (so this is head 1, head 2 and head 8, and all the intermediate heads are here), 01:40:36.820 |
each head will learn to relate tokens to each other differently, because each head is watching different parts of the embedding of each token 01:40:44.020 |
And this is useful for language modeling, for example, because in language modeling 01:40:51.380 |
Each word may have different meaning depending on the context in which appears 01:40:55.460 |
So it may be a noun in some context. It may be a verb in some other context or an adverb in some other context, etc 01:41:02.980 |
So we hope that this head here, for example learns to relate this token as a verb 01:41:08.020 |
This head here will learn to relate this token as a noun and this head here 01:41:12.820 |
Maybe will learn to relate this token as an adverb or some other property that this token has 01:41:17.700 |
And this multi-head attention also has another advantage 01:41:21.320 |
Because the multi-head attention is based on dot products between tokens 01:41:24.980 |
This head here will do the dot product of this first 128 dimensions of this token with the first 128 dimensions of this token 01:41:33.140 |
And this head because it watches this part of the token embedding and this other head watches this part of the 01:41:40.340 |
Embedding they can work independently from each other 01:41:44.020 |
And so because they can work independently from each other this computation can be parallelized 01:41:48.920 |
That's why in the attention is all you need paper when they talk about the multi-head attention. They make this 01:41:58.260 |
Drawing with multiple drawings behind you can see here with the head dimension appearing here, which means that each of this head 01:42:05.380 |
Is computing this scale dot product attention in parallel 01:42:10.120 |
With the other heads because each of them is working with a different part of the embedding of each token 01:42:15.860 |
So they can work independently from each other 01:42:17.860 |
And this is what we are doing here. So we group 01:42:22.100 |
the embedding of each token into multiple subgroups, 01:42:27.560 |
each dedicated to one head, because we want this multi-head attention to happen in parallel 01:42:33.500 |
Because each head is working with a different part of the embedding of each token, 01:42:40.600 |
the computation is much faster, because we can compute all this stuff in parallel 01:42:48.440 |
So we have taken our input sequence now here for the drawing. I have chosen a 4 by 1024 01:42:56.840 |
Depending on how many patches we have so numPatches by embedDimension 01:43:00.860 |
We have multiplied each of them by the Q K and V 01:43:05.000 |
And then we split them here as you can see in the 01:43:09.240 |
In multiple heads, so we add this head dimension here in my slide 01:43:15.560 |
I just pretend I am multiplying directly with a 01:43:19.080 |
Parameter matrix that is already split into multiple heads 01:43:23.240 |
Why am I doing differently here than compared to the code because we will be it will be useful for this 01:43:29.240 |
Visualizing it this way is will be useful for when we will be 01:43:32.920 |
Talking about the language model and especially we will be talking about grouped query attention 01:43:36.920 |
Because with grouped query attention, we will see that the number of heads for the query 01:43:40.600 |
Is much bigger than the number of heads for the keys and the values 01:43:45.240 |
So here in the vision transformer the number of heads of the query key and values is the same 01:43:49.560 |
So we don't use the grouped query attention and that's why 01:43:52.680 |
We use the same number of heads for the query key and values 01:43:55.480 |
Then we do this transposition and now we see what is this transposition 01:43:59.720 |
So when you do this multiplication here, so you multiply the input by the Q projection, it will return the same shape 01:44:08.600 |
When you do this view, it will just split this last dimension, so this embed_dim, into smaller parts 01:44:25.340 |
This dimension into these two smaller dimensions. So numHeads by headDimension 01:44:31.180 |
So basically, what is this headDimension? headDimension is the embedding full embedding divided by the number of heads 01:44:43.240 |
Then this will be 128 because it's 1024 divided by 8 01:44:53.080 |
Because we are not reducing the number of parameters or we are not throwing away anything 01:44:57.800 |
We are just grouping differently each of these embeddings 01:45:00.940 |
With this transpose here, we are exchanging the position of two 01:45:07.560 |
dimensions: the dimension number one and the dimension number two, which are num_patches and num_heads 01:45:14.780 |
So basically instead of num_patches and num_heads we get num_heads and num_patches 01:45:20.040 |
So this will be the output of all this expression. So it will be a tensor of this 01:45:25.880 |
Of this shape batchSize numHeads numPatches headDim. Why are we doing this transposition? Let's see 01:45:38.040 |
When we multiply by this wqwk and wv which is already includes the grouping. We are grouping each of these 01:45:44.360 |
Vectors into sub groups each dedicated to one head 01:45:49.880 |
Now what we have here is a sequence of tokens 01:45:53.080 |
Each token is made up of eight group of embeddings. Each group of embedding is made up of 128 dimensions 01:46:05.560 |
Multi head attention in parallel, which means that each head should be able to visualize 01:46:11.500 |
The entire sequence but a smaller part of the embedding of each token 01:46:17.800 |
We need to transpose these two dimensions. So we exchange the sequence dimension with the head dimension 01:46:30.120 |
Let's do it. So we have this sequence of tokens each token is 01:46:35.560 |
Divided into eight groups. Each group is made up of 128 dimensions. We want to convert it 01:46:43.320 |
into multiple sequences, each made up of only the part of the embedding dedicated to one head 01:46:49.480 |
So when you do the transposition of these two dimensions here 01:46:56.620 |
How can you visualize this matrix? You can visualize it as follows. It's a big matrix that contains eight smaller matrices 01:47:05.160 |
each smaller matrices contains four tokens and each token contains 01:47:10.180 |
128 dimensions, which is exactly the dimensions 01:47:13.720 |
that are dedicated to each of these heads. So you can think of it as eight sequences of four 01:47:24.760 |
tokens, and each token contains only the part of the embedding dedicated to the head it 01:47:33.880 |
belongs to. So this sequence here will only contain the first 128 dimensions of each token 01:47:40.440 |
This sequence here will contain the next 128 dimensions of each token 01:47:45.560 |
And the last sequence here will be a sequence of four tokens, and each token will be made up of the last 128 dimensions 01:47:54.600 |
Why are we doing this? Because now we can compute 01:47:59.720 |
The multi-head attention using this stuff here 01:48:02.040 |
Independently from this one independently from this one independently from this one 01:48:07.400 |
because each head has a sequence of four tokens and each token is made up of 128 dimensions 01:48:19.720 |
So we can compute this scaled dot-product attention using the query, key and values, where the query, key and values are not the entire 01:48:27.380 |
Embedding of the token but are only the part of the token dedicated to that specific head 01:48:32.660 |
So this head here suppose the head number one will be using the first 128 dimensions 01:48:38.180 |
This second head will be using the second 128 dimension. The last head will be using the last 128 dimensions, etc 01:48:45.460 |
So that's why we did this transposition: because now we can treat each head 01:48:53.200 |
independently. Each head is working with the four tokens, 01:48:57.200 |
Which is the sequence dimension and each token is made up of the part of the embedding dedicated to that head 01:49:06.240 |
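Here is a shape-only sketch of the steps described so far: the three projections, the view that splits each embedding into heads, and the transpose that swaps the sequence and head dimensions. The numbers 4, 1024 and 8 follow the slide, not necessarily the real SigLIP configuration.

```python
import torch
import torch.nn as nn

batch_size, num_patches, embed_dim, num_heads = 1, 4, 1024, 8
head_dim = embed_dim // num_heads                     # 128

hidden_states = torch.randn(batch_size, num_patches, embed_dim)
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)

# 1) Transform each token independently: shape stays [B, num_patches, embed_dim]
query_states = q_proj(hidden_states)
key_states = k_proj(hidden_states)
value_states = v_proj(hidden_states)

# 2) Split each embedding into num_heads groups of head_dim dimensions, then swap
#    num_patches and num_heads, so each head sees the full sequence but only its
#    own slice of every token's embedding.
query_states = query_states.view(batch_size, num_patches, num_heads, head_dim).transpose(1, 2)
key_states = key_states.view(batch_size, num_patches, num_heads, head_dim).transpose(1, 2)
value_states = value_states.view(batch_size, num_patches, num_heads, head_dim).transpose(1, 2)

print(query_states.shape)  # torch.Size([1, 8, 4, 128]) -> [B, num_heads, num_patches, head_dim]
```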
The next thing that we do in multi-head attention is, well, 01:49:13.440 |
we should do query multiplied by the transpose of the keys, divided by the square root of the head dimension 01:49:22.560 |
Let's calculate the attention weights, which is the query 01:49:29.440 |
multiplied by the transpose of the keys, where we are transposing the second and the third dimension, 01:49:36.320 |
that is the num_patches with the head dimension, because the query is batch_size, num_heads, num_patches, head_dim 01:49:50.240 |
So to multiply it we need it like this, 01:50:02.020 |
such that, if you remember, in the matrix multiplication the inner dimensions cancel out and the outer dimensions remain 01:50:20.080 |
So the head dimension will cancel out with this one and we will be left with num_patches by num_patches 01:50:25.540 |
So the output of this multi head attention basically, it's a matrix that is numPatches by numPatches for each head 01:50:35.760 |
So I know it's not easy to visualize it like this. So let's visualize it on the slides 01:50:41.440 |
So what we are doing is we are multiplying the query with the transpose of the keys 01:50:45.280 |
And then we are dividing by the square root of the head dimension, but we already computed this scale here 01:50:51.120 |
And because it's already one over the square root, we just multiply by it; we don't need to divide by it 01:50:58.080 |
So let's visualize in the slides how this multiplication works 01:51:03.120 |
Okay, we already saw why we do the multi-head attention: because we want to parallelize the computation, etc. So now, what each head is working with is 01:51:17.280 |
one sequence of embeddings, where each embedding is not the full embedding of the token 01:51:23.920 |
but a part of the embedding of each token. So it's a smaller embedding, let's say 01:51:27.920 |
So each head basically will do the following matrix multiplication when you do query multiplied by the transpose of the keys 01:51:38.560 |
And each token is not the full embedding of the token, but it's the first 128 dimensions of each token 01:51:45.440 |
When we do the transpose of the keys each of these row vectors becomes a column vector as you can see 01:51:52.080 |
And when we do this matrix multiplication for each head we will be getting this 01:52:04.280 |
Sequence by sequence because as you can see when you multiply this matrix here by this matrix here 01:52:09.180 |
You get four by four matrix as output because the inner dimensions cancel out 01:52:16.860 |
Each of these numbers represents the dot product of one token with another token 01:52:23.340 |
So you can think of the rows as being the queries and the columns as being the keys 01:52:30.060 |
This one here is the dot product of the first token of the queries with the first token of the keys. Suppose that each of these tokens represents a word: 01:52:40.380 |
then this is the word I, this is the word love, this is the word pepperoni and this is the word pizza 01:52:47.740 |
Then this number here represents the dot product of the word I with itself 01:52:59.340 |
This one here represents the dot product of the first query with the second key 01:53:05.340 |
This one represents the dot product of the first query with the third key 01:53:10.780 |
And we do all the possible dot products as you can see here 01:53:17.740 |
And what does this matrix represent? It represents somehow the relationship between two tokens 01:53:24.380 |
So the bigger the dot product the more intense is the relationship between two tokens 01:53:29.340 |
Actually, it's then defined later. We will see that we apply the softmax 01:53:33.120 |
But you can think of the dot product as being how the self-attention mechanism is relating to tokens 01:53:39.420 |
How intense is the relationship of these two tokens? 01:53:42.140 |
Why do we have this square root of the head dimension as the denominator? Because 01:53:49.340 |
we want to scale this dot product, because usually when you train a model you train multiple variants of it; for example, 01:53:58.860 |
imagine you train multiple variants of the model, and 01:54:08.940 |
you don't want the magnitude of these numbers to change between one try and the next one 01:54:14.220 |
So basically, by dividing by the square root of the head dimension, you keep the magnitude roughly constant 01:54:21.440 |
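As a shape sketch (same illustrative sizes as above), the attention weights are Q times K transposed, scaled by one over the square root of the head dimension:

```python
import torch

batch_size, num_heads, num_patches, head_dim = 1, 8, 4, 128
query_states = torch.randn(batch_size, num_heads, num_patches, head_dim)
key_states = torch.randn(batch_size, num_heads, num_patches, head_dim)

scale = head_dim ** -0.5                                   # 1 / sqrt(d_k), precomputed in the constructor
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * scale
print(attn_weights.shape)  # torch.Size([1, 8, 4, 4]) -> one num_patches x num_patches matrix per head
```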
Now, what is this matrix doing? This matrix tells us how two tokens are related to each other 01:54:29.440 |
Now in language modeling we also apply what is known as the attention mask 01:54:35.760 |
So we don't want the word I to be related to future tokens 01:54:40.000 |
So usually we don't want to compute this dot product 01:54:42.560 |
We don't want to compute this dot product and we don't want to compute this dot product, because we don't want the token I 01:54:47.520 |
to be related to any other token, because there are no previous tokens 01:54:51.440 |
We also don't want the word love to be related to the word the pepperoni and the pizza 01:54:56.560 |
Because they come after it. But we do want, of course, the word pepperoni to be related to the word love, 01:55:02.080 |
so there should be a number here; we don't want to mask out this one 01:55:10.960 |
So if we don't want some interaction between tokens to happen, 01:55:18.640 |
we compute query multiplied by the transpose of the keys as usual, and then we replace all the relationships that we don't want 01:55:24.640 |
with minus infinity. So here we can replace this number here with minus infinity, 01:55:29.620 |
here we can replace this number with minus infinity, and then we can replace this number with minus infinity too, 01:55:41.200 |
so that afterwards, when we apply the softmax, the softmax will convert each of these numbers into a number between zero and one, 01:55:51.840 |
because we want the relationship of one token with the other tokens to be 01:55:56.880 |
between zero and one, and also we want each row to sum to one 01:56:01.760 |
Later, we will see why because actually the when we do the contextualization we are doing a weighted sum, but okay 01:56:11.040 |
Anyway, the point is we apply the softmax row by row. So if we don't want the relationship of two tokens to be 01:56:19.440 |
considered by the attention mechanism, we replace that particular dot product with minus infinity before we apply the softmax 01:56:26.560 |
Because the softmax we saw before is an exponential 01:56:29.840 |
It's e to the power of x, and when x is minus infinity 01:56:34.000 |
It will become zero. So the output of the softmax will become zero for all the interaction that we didn't want 01:56:40.080 |
So that's why we replace it with minus infinity 01:56:50.800 |
So as you can see if we apply the mask before we apply the softmax 01:56:54.080 |
It will replace with zero all the interactions that we don't want 01:57:02.560 |
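To make this concrete, here is a minimal PyTorch sketch of what we just described (the shapes and tensors are illustrative, not the actual PaliGemma code): we scale the dot products, mask the future positions with minus infinity, and apply the softmax row by row.

```python
import math
import torch

seq_len, head_dim = 4, 128
q = torch.randn(seq_len, head_dim)   # queries of one attention head
k = torch.randn(seq_len, head_dim)   # keys of one attention head

scores = q @ k.T / math.sqrt(head_dim)                    # (4, 4): all pairwise dot products, scaled
causal_mask = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))   # hide the interactions we don't want
attn_weights = torch.softmax(scores, dim=-1)              # masked entries become 0, each row sums to 1
```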
This matrix here is known as attention weights 01:57:05.280 |
so it tells us how intense is the relationship between two tokens and 01:57:10.000 |
This matrix here is calculated independently for each single head because here I show you only one matrix here 4 by 128 01:57:25.840 |
So you need to think that you have eight of this matrix if you have eight attention heads 01:57:31.200 |
And in this case in the code, you can see that the output is a list of it's a batch 01:57:39.440 |
Each of these images is managed by multiple heads 01:57:42.960 |
Each of these heads will learn to relate tokens differently 01:57:46.720 |
So each of these heads will give us a numPatches by numPatches matrix or sequence by sequence matrix 01:57:52.000 |
Where each of this number represents how this head is relating two patches with each other 01:57:59.440 |
So now we have seen how to calculate this attention weights 01:58:02.240 |
Which basically it's a matrix that tells you how two tokens are related with each other 01:58:06.400 |
It's kind of a score of how related the attention mechanism thinks two tokens are 01:58:15.840 |
The first thing we do. Okay, we verify the dimension of this matrix 01:58:19.520 |
And then we apply the softmax the softmax as we saw before is a way to convert these attention scores into 01:58:28.560 |
Numbers that are between 0 and 1 and also such that they sum up to 1 01:58:32.560 |
And we do it by soft with the softmax function, which is applied by rows 01:58:41.200 |
What is the meaning of this dimension parameter which tells you how you want to apply it? 01:58:47.760 |
This dimension you can think of as the row dimension, and this one as the column dimension 01:58:52.960 |
So if you apply the softmax over the last dimension, that is over all the columns, it means you are applying it row by row 01:58:57.920 |
then we have the dropout but as I said before we don't use the dropout because 01:59:04.480 |
I didn't see it in the parameters of the polygamma ever being used. So we have it, but we don't use it 01:59:12.000 |
And as you remember the dropout basically takes random 01:59:15.920 |
With the probability p it will set some activations to zero 01:59:20.240 |
So some numbers of this input matrix to zero, but we don't use it 01:59:23.680 |
And it only happens during training and it's a way to reduce overfitting 01:59:30.180 |
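As a tiny check of what the dim parameter means and of how dropout behaves at inference (illustrative values, not the real code):

```python
import torch

scores = torch.randn(4, 4)                     # attention scores for one head
weights = torch.softmax(scores, dim=-1)        # dim=-1: normalize over the columns, i.e. row by row
print(weights.sum(dim=-1))                     # tensor([1., 1., 1., 1.])

dropout = torch.nn.Dropout(p=0.1)              # would zero entries with probability p during training
dropout.eval()                                 # in eval mode it just returns its input unchanged
print(torch.equal(dropout(weights), weights))  # True
```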
The next thing that we do in the multi-head attention is we are multiplying this attention weights matrix with the v sequence 01:59:39.940 |
So we multiply this matmul means matrix multiplication 01:59:43.000 |
We are multiplying this attention weights with the value states, which is the value sequence 01:59:48.500 |
which is a transformation of the input sequence through this Wv matrix, and which is also split by heads 02:00:01.860 |
so the output of the attention mechanism of the query multiplied by the keys is this matrix here where each number represents the 02:00:09.540 |
How two tokens are related to each other by applying the softmax this number become between zero and one in each row 02:00:16.020 |
And also in such a way that they sum up to one 02:00:18.740 |
So here you can see it's 1.0 because there is only one number here. It's 0.4 and 0.6 02:00:25.140 |
So they sum up to one and here is 0.2, 0.4, 0.4. So they sum up to one etc, etc 02:00:31.140 |
Now, when I say that these numbers represent the intensity of how the attention mechanism relates two tokens, it's because when we multiply 02:00:39.720 |
This matrix here, which is in the code is written as attention weights 02:00:45.220 |
We multiply it by the v matrix. So the v sequence for the value sequence 02:00:57.780 |
We are multiplying for example a 4 by 4 matrix by a 4 by 128 matrix 02:01:03.860 |
Where each of this v matrix is one for each attention head just like each of this matrix here 02:01:10.340 |
Attention weights is one for each attention head. So each of these attention heads will be doing this 02:01:15.060 |
Product in parallel. So each attention heads does query multiplied by the transpose of the keys in parallel the softmax in parallel 02:01:28.020 |
I mean not these operations in parallel. It's the attention heads that work in parallel. The operations are sequential, of course 02:01:41.780 |
Product: it's a 4 by 4 multiplied by a 4 by 128. So the output is a 4 by 128, because the inner dimensions cancel out and 02:01:41.780 |
So it will be a matrix with four tokens each token represented by not the full dimensions 02:02:01.140 |
But because we are working with multi-head attention each head will have a smaller part of the embedding of each token 02:02:07.460 |
So it will have 128 dimensions in case we have eight heads and the embedding dimension is 1024 02:02:16.900 |
Will be the dot product of the first row of this matrix with the first column of this matrix 02:02:32.180 |
which means that only this token here will contribute to the output here 02:02:41.220 |
So this stuff here will be the dot product of the first row of this matrix with the second column of this matrix 02:02:48.500 |
But most of the values here are zero except the first one 02:02:52.420 |
Which means that only this token here will contribute to this second number here 02:02:57.140 |
So all the dimensions in this row will be contributed only by the first token 02:03:05.120 |
The dimension of the first token multiplied by the number one 02:03:09.620 |
Because all the other tokens will be multiplied by zero zero and zero 02:03:14.900 |
Let's look at the second row of this matrix here this one here the first number 02:03:20.900 |
So the first dimension of the second row of the output 02:03:24.660 |
Matrix will be the dot product of the second row of this matrix with the first column 02:03:31.860 |
The first two numbers are non-zero and the second two numbers are zero 02:03:35.700 |
Which means that only the dimensions of the first two tokens will contribute to this output embedding 02:03:43.460 |
So for all the dimensions here will only be contributed by the first two tokens because all the other tokens 02:03:52.100 |
They will be multiplied by zeros. So they will not contribute to this output embedding 02:03:56.660 |
That's why we can say that this is a contextualized embedding 02:04:00.200 |
In which the contribution to this contextualization only comes from the first two tokens 02:04:06.740 |
How are they these two tokens contributing? Well each of these numbers in the second 02:04:12.580 |
Token will be multiplied by 0.4 and each of the number in the first token will be multiplied by 0.6 02:04:20.820 |
This you can see it as the first token contributing 02:04:23.880 |
60 percent of the information to this contextualization and the second token contributing 40 percent to it 02:04:37.300 |
So this output here the first number will be the dot product of this third row 02:04:42.740 |
Multiplied by this first column and as you can see here, we have a zero because of the causal mask 02:04:50.100 |
Which means that only the first three tokens will contribute to the third embedding here 02:04:55.220 |
How much each token will contribute? Well, it depends on how are these numbers distributed? 02:05:00.440 |
The first token will contribute 20 percent. The second token will contribute 40 percent and the third token will contribute also 40 percent 02:05:08.740 |
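Here is a small illustration of this weighted sum (the weights below mirror the example on the slide, except the last row, which is made up): multiplying the attention weights by the value matrix mixes only the unmasked tokens, in the proportions given by each row.

```python
import torch

attn_weights = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                             [0.6, 0.4, 0.0, 0.0],
                             [0.2, 0.4, 0.4, 0.0],
                             [0.1, 0.2, 0.3, 0.4]])   # rows sum to 1; the zeros come from the causal mask
v = torch.randn(4, 128)                               # one 128-dim value vector per token (for this head)
out = attn_weights @ v                                # (4, 128): row i is a weighted sum of value vectors 0..i
```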
So that's why, when we talk about the attention weights matrix, we say that it 02:05:18.500 |
Is telling us how intense the relationship between two tokens is, so that the more intense the relationship, the more that token will contribute to the output 02:05:33.120 |
For example, if the word I and the fourth token are very related to each other, then the embedding of the word I will contribute most to the output embedding of this fourth position 02:05:44.660 |
So it means that the fourth contextualized position will be, say, 40 percent based on the information 02:05:50.660 |
contained in the token I and 20 percent on the information contained in the word love, and so on 02:06:00.980 |
So this is why it's known as a weighted sum because you are 02:06:05.460 |
Summing the contribution of each token if it's not masked out 02:06:16.100 |
Calculated using the attention weights matrix here and we do this for each of this head in parallel 02:06:22.200 |
So each head is watching a part of the embedding of each token and it's learning to relate them differently and then doing this weighted 02:06:36.500 |
a list of contextualized embedding but each of this contextualized embedding will not be a full token 02:06:43.060 |
It will be part of what is the full token and now we'll 02:06:46.180 |
We see how we can merge the result of this multi-head attention 02:06:50.500 |
And for that we need to look at the original paper. So if you look at the original paper 02:06:54.660 |
We calculated this multi-head attention in parallel. And how can we merge the result of this multi-head attention? 02:07:02.340 |
Well, we go here and we basically concat these heads 02:07:06.820 |
So we take the output of the first head we concat it with the next we concat with the third head with the fourth 02:07:15.300 |
All the heads, until we get the full dimension of the original token back, because each head is made up of, 02:07:22.340 |
in our case, suppose 128 dimensions. So this will be the first 128 dimensions, then the next 128, and the third 128, etc. 02:07:30.100 |
Until the last 128 dimensions, so we get the 1024 dimensions back 02:07:44.180 |
Will return a contextualized embedding for each position, but it's a contextualized 02:07:53.940 |
That does not include all the original token contextualized but a part of it because each head is working 02:07:59.940 |
In parallel with a part of the embedding of each token, then we concatenate them. So 02:08:05.140 |
What we do is we basically we want to arrive to this stuff here. So we have a contextualized embedding 02:08:15.300 |
Okay, first we need to do I believe a transposition so we need to transpose back because before 02:08:25.300 |
We put the head dimension first and then the sequence dimension 02:08:29.460 |
So now we need again the sequence dimension and then the head dimension after 02:08:37.860 |
Which is for each head. We have a contextualized list of tokens 02:08:43.220 |
We want to get a list of tokens in which each 02:08:48.500 |
Head is contributing its 128 dimensions, which are contextualized 02:08:59.700 |
I believe it's here. So I think there is another checking of the output dimension 02:09:10.820 |
So we do this transposition back. So we did the first transposition here to exchange the 02:09:16.660 |
Number of heads with the sequence dimension. Now we transpose back 02:09:19.780 |
So we go back to the num_patches and num_heads 02:09:24.100 |
So it's a sequence where each token is made up of smaller groups of dimensions, one group per head 02:09:35.000 |
We do this contiguous because we want to reshape. Okay, it doesn't matter 02:09:40.260 |
You don't have to know why we do this contiguous, but basically 02:09:47.920 |
We want the tensor to represent the information in memory in a contiguous way, so that the next operation that we are going to do 02:09:57.920 |
Does not require any computation, because when you do a reshaping or a viewing or a transpose of a tensor 02:10:04.240 |
There is no change in the memory layout of the tensor 02:10:08.880 |
Actually, the PyTorch will just change what is known as the stride of the tensor 02:10:20.960 |
There is this thing called the stride which tells you how 02:10:24.080 |
To go from one dimension to the next without changing the layout of how this tensor is allocated in the memory 02:10:34.480 |
The PyTorch will just change these numbers on the stride. Okay 02:10:41.600 |
But anyway, but this contiguous allow us to have this tensor all in the memory as a contiguous memory allocation 02:10:48.160 |
So that this reshape operation can be done without any extra computation 02:11:02.240 |
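A quick illustration of the stride idea (an illustrative tensor, not the model code): a transpose only rewrites the stride metadata, and contiguous() materializes the new layout so that a later reshape is essentially free.

```python
import torch

x = torch.randn(2, 3, 4)
y = x.transpose(1, 2)             # no data is copied: only the shape/stride metadata changes
print(x.stride(), y.stride())     # (12, 4, 1) vs (12, 1, 4)
print(y.is_contiguous())          # False: the memory layout no longer matches the logical shape
y = y.contiguous()                # copies the data into a contiguous block of memory
z = y.reshape(2, 12)              # now the reshape is just a re-interpretation of that block
```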
So after we have to do a reshape, we did the transpose operation and now we need to do a reshape operation 02:11:07.520 |
So the transpose operation basically allow us to get again at the first dimension the sequence dimension 02:11:19.260 |
And each group contains 128 dimensions. Now, we need to concatenate them. How can we concatenate them? 02:11:25.680 |
Well, we just want to merge these heads again together into one single token 02:11:33.660 |
Reshape operation. So with reshape basically, we are going from (numHeads, headDim) to embedDim, which is numHeads times headDim 02:11:44.060 |
So how does it work? The reshape basically takes these 02:11:53.580 |
Groups and will just merge them. So it will just concatenate them with each other. So instead of being a 02:12:01.020 |
matrix that contains sub-arrays where each sub-array contains multiple sub-arrays and each of these 02:12:08.140 |
sub-sub-array contains 128 dimensions, it will just become a matrix that contains one array that is made up 02:12:15.900 |
1024 dimensions, which is the concatenation of all these heads 02:12:20.940 |
So this is how we merge the information of all this multi-head attention that was done in parallel into one single 02:12:30.780 |
Token that is a contextualized version of the initial token 02:12:34.460 |
So we as you can see we got back the initial shape 02:12:38.460 |
That we started with at the beginning of the multi-head attention. The next step is the 02:12:57.960 |
Multiplication with this WO. So if you look at this concatenation that we have done 02:13:03.000 |
The concatenation basically takes the this tensor this first token here 02:13:09.160 |
Is just the concatenation of the first 128 dimensions, which are the output of the first head then the second 128 dimension 02:13:16.760 |
Then the third 128 dimension and then the last 128 dimension. In total there are 1024 dimensions 02:13:23.880 |
But there has been no mixing between the result of these heads. So it's just a concatenation of multiple 02:13:33.800 |
Each calculation done by one head independently from the others 02:13:37.720 |
But we want the token to not be a concatenation of independent calculations 02:13:43.720 |
We also want to kind of mix the result of these heads with each other 02:13:48.600 |
And the mixing happens when you do this multiplication by WO. The WO matrix is a matrix that is 02:14:00.680 |
As you can see does not change the shape of the input. So we have 02:14:03.400 |
The input of this WO will be a 4 by 1024. We multiply by 1024 by 1024. So it results the same input shape 02:14:16.520 |
Let's look at this number here. This number here is the dot product of the first row 02:14:21.320 |
So the first token with the first column of this matrix 02:14:24.840 |
And the first column of this matrix is 1024 parameters. So the outputs of all of these heads 02:14:38.520 |
Will all participate in the same dot product, giving us one single number here 02:14:43.880 |
So there has been a mixing of the results of these heads. If we don't multiply with the WO, 02:14:49.800 |
There is no mixing between the result of each head which happened independently in parallel 02:14:57.400 |
So we don't want each token to be a contextualized version of multiple subtokens each calculated independently from each other by the multi-head attention 02:15:05.400 |
We want of course it to happen because we want to parallelize 02:15:08.760 |
But then we want to mix the result of this multi-head attention and we do that by multiplying by WO 02:15:16.620 |
For now, we just merge. So this reshape is basically doing the concat that we saw before in the attention paper 02:15:23.020 |
Now we do the multiplication with the WO which is this stuff here. So out projection 02:15:28.540 |
It won't change the shape of the tensor that is input to it 02:15:32.700 |
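Putting the last steps together, this is roughly what the tail of the multi-head attention looks like (a sketch with assumed shapes and names, not the exact code of the video): transpose back, make the tensor contiguous, reshape to concatenate the heads, then mix them with W_O.

```python
import torch
import torch.nn as nn

batch, num_heads, seq_len, head_dim = 1, 8, 4, 128
embed_dim = num_heads * head_dim                                  # 1024

attn_output = torch.randn(batch, num_heads, seq_len, head_dim)    # per-head weighted sums
attn_output = attn_output.transpose(1, 2).contiguous()            # (batch, seq_len, num_heads, head_dim)
attn_output = attn_output.reshape(batch, seq_len, embed_dim)      # concat the heads back to 1024 dims

out_proj = nn.Linear(embed_dim, embed_dim)    # W_O: mixes the independent results of the heads
attn_output = out_proj(attn_output)           # shape unchanged: (batch, seq_len, embed_dim)
```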
And then we return it along with the attention weights. Actually, we will not be using the attention weights 02:15:37.020 |
And now finally we have implemented the multi-head attention 02:15:43.900 |
We forgot to implement this encoder. So we created the layer of the encoder, but we didn't create the encoder itself 02:15:51.660 |
So what we created basically in this vision transformer is this stuff here. So let me open the slides 02:16:02.700 |
But we didn't create the sequence of these layers because an encoder is a sequence of these layers. So let's do it 02:16:08.620 |
It's it's very simple. So this is a single layer 02:16:11.900 |
But we need to create a sequence of them because we apply one after another such that the output of one is 02:16:17.180 |
Used as input for the next one. It's a very simple class. So let's create it 02:16:25.340 |
Constructor so it's just very simple. It's a okay 02:16:28.620 |
We save the configuration then each we create a sequence of layers where each layer is this encoder layer to which we pass the configuration 02:16:37.260 |
How many we create based on how many layers it should have so the transformer layers 02:16:41.820 |
And the forward is very simple. I can just copy it all. It's basically says, okay 02:16:48.780 |
We have the input we give the input to the first layer and the output of this layer becomes the input to the next one 02:16:55.820 |
So we do a for loop and then we return the the output of the last layer 02:17:00.380 |
This is a very simple and as you can see between each layer, there is no change in the shape of the tensor that is fed 02:17:10.380 |
To it. So with this, we have coded all of SigLip, which is our vision transformer 02:17:14.560 |
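This is roughly what that encoder class looks like (a sketch that assumes the SiglipEncoderLayer and the config with num_hidden_layers that we coded earlier):

```python
import torch.nn as nn

class SiglipEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # a stack of identical encoder layers
        self.layers = nn.ModuleList(
            [SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)]
        )

    def forward(self, inputs_embeds):
        # the output of each layer becomes the input of the next; the shape never changes
        hidden_states = inputs_embeds
        for encoder_layer in self.layers:
            hidden_states = encoder_layer(hidden_states)
        return hidden_states
```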
You may think that I have lied to you at the beginning, when we were talking about contrastive learning 02:17:21.820 |
Okay, actually, let's look at it, otherwise we will be left with this doubt. So 02:17:27.580 |
When we were talking about contrastive learning 02:17:29.580 |
We were talking about generating one single embedding for each image 02:17:35.180 |
But here we are generating a sequence of contextualized embedding 02:17:38.880 |
So how can the image generate one single embedding? 02:17:49.740 |
So you give it a list of patches as input and it will give you a sequence of contextualized patches as output 02:17:56.540 |
When working with something like clip, for example, if you want only one single embedding for each image 02:18:02.940 |
You can just take the first output contextualized embedding from the transformer as a representative for the whole image 02:18:09.820 |
Because it will force the model to put all the information in the first contextualized embedding 02:18:17.820 |
Another way is to just take the average of all the output embeddings by the transformer to generate one single embedding 02:18:24.540 |
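As a sketch of those two options (not something PaliGemma does, just the CLIP-style pooling we are discussing):

```python
import torch

patch_embeds = torch.randn(1, 16, 768)            # (batch, num_patches, embed_dim) from the vision transformer

first_token_embedding = patch_embeds[:, 0, :]     # option 1: use the first contextualized embedding as the summary
mean_pooled_embedding = patch_embeds.mean(dim=1)  # option 2: average all the contextualized embeddings
```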
Anyway, this was just a closing note before we move to the next part, which is our language model 02:18:30.620 |
So let's go back to the architecture, which is here 02:18:33.180 |
So we have coded this part here, the vision encoder. So we feed an image to it: 02:18:40.780 |
The vision encoder extracts some patches, each of these patches becomes an embedding, and to this embedding 02:18:47.100 |
We add a positional encoding which is learned 02:18:49.740 |
We send it to this magic box called the transformer layer, which will contextualize them 02:18:54.540 |
We take the output of this contextualization and this becomes our image features 02:19:02.140 |
Now before we can feed it to the language models 02:19:04.860 |
These embeddings may not be of the same size of the embeddings used by the text layer 02:19:10.300 |
So we will need to introduce this linear projection 02:19:14.460 |
So in the next part of the video, we are going to code the language model including this linear projection here 02:19:20.540 |
And we will learn how to merge these tokens the image tokens and the text tokens 02:19:28.540 |
So the next part that we are going to code is basically how to load the image from the disk to convert it into a tensor 02:19:37.500 |
And we will see that the preparation of the text has to be done in a particular way 02:19:43.660 |
Let's see actually why it has to be done in a particular way. So let's open the slides 02:19:48.220 |
Oops, I think I closed it. So let me open it again 02:19:54.060 |
So as you can see, we need to find a way to combine the image tokens with the text tokens 02:20:02.220 |
But we need to create some placeholders for where we will put the image 02:20:09.260 |
Tokens before the text token. So I will use the term image tokens and image embeddings interchangeably 02:20:15.760 |
because you can think of the image embeddings as kind of tokens that represents the image or and the 02:20:21.900 |
Text are the embeddings that represent the text that is the prompt from the user 02:20:26.700 |
so the first thing that we need to do is we need to learn how to load this image into a 02:20:32.300 |
tensor, because as you can see from our SigLip code, the input to SigLip is 02:20:39.020 |
A tensor that is has the channel the height and the width dimension 02:20:44.780 |
transformed into patches and contextualized, etc, etc 02:20:47.420 |
Then we need to tokenize the text. We need to create this list here 02:20:56.140 |
Each corresponding to the text tokens and then we will add some placeholders for where we will put the image tokens 02:21:03.500 |
and then it will be the transformer that will 02:21:08.300 |
Take these placeholders and replace it with the image. So 02:21:11.180 |
I know it's a lot of things to remember. So don't worry. Let's code it and we will see it step by step. So let's go 02:21:18.460 |
We create a new file called, let me check here processing 02:21:35.820 |
We create these two constants and later we will see why we need them 02:21:43.340 |
Okay, let's start from the beginning. So let's create this class called the polygamma processor 02:21:57.020 |
It will take as input the tokenizer how many image tokens? 02:22:02.460 |
We need to generate for the image and what is the image size that this particular gamma will work with 02:22:08.780 |
We save it. We save these two values and then what we do 02:22:13.660 |
We need to add some special tokens to our tokenizer. So now I show you why we need to do it and how it works 02:22:21.100 |
So the tokenizer that polygamma is using is the tokenizer of the gamma model 02:22:26.940 |
But the tokenizer of the gamma model was not created 02:22:30.320 |
With the special tokens for the image. So what they did was they basically created these additional tokens 02:22:43.640 |
So what we saw here in my slide is basically PaliGemma trying to extract information from an image 02:22:50.140 |
So we have an image, we have a prompt, and PaliGemma, which is basically the Gemma model here, is answering the 02:23:01.420 |
Prompt, using the image as additional information for the prompt 02:23:07.180 |
PaliGemma actually can do much more than this. PaliGemma can also do image segmentation, so it can 02:23:14.220 |
Segment a part of the image, for example this leg here 02:23:19.980 |
So it can detect all the instances of, for example, a tree 02:23:24.220 |
If we do object detection for trees, it will probably give us this this okay 02:23:29.020 |
This is not a bounding box this box here telling that this is a tree 02:23:32.380 |
If we do it ask it to detect all the feeds it will give us two 02:23:39.580 |
So PaliGemma can do a lot of this, and the way it does it is by using special tokens 02:23:46.380 |
For the segmentation they are called segmentation tokens, and for object detection they are called location tokens 02:23:53.580 |
And but we will not be using them. So our goal here is just to inference polygamma 02:23:59.340 |
So we will not be working with the object detection or object segmentation 02:24:03.120 |
But if you want more information on how these tokens work, there is a very nice article 02:24:08.940 |
Not only this one from google. So here in google they say 02:24:12.300 |
That polygamma uses the gamma tokenizer, but they extend it with these further tokens that are used to tell 02:24:19.580 |
In the output of the model, where is the segments? 02:24:23.420 |
where is the bounding box position that it has detected or where is the 02:24:29.580 |
Of the segmentation mask that the model has detected 02:24:33.980 |
Another article that I recommend is the hugging face blog article about 02:24:37.980 |
Polygamma, let me find it. I believe it is this one here 02:24:43.180 |
In which they describe how this attention masks work 02:24:47.100 |
So as you can see, PaliGemma can detect the cat and will give us this output, which includes these loc tokens 02:24:57.100 |
Where these numbers, 0094, 0256, tell us the position of the top left, 02:25:03.820 |
Top right, bottom left and bottom right corners of this bounding box here 02:25:11.900 |
Here because we are only interested in using the polygamma as a conditional model for generating an output 02:25:24.540 |
Used by polygamma is adding these special tokens 02:25:27.740 |
We also add them here and how to add them and how many to add them is described in this article 02:25:34.700 |
1024 location tokens for object detection and then 128 tokens for object segmentation 02:25:48.860 |
We also need to create this constant called image token, because when we 02:25:57.880 |
Process our text with the Gemma tokenizer, the Gemma tokenizer will of course only generate 02:26:04.220 |
The tokens for the text, but later we need to also insert among these tokens the image tokens 02:26:11.820 |
So what we do basically is we insert some placeholder tokens that will later be replaced with the embeddings 02:26:19.760 |
Extracted by the visual encoder, and the placeholder token that we will be using is this image token here 02:26:34.300 |
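For reference, this is roughly how those extra tokens are added to the tokenizer (a sketch following the description above; `tokenizer` is assumed to be the Gemma tokenizer passed to the processor, and the exact token strings and counts follow the PaliGemma/Hugging Face write-ups):

```python
IMAGE_TOKEN = "<image>"   # placeholder that will later be replaced by the image embeddings

# register the image placeholder as a special token
tokenizer.add_special_tokens({"additional_special_tokens": [IMAGE_TOKEN]})

# 1024 location tokens (object detection) + 128 segmentation tokens (not used for plain inference)
EXTRA_TOKENS = [f"<loc{i:04d}>" for i in range(1024)]
EXTRA_TOKENS += [f"<seg{i:03d}>" for i in range(128)]
tokenizer.add_tokens(EXTRA_TOKENS)

image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)

# we will add BOS/EOS ourselves when we build the prompt
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
```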
Now how to use this polygamma processor. So the polygamma processor is a special class that given an 02:26:40.780 |
Text which is the prompt of the user and an image will load the image 02:26:46.840 |
Preprocess it, so resize it, rescale it, whatever the vision model needs, and it will also return the 02:26:54.520 |
Text tokens with the placeholders for the image tokens. So let's do it 02:27:02.120 |
Method here, the __call__ method. Why do we create the __call__ method? Well, basically this allows the instance to be called like a function 02:27:12.840 |
So when you create the processor, we will create it like this, like PaliGemmaProcessor, and then we can use it like this 02:27:19.720 |
Passing the arguments here. So this is why we implement the __call__ method 02:27:24.040 |
And the call method takes as input a list of text and the list of images 02:27:28.600 |
but we will actually only accept one text and one images because I don't want to deal with the 02:27:34.200 |
Padding otherwise, it will complicate our code. Our goal is not to make it universally 02:27:39.420 |
Perfect. Our goal is to learn by doing and how it works. Actually, this is this code will be usable 02:27:45.160 |
So we will actually run the inference later, but it will only work with one image and one prompt at a time 02:27:52.760 |
I will try to make the code for fine-tuning this model 02:27:55.400 |
And we will see that we will change this code a little bit to to accommodate for the padding 02:28:02.680 |
Anyway, we need to process these images and we will use a special method called process images 02:28:09.400 |
So if we take each of these images and we need to resize it 02:28:12.920 |
We resize it to the image size that is accepted by this polygamma version. So 02:28:20.680 |
Actually there are multiple sizes of PaliGemma, but this 224 version only resizes the images to the size 02:28:20.680 |
224 by 224 and generates 256 image tokens for each image 02:28:28.100 |
then we rescale this image and later we will see why we do it and then we 02:28:38.100 |
We normalize it using the mean and the standard deviation of ImageNet 02:28:43.540 |
It's not really the ImageNet mean and standard deviation, but later we will see how it works 02:28:47.620 |
Anyway, suppose that this method here will load the image will rescale it will normalize it etc and convert it into 02:28:57.460 |
A tensor that can be then processed by the vision model 02:29:04.980 |
We create here a tensor. So because this will 02:29:08.020 |
Return a list of tensor. We need to create a one single tensor with the batch size 02:29:13.540 |
So we stack them stack basically means that if we have a list of tensor, it will create one single big tensor 02:29:25.220 |
So instead of becoming a list of tensor it will become one big tensor 02:29:28.500 |
This is a NumPy tensor it is converted into a torch tensor 02:29:37.620 |
Create the input to the model. So later we will expand this method. So now I just create them 02:29:44.020 |
What is this method going to do? Well, this method is going to 02:29:49.780 |
Let's check here. It's going to create the tokens of the text and create the placeholder for the image tokens 02:29:59.620 |
We tokenize it using the placeholder tokens for the image 02:30:08.100 |
This stuff I know that I have copied a lot of code. Now, I will explain it one by one 02:30:13.460 |
So let's start at input. We have a list of text and the list of images. Let's process these images 02:30:26.980 |
Okay, the process image takes as input a list of images what is the size that we want of these images 02:30:35.700 |
What is the kind of resampling that we want to do when resizing this image? It can be linear, it can be bicubic, etc. 02:28:44.160 |
Rescale factor if we want to rescale this image and 02:30:50.580 |
And this has the same meaning as the normalization that we do in the neural networks. So we want the 02:30:57.280 |
The image no matter what it represents to always have the same distribution more or less 02:31:05.680 |
And the way we do it is basically we take the image 02:31:10.000 |
Values so the tensor we subtract the mean of all the images that we have in our data set 02:31:16.240 |
And usually we use the mean of the image net data set and the standard deviation of the image 02:31:24.480 |
I don't know why in the Hugging Face implementation they use 0.5, because the actual ImageNet values are not really 0.5 02:31:28.880 |
Each of these numbers is very close to 0.5, but not exactly, so maybe it works anyway 02:31:35.920 |
And we have one for each channel of the image. So one for R, one for G and one for B 02:31:44.000 |
What is this function going to do? First it resizes the image by using this resampling method 02:31:47.760 |
Then it will convert the image into a numpy array 02:31:50.400 |
Then it will rescale it so that the pixel values instead of being between 0 and 255 will be between 0 and 1 02:31:57.440 |
Then it will normalize using the mean and the standard deviation of image net 02:32:01.280 |
And then it will move the channel dimension to be the first dimension. So 02:32:06.320 |
Instead of being a height width channel, it will become channel height width 02:32:11.120 |
Let's implement this very simple method. So there is first the resize 02:32:16.980 |
The resize is just going to resize the image using the 02:32:31.520 |
So it will take the image and it will resize it using this resampling method 02:32:41.360 |
The rescale is just going to rescale the image 02:32:43.680 |
So it will convert each pixel value instead of being between 0 and 255. It will rescale it into 02:32:49.920 |
Between 0 and 1. Why? Because as you can see here, we pass a scale factor of 1 over 02:32:56.060 |
255. So that's why we are multiplying it by this scale 02:33:01.420 |
The next thing that we are doing is normalizing 02:33:04.800 |
normalizing means that we want the each of these values to be 02:33:09.340 |
distributed like it's coming from a Gaussian of mean 0 and variance of 1 and we do it by 02:33:14.380 |
Subtracting the mean and dividing by the standard deviation as you can see here 02:33:22.140 |
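Condensed into one function, the pipeline just described might look like this (a sketch with assumed names; the 0.5 mean and standard deviation follow the values mentioned above):

```python
import numpy as np
from PIL import Image

IMAGENET_STANDARD_MEAN = np.array([0.5, 0.5, 0.5])
IMAGENET_STANDARD_STD = np.array([0.5, 0.5, 0.5])

def preprocess_image(image: Image.Image, size: int = 224) -> np.ndarray:
    image = image.resize((size, size), resample=Image.Resampling.BICUBIC)  # 1. resize
    pixels = np.asarray(image, dtype=np.float32) * (1.0 / 255.0)           # 2. rescale to [0, 1]
    pixels = (pixels - IMAGENET_STANDARD_MEAN) / IMAGENET_STANDARD_STD     # 3. normalize
    return pixels.transpose(2, 0, 1)                                       # 4. HWC -> CHW
```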
I believe we have already implemented everything for the process images 02:33:25.980 |
Now, let's go further. So we have these images we are processing them. So they are still a list of images 02:33:32.700 |
We convert them into they are converted into a list of numpy arrays and we do that here 02:33:38.780 |
As you can see first we convert them into numpy arrays then we rescale, normalize, transpose 02:33:46.860 |
This list of numpy arrays is converted into a single tensor instead of being a list of tensor is becoming one big tensor 02:33:53.260 |
And then we convert it into a torch tensor. This torch tensor 02:33:57.900 |
Is the pixel values that will be fed to the model to the image encoder 02:34:06.060 |
And we need to tokenize it but we need to tokenize it by already accommodating for the position in which we will put the image 02:34:15.980 |
And we do that by processing this each of this text through this function called add image tokens to prompt which as the name implies 02:34:23.260 |
We'll add this image token placeholders to the prompt 02:34:34.300 |
Save it here. It's a long comment because I found a little bug in this one, but okay later I explain to you 02:34:40.300 |
But basically we add some image token placeholders. How many of them? Well, depending on how many image 02:34:46.540 |
Tokens this model needs: in the case of PaliGemma 224, we need 128... 02:34:55.420 |
Oh no, this is the number of text tokens; for the image tokens I think it's 256, if I remember correctly 02:35:01.120 |
Later we can check. I think it's in the config.json. Let's go here 02:35:14.460 |
Then we add the beginning of sentence token and then we add the prompt of the user. It's called the prefix prompt 02:35:20.700 |
How did I come up with this function? I didn't come up with it, I copied it from the 02:35:26.860 |
Hugging Face implementation. But how did Hugging Face come up with this, actually? 02:35:34.300 |
So if we go to the polygama paper, let's go here 02:35:40.780 |
Here they show you how to prepare the input for the gamma model 02:35:47.740 |
Then we have the prompt of the user that tells us what the language model needs to do with these images 02:35:54.380 |
So if as you saw the example before in in the introduction 02:36:00.380 |
So we want the language model to tell us where is the photographer resting by looking at this image and the model will generate this output 02:36:09.900 |
So this is the prefix and the prefix is built by first taking okay 02:36:14.460 |
We take the image tokens and we are adding them here and based on how many this model particular size of polygama needs 02:36:21.580 |
then we have the beginning of sentence token and this one then we have the tokens of the 02:36:27.260 |
Prefix, which is the task that we want the language model to perform 02:36:32.140 |
And then we have a separator the separator token is a slash n. So it's the new line 02:36:41.740 |
So we have this beginning of sentence token. So then we have the token the the task 02:36:47.100 |
The the prompt by the user based on what task we want the language model to do and then we have the separator token 02:36:54.380 |
Which is a \n. Now, in the paper they say that they tokenize the \n separately, 02:37:02.540 |
so the \n needs to be tokenized separately from the rest of the 02:37:06.940 |
Input, because we don't want the \n to be merged 02:37:13.260 |
With the prompt by the tokenizer. So, as you know, the tokenizer will convert a sequence of 02:37:19.580 |
Characters into tokens and if in the dictionary of the 02:37:27.100 |
Language model there is a token that already contains the \n. Suppose that we ask the language model: tell me, where is the photographer 02:37:37.900 |
and then we have this new line. Suppose that in the vocabulary of the 02:37:43.260 |
Language model there is a token that is made of some characters 02:37:48.940 |
Followed by \n; then it would become one single token 02:37:52.860 |
So suppose that this one becomes the token number three, and then there is another token, say " photog", 02:37:58.620 |
Which becomes the token number five, and then another token which is the token number six, etc. 02:38:05.180 |
So we don't want the \n to be merged with whatever comes before it 02:38:09.340 |
So in the paper they recommend to tokenize it separately. That's why I wrote this 02:38:14.860 |
Comment here to note that it should be tokenized separately, but I don't know why in the Hugging Face implementation they do it this way 02:38:23.580 |
It could be a bug or it could be some other indication that I am missing 02:38:27.660 |
So I just note it for now; later I will investigate and probably ping the Hugging Face team 02:38:31.900 |
But for now, we just need to think how we prepare the input 02:38:35.500 |
So the input is prepared like this a number of input image tokens 02:38:39.500 |
What is each of this image token? It's this placeholder token that we created here this image token 02:38:46.940 |
how many of them depending on the size of the model and we have this beginning of sentence token and then we have the 02:38:52.220 |
Prefix the prompt of the user and then we have the slash n. We take all of this and we tokenize it 02:39:02.380 |
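So the prompt template boils down to something like this (a sketch mirroring the helper just described; note that the \n is appended here, even though the paper suggests tokenizing it separately, as discussed above):

```python
def add_image_tokens_to_prompt(prefix_prompt: str, bos_token: str,
                               image_seq_len: int, image_token: str) -> str:
    # [<image> repeated N times] + BOS + user prompt + "\n" separator
    return f"{image_token * image_seq_len}{bos_token}{prefix_prompt}\n"
```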
And we return this stuff here. So we return this input 02:39:05.740 |
Which is the input IDs and the attention mask that will be generated by the tokenizer 02:39:11.200 |
In this case, we are not using any padding. So the attention mask will be just a list of ones 02:39:16.060 |
So what is the input IDs? As you remember tokenizer converts the text into 02:39:21.580 |
A list of numbers where each number represents the position in the vocabulary of each token 02:39:27.020 |
So these are not embeddings. These are just input IDs 02:39:30.700 |
So it's a list of numbers where each number represents the token position in the vocabulary 02:39:35.440 |
So imagine our vocabulary is made up of words 02:39:38.940 |
So the word hello the sentence hello world may be tokenized as follows 02:39:47.100 |
It may be tokenized as a list of two tokens, for example, three tokens 02:39:52.300 |
For example, the first one corresponding to the word hello 02:39:54.860 |
Then the one corresponding to the space and then one corresponding to the word world 02:39:59.980 |
Suppose it's the token number nine. So these are called input IDs. So it's not an embedding 02:40:07.740 |
Then by the embedding layer, this will be converted into embeddings, which will be one 02:40:14.540 |
Vector for each token. So with the suppose 1024 dimensions 02:40:23.180 |
1024 dimensions then for the second token another 1024 dimensions, etc, etc, etc 02:40:29.340 |
So this is how we prepare the input. So for now, we have resized the image converted into a tensor 02:40:35.740 |
Then we have taken our prompt. We have added some placeholder tokens for the image then we have 02:40:41.980 |
Added the prompt of the user and then the \n character, as indicated by the PaliGemma paper 02:40:47.440 |
And now our processor will return this stuff. Now, we need to understand what to do with this stuff 02:40:53.500 |
So we need to code our language model. All right guys, so let's continue our journey by creating another file here called 02:41:03.160 |
Which will be our language model. So the language model that will decode the answer of the 02:41:11.740 |
Using the prompt or given by the user and the image that we have provided as input 02:41:16.300 |
So we create this file. We import a little bit of stuff the usual stuff 02:41:21.740 |
So torch, some math, typing, and then we import the SigLip model that we have created before, so the vision model and the configuration that it needs 02:41:29.580 |
Let's do a top-down approach, which means that we first create the structure of the model and then we create each single component 02:41:46.720 |
Our main class will be called PaliGemmaForConditionalGeneration 02:41:52.720 |
So why it's called conditional generation? Because we are conditioning the generation of text on the image that is provided as input 02:41:59.680 |
This is why it's called conditional generation 02:42:04.240 |
Also because of how we create the attention mask, which we will see later: we are attending to all the tokens of the 02:42:09.760 |
prompt of the user and all the tokens of the image 02:42:14.400 |
Without any causality so it's used like a condition, but we will see that later. So 02:42:20.640 |
The constructor accepts a configuration file, which we are going to create now 02:42:24.960 |
It will create an instance of the vision model. So the encoder of the image it will create this multi-modal projector 02:42:32.000 |
Which is a linear layer. Let's actually visualize it all these components 02:42:35.940 |
So we go here and then we open this stuff. So basically the multi-modal projector is this 02:42:43.840 |
linear layer you can see here linear projection 02:42:50.960 |
Contrastive vision encoder, and then we have GemmaForCausalLM, which is our transformer decoder 02:42:58.000 |
So this class, PaliGemmaForConditionalGeneration, is actually the class that will 02:43:02.080 |
Connect all these components together 02:43:05.760 |
I don't know why my pen is not working my ipad pen 02:43:15.520 |
All right, so we have created this it will create an instance of the language model 02:43:20.880 |
It will save some stuff like: what is the language model, what is the vision tower (the image encoder), and what is the multi-modal projector, 02:43:28.720 |
which is the linear layer that will convert the size of the embeddings output by the 02:43:32.720 |
Vision encoder into the size of the embedding of each text token, so that they can be concatenated together 02:43:43.200 |
We need to create another method called tie weights and we will see later what is this about 02:43:51.200 |
Or actually we can check now what this is about 02:43:55.280 |
so tie weights basically means this so let's go back to our 02:43:59.440 |
Here and let's open the attention mechanism. And actually let's open the transformer model 02:44:14.080 |
And specifically in the case of language model most language models are in decoder only language model 02:44:19.600 |
Which means that they are only made up of this part of the transformer without the cross attention 02:44:27.840 |
So it's they are made up of a self-attention with the normalization then a feed forward with the normalization a lot of layers like this 02:44:35.840 |
so one after another then we have a final linear layer that projects the embedding output by these layers into 02:44:42.800 |
Logits, and then we have the softmax to understand which of these tokens has the maximum 02:44:48.540 |
Probability score given by the language model 02:44:52.540 |
the job of this linear layer is basically to convert the embedding of the 02:44:56.940 |
Contextualized embedding output by the last layer of this series of layers 02:45:02.060 |
Into the vocabulary size, which is exactly the opposite of the job that this embedding 02:45:07.500 |
Layer is doing. So the embedding layer is converting the token ids, 02:45:14.140 |
So the position of each token in the vocabulary into an embedding while this 02:45:18.300 |
Linear layer here is doing exactly the opposite converting an embedding into its position in the vocabulary 02:45:30.300 |
So we can use what is known as weight tying, which basically shares the parameters of this layer and this layer, because they are doing basically one the inverse job of the other 02:45:37.580 |
Which is also a technique actually to reduce the total parameters of the model because if you are sharing these parameters 02:45:47.900 |
And in many language models this depending on the vocabulary size 02:45:51.180 |
These parameters can be actually quite expensive on the overall total number of parameters of the model 02:45:56.220 |
So it could be like 10% of the parameters in this layer here 02:45:59.180 |
So if you are sharing them, you are actually reducing the number of parameters 02:46:03.020 |
Let's say by 10%, depending on how many tokens you have in the vocabulary 02:46:08.300 |
So we created this method here tie weight and later we will implement it also in the language model 02:46:15.740 |
That will tie the weights of these two layers 02:46:18.780 |
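In code, weight tying is just one assignment (a sketch; the attribute names self.model.embed_tokens and self.lm_head are the ones we will assume when we code Gemma):

```python
def tie_weights(self):
    # the output projection (hidden_size -> vocab_size) reuses the parameters of the
    # input embedding matrix (vocab_size -> hidden_size), since one does the inverse job of the other
    self.lm_head.weight = self.model.embed_tokens.weight
```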
Okay, now that we have seen also this one. Let's go further, which is the implementation of the forward method. So 02:46:26.060 |
So we implemented the forward method as follows so it accepts the input ids 02:46:32.940 |
What are the input ids? The input ids will be the ones produced by this processor, so they will contain 02:46:42.920 |
Some image tokens. So a lot of tokens like this: image, image, image, image 02:46:47.720 |
How many depending on the size of polygama we are using? 02:46:50.600 |
Then it will contain a beginning of sentence token. Then it will contain the prompt of the user 02:46:56.840 |
So for example, tell me where is this photographer and then a new line 02:47:01.560 |
Character the token corresponding to the new line character 02:47:07.800 |
Yeah, text, okay, so, then we have the pixel values which is the 02:47:12.200 |
Again the image loaded by this PaliGemma processor, which is the image 02:47:23.300 |
Rescaled and normalized using this ImageNet standard mean and standard deviation 02:47:31.020 |
It is converted into a tensor and then provided as is 02:47:37.480 |
Then the goal of this PaliGemmaForConditionalGeneration will be to take this image and feed it to the image encoder to get the features extracted 02:47:45.640 |
Then we have this attention mask. The attention mask is provided directly by the tokenizer 02:47:49.880 |
So whenever you tokenize text using a tokenizer, it gives you two output. One is the input ids and one is the attention mask 02:47:55.880 |
Because we will not be using any padding, the attention mask will be a series of ones 02:48:00.360 |
Later, we will see how we also need to modify the attention mask 02:48:05.640 |
But actually we will not be modifying because we will not be using any padding so 02:48:09.800 |
Yeah, then we have the KV-Cache, which we will talk about later when we actually use it 02:48:14.920 |
So for now just consider it as something that you don't know anything about and later we will discuss 02:48:27.880 |
We have first we make sure that we are not using any padding because I didn't implement the code to manage the padding 02:48:34.440 |
Then we extract the input embeddings of the text tokens and the image placeholder tokens 02:48:40.200 |
So in the language model, we have added a fictional token, this image placeholder token, 02:48:47.640 |
Which will be converted into an input id so it will be converted into a number which corresponds to its position in the vocabulary 02:48:53.980 |
What we are doing is we are converting all the input tokens 02:48:58.520 |
Which are the image tokens the beginning of sentence token the tokens of the prompt plus the new line character 02:49:06.920 |
of course the embeddings produced by the image placeholder tokens will be 02:49:10.280 |
Junk because we will not be using them because they do not correspond to the actual image features 02:49:14.920 |
But later we will replace them inside of this one with the correct one 02:49:19.400 |
so now we have this input embeddings the first thing we do is we 02:49:24.200 |
Extract the features of the image and we do it like this 02:49:27.320 |
So we feed the pixel values of the image, which is a tensor directly to the vision tower. So the vision tower is our 02:49:34.280 |
Siglip vision model. So it means that we are using the forward method here. So we are feeding the pixel values here 02:49:41.640 |
It will extract what it will extract some patches with their contextualized embeddings 02:49:52.340 |
Patches and each of these patches is a contextualized patch actually 02:49:56.180 |
The second thing we are going to do is we are going to resize these image embeddings into the same size as the text embeddings 02:50:12.100 |
So we take the image embeddings extracted by the vision encoder and then we resize them using a linear layer called the multi-modal projector 02:50:20.340 |
So later we will see this is actually just a linear layer that will convert this embedding 02:50:25.300 |
So this embed dimension extracted from the vision encoder into the hidden size 02:50:29.540 |
Which is the same embedding size used by the language model for each of this each of its tokens 02:50:34.420 |
Now we need to merge the tokens extracted from the vision 02:50:41.300 |
Model with the text token extracted from these embeddings which already contain some placeholders for where we should put the image tokens 02:50:50.420 |
And for that we will create another method called 02:50:55.700 |
Called merge input ids with image features in which we pass the image features extracted from the vision encoder the input 02:51:02.740 |
Embeddings extracted from the language model with which already contains the placeholders 02:51:07.720 |
the input ids which are the original input ids fed to the 02:51:11.620 |
The tokens fed to the language model, the attention mask given by the tokenizer, and the KV-Cache, which we will see later 02:51:20.500 |
Suppose that these input features have been merged so we will get these input embeddings these input embeddings. What are they? 02:51:41.940 |
So what we are doing is basically we are creating this stuff here. So we are taking the 02:51:46.660 |
First we are taking the image features extracted by the vision encoder and these 02:51:52.500 |
Then we are resizing them using this multimodal projector, which is this stuff here 02:51:57.300 |
Which will resize the each embedding vector to the correct size so that they can be concatenated with the 02:52:08.660 |
When we tokenize them, they already contain some placeholder tokens, which are those image tokens 02:52:14.500 |
We saw before in the processing_paligemma.py file 02:52:17.460 |
Our goal is to replace each of them with the features extracted from this vision encoder after it has been resized by the multimodal projector 02:52:29.060 |
So this method takes the image features extracted after 02:52:31.940 |
They have been resized the input embedding extracted from the language model which contains the text tokens and the placeholders 02:52:40.580 |
So suppose that now it everything has been replaced. So we treat it as a black box 02:52:44.580 |
What we are going to do we are going to feed all this sequence 02:52:47.300 |
Which is a sequence of image features and the text tokens to the language model 02:52:53.540 |
Use the prompt of the user which are these tokens and the image fed by the user to generate some text 02:52:59.380 |
So let's implement this part here, which is just calling a method 02:53:09.540 |
Because it's just calling a method and later we will implement this language model 02:53:13.620 |
So for now, I created the structure of what we are doing 02:53:16.420 |
So we extract first we tokenize the text the text already contains placeholders 02:53:21.060 |
We replace these placeholders with the features extracted from the vision encoder. We feed everything to the language model. The language model will 02:53:27.060 |
Generate some output and we return this output 02:53:30.020 |
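To keep the big picture in mind, here is a condensed sketch of the forward pass we just outlined (the method and attribute names follow this walkthrough, not a definitive implementation; it skips the assertions and the return plumbing):

```python
def forward(self, input_ids, pixel_values, attention_mask=None, kv_cache=None):
    # 1. text token ids (with <image> placeholders) -> embeddings; the placeholder embeddings are junk for now
    inputs_embeds = self.language_model.get_input_embeddings()(input_ids)

    # 2. image -> contextualized patch embeddings -> projected to the text embedding size
    image_features = self.multi_modal_projector(self.vision_tower(pixel_values))

    # 3. replace the placeholder embeddings with the projected image features
    inputs_embeds, attention_mask, position_ids = self._merge_input_ids_with_image_features(
        image_features, inputs_embeds, input_ids, attention_mask, kv_cache
    )

    # 4. let the language model generate, conditioned on image + prompt
    return self.language_model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        kv_cache=kv_cache,
    )
```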
Now our goal is of course to implement all of these blocks that we have created that we have taken for granted for now 02:53:36.820 |
The first thing that we can do is to implement this polygamma config which will give us some understanding of what are 02:53:41.700 |
What is the kind of configuration that this polygamma needs? 02:53:44.500 |
For that, we need to create this PaliGemmaConfig 02:53:51.800 |
Okay, the PaliGemmaConfig basically takes as input... so, what is PaliGemma made of? 02:54:00.340 |
So what is Gemma, what is PaliGemma, and what is SigLip? 02:54:05.300 |
I think you should already have an understanding of it by now. So PaliGemma is all of this stuff here 02:54:11.060 |
So it's a combination of a vision encoder and a text decoder language model, so a Gemma model 02:54:18.980 |
It's composed of a SigLip vision encoder along with a linear layer that will change the embedding size 02:54:24.660 |
And it's made up of a language model called the Gemma language model 02:54:29.540 |
So the polygamma needs of course the configuration for this block here 02:54:33.860 |
So the language model and the configuration for the vision encoder so that it can create an instance of 02:54:39.300 |
This SigLip class and of this Gemma language model, passing their own configurations to them 02:54:47.300 |
So you have the vision config which is the configuration of the vision encoder the text config which is the configuration of the text 02:54:56.340 |
The ignore index is not used. We will not be using it for labels 02:55:00.340 |
So if you are training, but we will only doing inference 02:55:02.820 |
The image token index is the token corresponding to the placeholder image token. So the 02:55:08.500 |
This token here. So let's this this stuff here 02:55:11.780 |
The vocabulary size. So what is the vocabulary size of the model? 02:55:19.300 |
What is the final dimension that the image features should be resized to before feeding to the language model? 02:55:25.940 |
So what is basically the output size of this linear layer? 02:55:30.580 |
Then we have the hidden size which is the embedding size of the language model 02:55:35.460 |
So the language model has some tokens. These tokens are embeddings and these embeddings have a dimensions. How many dimensions? 02:55:44.900 |
This stuff is something that HuggingFace needs we will not be using it 02:55:50.980 |
We save the padding token id, in case it's passed; then we save the vision config 02:55:55.060 |
We save the text config, and then we create the configuration of the text language model 02:55:59.220 |
Which is the Gemma model, to which we pass of course the text configuration, and to the vision encoder we pass the vision configuration 02:56:10.100 |
Then how many image tokens each image will generate, which is basically (the size of the image divided by the patch size) squared 02:56:17.140 |
So it's actually how many patches you get for each image 02:56:21.300 |
Um, which is also corresponds to how many image tokens you get here 02:56:26.500 |
Because of course if you divide the image by four you get four patches 02:56:31.700 |
If you divide it in smaller parts, you get more patches and each a polygamma size 02:56:36.420 |
So PaliGemma 224, I think, has 256 tokens; another one has more, etc., etc. 02:56:44.100 |
Um, the projection dimension is how we want to resize this image tokens, etc 02:56:49.620 |
So now let's create also the configuration for the gamma model 02:56:52.660 |
which is just the configuration of any language model because it has 02:56:57.060 |
A vocabulary size how much tokens we have in our vocabulary the hidden sizes. So what is the size of the embedding? 02:57:04.820 |
Embedding vector of each token the intermediate size of the feed-forward layer as we saw before 02:57:12.020 |
In Sigleap the number of hidden layers. So how many layers our transformer has in this gamma language model 02:57:18.740 |
How many attention heads we have? Okay here we have a difference 02:57:22.340 |
This is called the grouped query attention when you have a different number of heads for the query and for the key and values 02:57:28.340 |
the number of heads here refers to the number of heads for the 02:57:32.420 |
Queries and the number of heads for the key and values is this parameter here. We will see later how it works 02:57:40.180 |
Dimensions each head will work with as we saw before we divide this big embedding into smaller groups one dedicated to each head 02:57:47.860 |
This is how many dimensions each head will watch 02:57:53.560 |
But actually it will come from the configuration file of the polygamma model that we will load 02:58:09.860 |
We will load all this configuration from this config.json file 02:58:13.700 |
Which as you can see contains this text config this visual config which contains exactly the information that we need here 02:58:20.500 |
This max positional encodings indicates how much the maximum number of positions our model has been trained upon 02:58:28.740 |
Which is necessary for the rotary positional encodings 02:58:33.380 |
RMS norm we will see later: what is the RMS normalization. But just like the layer normalization 02:58:39.460 |
We have this parameter called rms_norm_eps. Okay, I will explain it later 02:58:43.940 |
Actually, rope_theta is another parameter of the rotary positional encoding, which is the base frequency 02:58:56.420 |
Then whether we want the bias, because as you remember we have the Wq, Wk and Wv matrices 02:59:00.900 |
These are linear layers and we can have also the bias term, but we I believe we never use the bias for this 02:59:07.300 |
And it looks like we yeah, we don't use any bias for it. So if they don't overwrite it then it remains false 02:59:14.340 |
Dropout, just like before, we are not going to use it; and the padding token id. And we save all this stuff. So nothing sophisticated here 02:59:21.920 |
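To make the list of parameters above concrete, here is a minimal sketch of what such a text configuration could look like. The attribute names and default values are assumptions for illustration only; the actual values come from the config.json we load.

    # A minimal sketch of a Gemma-style text configuration (hypothetical names/defaults;
    # the real values come from the config.json of the checkpoint we load).
    class GemmaConfig:
        def __init__(
            self,
            vocab_size,                     # how many tokens are in the vocabulary
            hidden_size,                    # size of the embedding vector of each token
            intermediate_size,              # size of the feed-forward hidden layer
            num_hidden_layers,              # how many transformer (decoder) layers
            num_attention_heads,            # number of heads for the queries
            num_key_value_heads,            # number of heads for keys/values (grouped query attention)
            head_dim=256,                   # how many dimensions each head works with
            max_position_embeddings=8192,   # maximum number of positions (for rotary positional encodings)
            rms_norm_eps=1e-6,              # epsilon of the RMS normalization
            rope_theta=10000.0,             # base frequency of the rotary positional encodings
            attention_bias=False,           # whether Wq, Wk, Wv linear layers have a bias term
            attention_dropout=0.0,          # we will not use dropout
            pad_token_id=None,
            **kwargs,
        ):
            self.vocab_size = vocab_size
            self.hidden_size = hidden_size
            self.intermediate_size = intermediate_size
            self.num_hidden_layers = num_hidden_layers
            self.num_attention_heads = num_attention_heads
            self.num_key_value_heads = num_key_value_heads
            self.head_dim = head_dim
            self.max_position_embeddings = max_position_embeddings
            self.rms_norm_eps = rms_norm_eps
            self.rope_theta = rope_theta
            self.attention_bias = attention_bias
            self.attention_dropout = attention_dropout
            self.pad_token_id = pad_token_id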
Now, the first thing that we are going to do, since we have already implemented PaliGemmaForConditionalGeneration 02:59:28.160 |
I believe that the first thing that we can do is this method here, merge_input_ids_with_image_features 02:59:33.760 |
But for that we will need to understand what the KV cache is 02:59:37.120 |
All right. So let's start coding this method. So 02:59:40.800 |
Let me go also here in the code that I have already written. So I will code it piece by piece 02:59:51.760 |
So we create this method which has this signature 02:59:58.240 |
And let's extract. Okay. The first thing we do is we extract some information from the inputs 03:00:05.460 |
Which are: the embedding dimension of the image features. Why do we already know it? 03:00:14.400 |
Because we pass them after sending them through this multimodal projector 03:00:18.320 |
So they have already been resized to the same size of the text tokens 03:00:22.000 |
Then we have these input ids, which tell us how many tokens we have. The input ids 03:00:26.480 |
If you remember correctly, are not the embedding of each token 03:00:30.080 |
It's the number indicating the position of each token in the vocabulary 03:00:33.060 |
While the input embeddings are the embedding of each token after they have been extracted from the embedding layer of the language model 03:00:48.400 |
The first thing that we do is we scale these image features 03:00:54.000 |
We scale these image features, which also helps. It's the same kind of scaling that we use 03:01:00.400 |
In the attention mechanism, where we do query multiplied by the transpose of the keys divided by the 03:01:06.080 |
Square root of d_model. Here we do the same kind of scaling 03:01:11.760 |
Because probably they have tried multiple variations of the model and we want the magnitude of the numbers to remain the same 03:01:18.480 |
That's why we divide by the square root of the hidden size. So if you want to double, for example, the embedding 03:01:25.440 |
Size of the image features, you want the magnitude of the numbers to remain more or less the same. That's why you scale them 03:01:35.360 |
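A minimal sketch of this scaling step, assuming the projected image features are in a tensor called image_features and the text hidden size is config.hidden_size (both names are assumptions):

    # Same idea as the 1/sqrt(d) scaling in attention: keep magnitudes comparable.
    scaled_image_features = image_features / (self.config.hidden_size ** 0.5)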
Now the first thing that we need to do is to create the final tensor that will hold the combined 03:01:41.380 |
Features of the image tokens and the text tokens and this is and it's this tensor here 03:01:46.560 |
It's made up of zeros and it has the size of batch size 03:01:50.000 |
Sequence length. So what is sequence length? The sequence length is the number of input ids we have 03:01:55.520 |
What are these input ids? The input ids are coming from the PaliGemma processor 03:02:02.720 |
Which are the placeholders for the image tokens, the 03:02:08.640 |
Tokens of the prompt and the newline character 03:02:11.760 |
So the token corresponding to the newline character 03:02:15.140 |
So we create this sequence of empty embeddings, each of size embed_dim 03:02:23.140 |
The embedding dimension, which is the same size as the embedding vector of the language model, because the image 03:02:29.120 |
Tokens and the text tokens will have the same size, which is embed_dim here 03:02:33.680 |
We want it to be of the same dtype 03:02:37.520 |
So if the input embeds are in float32, so will this tensor be, and we put it on the same device 03:02:43.120 |
The first thing that we do is we create some masks that will be useful for understanding which is a placeholder token 03:02:50.160 |
Which is a text token and which is a padding token, even though we will not be using any padding 03:02:54.640 |
So I just took the original implementation, which was already handling the padding, but we will actually never have padding tokens 03:03:03.600 |
Well, a text token is something that is not an image placeholder token and is not a padding token 03:03:10.560 |
An image token is something that is equal to the image placeholder token, and the padding tokens are the tokens that correspond to the padding token id 03:03:21.360 |
These masks will be useful for us to understand where to put the embeddings of the image tokens in this 03:03:25.920 |
Final embedding tensor, where to put the text tokens in this final embedding tensor, and where to put the padding tokens in this final one 03:03:37.440 |
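Here is a minimal sketch of how such masks can be built from the input ids (attribute names like image_token_index and pad_token_id are assumptions for illustration):

    # Boolean masks of shape (batch_size, seq_len); names are illustrative.
    text_mask = (input_ids != self.config.image_token_index) & (input_ids != self.pad_token_id)
    image_mask = input_ids == self.config.image_token_index
    pad_mask = input_ids == self.pad_token_id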
Here we see them, and later we will see why we need to expand them. So basically we already have the 03:03:49.120 |
Batch size dimension and the sequence dimension 03:03:52.100 |
Because they are already given by the input ids 03:03:59.600 |
What we are creating is the embedding dimension, and then we are expanding the mask along this embedding dimension 03:04:07.200 |
Later we will see why we need it. So basically this means that 03:04:10.560 |
The text mask here. So let me draw a sample of how it may look like 03:04:22.400 |
Will be something like this. Suppose that the 03:04:25.520 |
Input ids start with the tokens corresponding to the image placeholders 03:04:34.780 |
So we have many tokens corresponding to the placeholders for the image, then we have the beginning of sentence token, the tokens of the prompt, and then the 03:04:55.040 |
\n token. Suppose it is the token number two 03:04:59.760 |
The text mask here will be basically something like this: it will be zero, zero, zero, zero, zero for the image placeholders 03:05:10.000 |
And then it will be one, one, one, one, one, one, and then it will be zero 03:05:16.480 |
Uh, actually one, because the \n token is still treated as part of the text prompt 03:05:24.400 |
The image mask will be one, one, one, one, one and then a series of zeros, because all the others are text tokens 03:05:34.400 |
And the padding mask will be equal to all zeros. I don't write all of them, but you can understand: all zeros, because we don't have any padding token 03:05:45.280 |
This expand basically repeats these zeros and ones along this dimension, the embedding dimension that we are adding here with this unsqueeze 03:05:53.940 |
And we will need it later for another method, which is the torch.where method 03:05:59.040 |
So for now, just keep in mind: we are expanding these masks by repeating the series of zeros and ones along a new dimension 03:06:05.060 |
So the first thing that we do is we copy the text 03:06:09.660 |
Embeddings into this final embeddings and we do this by using this method. So we say this final embeddings 03:06:16.000 |
This torch.where method basically says that if this condition is true 03:06:20.620 |
It will take the input from the second argument. Otherwise, it will copy the third argument 03:06:26.620 |
So if wherever this condition is true, it will copy this stuff here wherever this condition is false. It will copy this stuff here 03:06:42.380 |
We copy the embeddings from the input embeds, which contain the text tokens plus the placeholders for the image, but only at the 03:06:51.740 |
Text token positions, because for the image tokens we will have zero in this mask 03:06:58.940 |
Otherwise just keep the final embedding as it is 03:07:08.860 |
Then we copy the image features, using another method called masked_scatter. We cannot use torch.where because the sequence length of the 03:07:17.980 |
Scaled image features is not equal to the sequence length of the final embedding 03:07:22.300 |
But basically this does the same job as the where 03:07:25.500 |
So what we are saying is that copy from the scaled image features where this stuff is true 03:07:33.500 |
So we are copying the image features where the image mask is true 03:07:38.620 |
Where we have the placeholder tokens for the image so we are copying in the final embedding the image tokens 03:07:50.620 |
And the padding we just zero out everything because we don't care about what is in the paddings 03:07:55.840 |
So what we are saying is that wherever the padding mask is true 03:07:59.100 |
Just copy a zero a tensor made up of zero. Otherwise keep the final embedding as it is 03:08:03.980 |
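Putting the three copies together, here is a minimal sketch with hypothetical tensor names, assuming text_mask, image_mask, pad_mask, inputs_embeds, scaled_image_features and final_embedding are the tensors described above and torch is already imported:

    # Expand the boolean masks to the embedding dimension so they match final_embedding's shape.
    text_mask_expanded = text_mask.unsqueeze(-1).expand(-1, -1, embed_dim)
    image_mask_expanded = image_mask.unsqueeze(-1).expand(-1, -1, embed_dim)
    pad_mask_expanded = pad_mask.unsqueeze(-1).expand(-1, -1, embed_dim)

    # Copy the text embeddings where the text mask is true.
    final_embedding = torch.where(text_mask_expanded, inputs_embeds, final_embedding)
    # Copy the image features where the image mask is true (sequence lengths differ, so use masked_scatter).
    final_embedding = final_embedding.masked_scatter(image_mask_expanded, scaled_image_features)
    # Zero out the padding positions.
    final_embedding = torch.where(pad_mask_expanded, torch.zeros_like(final_embedding), final_embedding)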
Now comes the interesting part so for now we have created the final embeddings 03:08:10.620 |
What is the final embeddings is this stuff here. So let me show you again from the ipad. It's this stuff here 03:08:16.620 |
So now here we have the first image token embedding 03:08:20.140 |
second image token embedding third image token embedding blah blah up to 03:08:25.960 |
256 image token embeddings in the base version of PaliGemma, if I remember correctly 03:08:30.360 |
And then we have the embeddings of the tokens corresponding to the prompt 03:08:35.080 |
Plus the padding but the padding we will never have because I excluded it from my implementation 03:08:44.440 |
The next thing we need is the creation of the attention mask, and the attention mask has to be created in a particular way 03:08:55.320 |
And for that I need to introduce the KV cache. So that's why this part is interesting. So let's go 03:08:59.880 |
So let's talk about this thing called KV cache 03:09:02.920 |
But before we talk about the KV cache, we need to understand what is the problem that the KV cache is solving 03:09:10.840 |
So as I we saw before the transformer can be thought of as a model as it's a sequence to sequence model 03:09:16.680 |
Which means that you feed it a sequence of n tokens and you get as output n tokens 03:09:22.440 |
These n tokens as output are not normal tokens anymore 03:09:25.960 |
They are contextualized tokens means that each of them is not capturing information only about itself 03:09:30.920 |
But also about other tokens which depend on the mask that you use if you use the causal mask 03:09:35.880 |
It means that only each token will only capture information about itself and all the previous tokens 03:09:41.320 |
If you are not using any causal mask, then each token will encapsulate information about all the other tokens in the sequence 03:09:47.800 |
Which is what we do with vision encoders like the image encoder we saw before the Sigleap one 03:09:52.280 |
Because the transformer is a sequence to sequence model, so let's open our ipad 03:09:59.400 |
Now because the transformer is a sequence to sequence model 03:10:05.960 |
So suppose that we want to train we train a language model on the following sentence. So it's always the same which is 03:10:19.400 |
Pardon my calligraphy I write very fast recently we feed it to this black box that we will call the transformer model 03:10:30.040 |
Each of these stuff here each of these uh tokens is actually an embedding 03:10:37.880 |
So we will get an as output a list of embeddings, but they will be contextualized 03:10:44.260 |
Contextualized one for the first token one for the second token. So this is the second embedding 03:10:49.140 |
This is the third embedding and this is the fourth embedding 03:10:52.260 |
I am again making the simplification that each word is a token and each token is a word 03:10:58.900 |
Well, we force the language model to predict the next token given the contextualized embedding 03:11:04.980 |
So this contextualized embedding here contains information only about the word I in case we are using the causal mask 03:11:14.580 |
This only contains information about the token I 03:11:17.060 |
This contains information about the token I but also the token love; this contains information about the tokens I, love, 03:11:24.820 |
Pepperoni; and this contains information about all the tokens: I love pepperoni pizza 03:11:36.820 |
What labels do we use when training a language model 03:11:39.460 |
Well, in this case, we want the language model, given the prompt, to predict what is the next token 03:11:45.460 |
So given only I, the language model should predict the word love 03:11:53.860 |
Given the prompt I love, the language model should predict the token pepperoni 03:12:04.580 |
Given the token the prompt I love pepperoni the language model should predict pizza 03:12:09.720 |
And given all the sentence it should say end of sentence so it means hey i'm done with the generation 03:12:17.080 |
Now this is how we train a language model. How do we actually inference a language model is the same way 03:12:27.220 |
so suppose that the user only gives us one token as a prompt the word I 03:12:32.340 |
And suppose that our language model has been trained on the sentence before so I love pepperoni pizza 03:12:37.220 |
How can we generate the entire sentence? Well, we feed this single token to our black box, which is our transformer 03:12:44.420 |
So now I will write it reversed because I don't have space above 03:12:50.980 |
The transformer will generate it's a sequence to sequence model, which means that it takes as input one embedding 03:12:56.960 |
Corresponding to our prompt token I and it will generate one contextualized embedding 03:13:02.420 |
So it will be one embedding. What do we do with language models? We project this single embedding into logits 03:13:13.360 |
We take the output of the transformer, which is this stuff here 03:13:18.640 |
To generate logits for this token. So let's go back here 03:13:38.400 |
These logits tell us what is the score assigned by the language model to each token in the vocabulary 03:13:45.200 |
So how likely that particular token is to be the next one. To convert it into a probability score 03:13:51.600 |
So something that sums up to one, we use the softmax. So suppose that we have already applied the softmax 03:14:04.880 |
It is still a single vector, like the logits, but the difference is that now the values all sum up to one 03:14:10.960 |
Which one do we select? The one with the highest number; usually this is called the greedy strategy 03:14:16.240 |
There is another strategy called top-p, which means that we sample from the tokens with the top scores 03:14:23.920 |
Whose cumulative probability is up to, say, 90 percent. So suppose that there are three tokens here 03:14:28.240 |
Okay, actually top-p we will see later when we implement the inference. For now 03:14:31.760 |
Just think that we are always sampling the one with the highest probability score. So we use the greedy strategy 03:14:40.560 |
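As a tiny illustrative sketch of the greedy strategy (nothing PaliGemma-specific, just toy numbers):

    import torch

    logits = torch.tensor([2.0, 0.5, -1.0])   # one score per vocabulary token
    probs = torch.softmax(logits, dim=-1)     # now they sum up to one
    next_token = torch.argmax(probs, dim=-1)  # greedy: pick the highest probability
    print(next_token.item())                  # 0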
What will happen is that probably the model if it has been trained well, it will tell us that the next token is very likely the token 03:14:48.080 |
Love so this is how we know. What is the next token? 03:14:51.840 |
How do we generate then the next next token? We take this token love 03:14:56.400 |
This token love and we put it back into the input of the language model 03:15:02.320 |
So now we feed a new input to the language model. Let's remove this stuff 03:15:10.560 |
Now we are feeding two tokens to the language model 03:15:13.280 |
Language model is our transformer model. So it's a sequence to sequence model 03:15:17.520 |
It means that it takes as input two tokens. It will output two tokens 03:15:21.200 |
So it's taking as input two embeddings. I am drawing here the text 03:15:25.920 |
But actually you need to consider that these are two embeddings of these two tokens 03:15:30.160 |
So we feed two embeddings. It will output two embeddings 03:15:45.040 |
One corresponds to the token I, so the first position; one corresponds to the second position, which is the token love 03:15:51.040 |
Because this is a contextualized embedding, it will include information about both I and love 03:15:58.000 |
Now before what we did was to project this output embedding into logits here 03:16:03.920 |
We have two embeddings which one should we project into logits? Of course. It's the second one. Why? 03:16:13.120 |
This embedding includes information about the two tokens, so it's like we are using the entire prompt. So what we do is we project it 03:16:27.520 |
It will become logits. So let's actually write logits 03:16:31.300 |
Then we apply this thing called softmax, which will convert these logits into a probability distribution 03:16:42.080 |
Using I love as the prompt. Then we sample from the softmax. Which one? The one with the highest score 03:16:48.240 |
We take the one with the highest score as the next token, so if the language model has been trained well, it will be the token pepperoni 03:17:03.680 |
Now, what do we do? How do we generate the next next next token? We take this word pepperoni 03:17:08.720 |
We feed it back into the language model and we ask again the language model. Hey generate the next token 03:17:28.160 |
We are feeding three tokens to the language model which are converted into three embeddings then are fed to the transformer 03:17:33.600 |
The transformer will output three output embeddings 03:17:39.280 |
Now, without writing it all out, the first position will correspond to a contextualized embedding that only includes information about the token I 03:17:50.560 |
The second contextualized embedding will include information about I and love, and the third contextualized embedding will include information about I, love and pepperoni 03:18:00.540 |
Which one do we project into logits? Of course the third one, because it's the one that encapsulates information about all the prompt 03:18:05.760 |
So we keep going this way and we generate one token at a time 03:18:11.360 |
Now, what is the problem here? The problem is that at every step of inference 03:18:15.280 |
We are generating a lot of embeddings. Suppose that the prompt is very large 03:18:20.320 |
A lot of embeddings that we are not using so we are creating them because the transformer is a sequence to sequence model 03:18:26.960 |
But then we are only projecting one single embedding to the logits and then to the softmax to understand what is the next token 03:18:33.760 |
And as you know, the transformer model uses this thing called attention mechanism and the attention mechanism generates this matrix 03:18:40.800 |
That is a sequence by sequence, which is the attention scores matrix that we saw before 03:18:44.560 |
which means that when you have a thousand tokens 03:18:48.960 |
It will generate a matrix that is one thousand by one thousand, so it's one million numbers in that matrix 03:18:54.240 |
So it's a huge matrix and then you only need to use a part of this matrix that will generate this embedding here 03:19:00.480 |
So is there a way to not generate the embeddings that we are not going to project into logits? 03:19:06.160 |
But only generate the one that we only need to generate the next token 03:19:10.320 |
Yes, and it's possible through what is known as the KV cache, and the trick is here. So now let's open this other slide 03:19:18.000 |
The trick is this one. So when we calculate the 03:19:21.200 |
attention matrix, so the query multiplied by the transpose of the keys divided by the square root of d 03:19:27.360 |
Model, or d_head in case we have multi-head attention 03:19:31.040 |
What we are getting is the following. Suppose that we want to generate the word pizza by using the prompt I love pepperoni: we feed the three 03:19:44.620 |
Embeddings, so I, love and pepperoni, to the transformer. The transformer will convert them into query, key and values using the projection matrices 03:19:58.700 |
It will convert them into query key and values and now then we use the query key and values to calculate this 03:20:04.940 |
Matrix here. So the query multiplied by the transpose of the keys, which is this matrix here 03:20:10.860 |
Then we multiply this matrix by the v sequence and it will give us the output of the 03:20:19.240 |
Attention, which is contextualized embedding you can see here and we saw also before that when we multiply by v 03:20:24.460 |
We are doing what is known as a weighted sum using these weights as weights in this weighted sum 03:20:34.620 |
So the input of the model is I love pepperoni and the output that we are getting is a three contextualized 03:20:39.440 |
Embeddings so the embedding corresponding to only to the word I the embedding corresponding to the word 03:20:45.020 |
I love and the embedding corresponding to the I love pepperoni 03:20:47.760 |
We know that we only need this one here because this is the only one that we need to project into logits 03:20:53.980 |
And then to generate the next token. So is there a way to not compute these two things here that we will not be using? 03:21:05.420 |
Yes. This last contextualized embedding here is the result of the multiplication of this matrix by this matrix 03:21:12.700 |
But not all of this matrix by the v sequence, only the last row of this matrix by the v sequence, because 03:21:25.500 |
This number here comes from the result of the dot product of this row here 03:21:34.220 |
So this number here comes from the dot product of 03:21:38.700 |
The last row of this matrix with the first column of this matrix; the second number in this output 03:21:44.540 |
Vector comes from the dot product of the last row of this matrix with the second column of this matrix 03:21:55.500 |
Dot product of the last row of this matrix with the third column of this matrix, etc, etc for all the 128 dimensions 03:22:02.720 |
So what we need to generate only this one is the last row of this matrix, but all the v sequence 03:22:15.500 |
Because the attention matrix as we saw before we can consider the rows 03:22:20.460 |
To be the queries and the columns to be the keys. To have only this last row here, we need only the last token as query 03:22:30.060 |
But all the previous tokens, including itself, as keys, and we need also all the tokens as values 03:22:37.740 |
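You can convince yourself of this with a tiny numeric check (purely illustrative, random tensors): because the softmax is applied row by row, using only the last query row reproduces exactly the last output embedding of the full attention.

    import torch

    d = 8
    q = torch.randn(3, d)   # queries for "I love pepperoni"
    k = torch.randn(3, d)
    v = torch.randn(3, d)

    full = torch.softmax(q @ k.T / d**0.5, dim=-1) @ v       # all three output embeddings
    last = torch.softmax(q[-1:] @ k.T / d**0.5, dim=-1) @ v  # only the last query row
    print(torch.allclose(full[-1:], last, atol=1e-6))        # True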
That's why what we do is the following when we generate text with a language model 03:22:50.300 |
Let me draw in such a way that it's not confusing. So I think we can continue here. So 03:22:55.900 |
Imagine we start again the process of generation of text, but this time we do it with the kv cache 03:23:05.420 |
Top to bottom. Otherwise, it gets confusing because before I did top to bottom. So 03:23:10.140 |
Okay, we use only the token i as input to the language model 03:23:14.620 |
The language model will convert it into an embedding blah blah blah, then we feed it to the transformer 03:23:19.120 |
Suppose that it's only made up of one layer. Actually, it's a series of layers 03:23:26.140 |
Single token will be converted into query key and values. So it will be a sequence of tokens 03:23:34.620 |
So the q sequence will be one token. The k sequence will be one token. The v sequence will be one token 03:23:46.240 |
Which will calculate that matrix, so the query multiplied by the transpose of the keys, which will be a matrix that is one by one, because we only have one token 03:23:54.080 |
And then we multiply it by v so it will result in only one contextualized embedding as output 03:23:59.920 |
So it's this stuff here what we do we project it into logits 03:24:03.700 |
Which is another vector then we convert it into softmax which is another vector 03:24:20.720 |
The difference with the kv cache is that whenever we pass a token to the input of this self attention 03:24:28.580 |
We cache the key sequence and the v sequence into a buffer called the kv cache 03:24:40.240 |
That initially is empty. But after we pass the token I 03:24:43.760 |
It will contain the embedding. So the q embedding. Sorry the k embedding corresponding to the token I 03:24:50.960 |
And also this is the kv cache. So it is made up of the key cache and the v cache 03:24:59.040 |
Then we have the v cache which is initially empty 03:25:01.440 |
But after we send in the first token, we save this v sequence. It only contains one token. So we save it here 03:25:14.080 |
Using the query key and values. It will result in only one output embedding. We project it into logits 03:25:21.120 |
We project it into softmax. We sample. What is the next token? Very probably it will be the token love 03:25:30.560 |
What we did before was that we took this word love 03:25:33.920 |
Put it back inside of the prompt and then ask the language model again. What is the next token? 03:25:38.480 |
But with the kv cache we do something different 03:25:40.640 |
With the kv cache. We always take the previously generated token. So in this case is the token love 03:25:59.440 |
And we use this single token as input to the language model 03:26:03.520 |
Now what happens is that we feed the transform this single token love into its embedding which is an 03:26:10.720 |
Uncontextualized embedding we feed it to the first layer of the transformer as a query key and values for now 03:26:16.720 |
The query key and value contains only one token the token correspond the embedding corresponding to the token love 03:26:31.760 |
For the key for the keys and values we take this single token love we append it to this buffer called 03:26:39.200 |
Kv cache. So now it contains love here for the values. Also it contains love 03:26:45.120 |
And then we use this buffer as the key and value sequence in the self attention 03:26:50.640 |
So we take this token love we convert it into query key and value the query key and values are one single token 03:26:57.600 |
But the query the key and value we append them each of them into their respective buffer here 03:27:03.520 |
And then we use the content of this buffer to calculate the self attention 03:27:08.400 |
What happens is that we have only one query, but now we have two keys and two values 03:27:13.440 |
Which will result in exactly the calculation of this last row of this matrix 03:27:21.360 |
That is the last row that we are interested in, to predict only the next token and not generate all the other contextualized embeddings 03:27:31.520 |
For now it's only two tokens, but later, with the third token, we will see it will be exactly the last row of that matrix 03:27:39.360 |
The output of this self attention because we have one query two keys and two values 03:27:43.680 |
I can guarantee mathematically it will be one single embedding you can verify by yourself 03:27:48.800 |
But basically if you have one query as you saw before the self attention mechanism 03:27:52.820 |
Will generate a matrix that is a sequence by sequence 03:27:55.760 |
But in this case, the rows of this matrix are defined by how many queries you have. So we have only one 03:28:06.400 |
So it will be a matrix that is one by two and it will result in only one output embedding token when you multiply it by v 03:28:16.240 |
And we saw that before actually when we calculated the dimensions of the output embedding 03:28:20.800 |
We saw that it's only the last row that generates the last embeddings and this is exactly what we are doing here 03:28:26.320 |
Anyway, with the self attention calculated like this 03:28:30.240 |
So using as query the single token, but as keys and values the contents of the buffers of the KV cache 03:28:36.960 |
To calculate the self attention we result in only one output embedding 03:28:41.200 |
Which is exactly the contextualized embedding that we are interested in to generate the next token 03:28:46.160 |
We project it into logits, we apply the softmax, and it will result in the next token being pepperoni 03:28:53.500 |
Naively, what we did before was take this word pepperoni and feed it back into the prompt and then feed all the prompt to 03:29:00.240 |
The language model but with the kv cache it's different. So we use the last generated token pepperoni 03:29:10.960 |
We feed it to we convert it into a single embedding 03:29:15.140 |
So the query key and value here are one single token 03:29:20.080 |
But before computing the self attention, we put this key and value inside each of their buffers 03:29:27.520 |
So now the buffer for the k contains pepperoni as well 03:29:36.080 |
Then to calculate the self attention we don't use this key and v we use the content of the kv cache because it contains three tokens 03:29:43.360 |
So as query we use only one token, which is the word pepperoni 03:29:46.660 |
But as key and v we use the content of the kv cache. So it will result in a matrix that is 03:29:51.360 |
Exactly the last row that we saw here because it's exactly this one now because we have as a query 03:29:58.480 |
Only the word pepperoni, and as keys the tokens I, love and pepperoni 03:30:03.440 |
Which will result when multiplied with the v sequence, which is three tokens because we have also the v cache 03:30:08.640 |
Will result exactly in the computation of this output embedding here, which is only one single embedding 03:30:15.780 |
Which is exactly the one that we need to predict the next token, which will be 03:30:23.120 |
Etc., etc. So this is the KV cache. This KV cache basically allows us, during inference 03:30:30.640 |
So during token generation, to avoid generating all the embeddings 03:30:34.580 |
Of all the input sequence, but only generate the last 03:30:38.400 |
Contextualized embedding, which is exactly the one that we need to predict the next token 03:30:44.960 |
There is another thing that we need to know about the KV cache, which is the pre-filling. The pre-filling is basically this: we started here with 03:30:56.720 |
So we only use the word I but usually the prompt is a little longer. So it's not only one token from the user the user 03:31:04.960 |
Suppose that the user uses multiple tokens, so it uses the word I love 03:31:09.280 |
What we do is because we have already access to all the tokens of the prompt of the user 03:31:17.040 |
We are not generating them. We can pre-fill instantly using all of the prompt 03:31:23.520 |
All the kv cache corresponding to the prompt of the user so we can do instead of doing first adding I and then adding love 03:31:30.320 |
We add both of them in the same forward pass. How to do that? 03:31:34.480 |
We take we use both of them. We convert them into embeddings 03:31:38.080 |
So it will result in two embeddings. We feed it to the language model as query key and values 03:31:44.720 |
This will result in a q sequence of two tokens, a k sequence of two tokens and a v sequence of two tokens 03:31:52.960 |
We put the k and the v inside of their respective buffer called the k buffer and the v buffer which comprise the kv cache 03:32:10.560 |
So now we have two tokens for the query, two for the keys and two for the values, because the content of the KV cache contains two tokens 03:32:17.440 |
Which will result in a two by two matrix, so it will result in two output embeddings 03:32:23.460 |
And two output softmaxes. Which of the two output embeddings do we project into the logits and the softmax? Only the last one 03:32:32.640 |
Because we are we are not interested in predicting the word love. We are only interested in knowing what comes after love. So we only take the 03:32:40.800 |
Embedding corresponding to the position of the word love we project it into logits 03:32:47.460 |
And we project it into softmax to understand what is the next token 03:32:50.740 |
So only during this pre-filling phase we actually allow the generation of multiple output embeddings 03:32:57.960 |
And then we discard the one that we don't need 03:33:00.900 |
Why do we do it because we don't want to add one single token at a time because it will be too slow 03:33:06.180 |
If you have a lot of tokens, you just add them all at once in the kv cache 03:33:10.260 |
And then you use this kv cache which is pre-filled now to generate one token at a time 03:33:16.420 |
The reason we do it is because the gpu is very fast at parallelizing stuff 03:33:20.740 |
So it's very good at parallelizing computations 03:33:22.900 |
So actually by doing all of these computations inside of the gpu 03:33:26.740 |
Will result in a much less wall clock time instead of adding one token at a time 03:33:30.820 |
And this, guys, is the KV cache. So now we can finally code it 03:33:34.340 |
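Before following along with the actual implementation, here is a minimal sketch of what such a cache could look like: one key buffer and one value buffer per layer, and an update method that appends the new keys/values along the sequence dimension and returns the full buffers to be used in the attention. Class name, method names and shapes are assumptions for illustration.

    from typing import List, Tuple
    import torch

    class KVCache:
        def __init__(self) -> None:
            # One tensor per transformer layer, shape: (batch, num_kv_heads, seq_len, head_dim)
            self.key_cache: List[torch.Tensor] = []
            self.value_cache: List[torch.Tensor] = []

        def num_items(self) -> int:
            # How many tokens are currently stored (0 if we haven't pre-filled yet)
            return 0 if len(self.key_cache) == 0 else self.key_cache[0].shape[-2]

        def update(self, key_states: torch.Tensor, value_states: torch.Tensor,
                   layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
            if len(self.key_cache) <= layer_idx:
                # First time we see this layer: just store the keys/values (pre-filling)
                self.key_cache.append(key_states)
                self.value_cache.append(value_states)
            else:
                # Token generation: append the new token's keys/values along the sequence dimension
                self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
                self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
            # Return the full buffers, which are used as the K and V sequences in the attention
            return self.key_cache[layer_idx], self.value_cache[layer_idx]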
Okay, let's code the next part. So we copy this part here and all of this 03:33:48.100 |
We know that we have two parts to do when we work with the kv cache 03:33:51.700 |
One part is called pre-filling and one is token generation. During the pre-filling, we send all the prompt of the user 03:34:00.340 |
To the model, using it as query, key and value, and this will create the initial cache that will then be used by the subsequent phase 03:34:07.320 |
During token generation. So where we generate one token at a time 03:34:11.300 |
Why do we do these two phases? Because the prompt is already available to us 03:34:15.540 |
We don't want to add it one token at a time, while during token generation 03:34:19.300 |
We want to generate one token at a time because we don't have these tokens 03:34:28.500 |
when we are working with the pre-filling phase, we will have that the 03:34:32.980 |
Number of queries, keys and values will be the number of the tokens inside of the prompt. So we generate a mask that is sequence by sequence 03:34:42.180 |
Because it will be used in the attention mask. So let's visualize it actually 03:34:46.260 |
so suppose that we are doing the following so 03:34:50.900 |
This suppose that we receive a prompt that is I love pepperoni and we want to generate the next token, which is pizza 03:34:58.180 |
The attention calculation will result in the following attention score 03:35:02.660 |
So it's a matrix that is three by three in which we want to mask out some 03:35:07.840 |
Interactions between tokens; specifically, each query cannot attend to future keys 03:35:12.400 |
And the way we do that is we create an attention mask 03:35:16.560 |
Of the same size of the attention matrix as you can see so three by three. So sequence by sequence 03:35:24.400 |
Before we apply the softmax, we add this thing called a mask to this matrix 03:35:31.280 |
And this mask is made up of minus infinities for all the positions in which we don't want any interaction to happen 03:35:38.160 |
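In code, the idea looks roughly like this (a sketch with generic tensor names; q, k, v and mask are assumed to already exist):

    import math
    import torch

    # q, k, v: (batch, heads, seq, head_dim); mask: 0 where allowed, -inf where not allowed
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
    scores = scores + mask                    # -inf positions become ~0 after the softmax
    weights = torch.softmax(scores, dim=-1)
    output = weights @ v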
And this is what we are doing here. So at the beginning we create 03:35:41.920 |
We are inserting the prompt of the user, and you would expect we should mask out future tokens; however 03:35:51.680 |
We create a mask that is sequence by sequence 03:35:54.960 |
So this is during the pre-filling, so when the KV cache does not contain any item, it means that we are 03:36:01.040 |
Doing it for the first time, so we are pre-filling the prompt of the user 03:36:04.160 |
Now, we are not adding any minus infinity value to this attention mask during the pre-filling. Why? 03:36:12.560 |
To understand that, we need to understand how PaliGemma attends to the 03:36:17.120 |
Image tokens and to the prompt of the user. So for that, let's open the page of the paper 03:36:28.540 |
So a prompt in PaliGemma is made up of the image tokens, which are 256 in the case of the smallest PaliGemma 03:36:37.760 |
Then we have the prompt of the user which is a beginning of sentence token plus the prompt of the user 03:36:43.180 |
So for example, the prompt of the user may say extract where the photographer is in this picture 03:36:48.060 |
And then we have a separator token, which is the new line token we saw before 03:36:53.420 |
As you can see the attention mask here is not masking out anything for the part that corresponds to the 03:37:00.300 |
Prompt, because the prompt of the user is made up of the 03:37:04.220 |
Textual prompt plus the image, and we don't mask out anything. Why? Because, and it's quite interesting 03:37:11.420 |
And it's different from what we usually do with language models 03:37:17.900 |
We don't mask out anything because each text token that we will generate needs to access all the image tokens 03:37:25.020 |
So it will be conditioned on all the image tokens. That's why it's called conditional generation 03:37:29.120 |
And that's fine, because we saw that each image embedding encodes information not only about itself 03:37:37.740 |
But also about all the other embeddings, and we want each text token to watch all the image tokens, and that's fine 03:37:49.740 |
So as you can see the first token of the prompt, which is this one 03:37:53.500 |
so suppose that the prompt is two tokens, for example, I love and 03:37:56.940 |
We want to generate the word pepperoni and pizza, which should be the first output token and the second output token you can see here 03:38:05.180 |
Why are we not applying any causal mask to the tokens of the textual prompt? 03:38:14.780 |
Because the textual prompt is usually very short 03:38:17.420 |
And it usually describes what is the task that we want the vision language model to perform 03:38:31.180 |
This prompt represents the task that we want the language model to perform 03:38:34.380 |
We want all the tokens that will be generated to watch all of the tokens of the prompt 03:38:41.480 |
Moreover, we want each token in the prompt to watch even future tokens of the prompt itself 03:38:55.160 |
When we will do prefilling what we will have is the following so we will have 03:39:00.360 |
The prompts let's use a different color. So we will have all the tokens of the prompt which are the 03:39:06.440 |
Textual prompt which is the textual prompt that we will send to the model 03:39:14.280 |
And we do not need to generate any mask here because each 03:39:18.840 |
Text prompt can watch even future tokens of the text prompt because you can see that this is the keys 03:39:26.200 |
This is the query number one of the text prompt and this is the key number one of the text prompt 03:39:32.360 |
This is the key number two of the text prompt and as you can see the query number one of the text prompt 03:39:36.760 |
So this beginning of sentence token can attend to the key number two of the text tokens 03:39:45.000 |
This is a choice that the PaliGemma authors made. They said: okay, for the prefix 03:39:50.040 |
Because we are not generating this prefix, which is the prompt that we send to the model telling it what it needs to do with the image 03:39:57.960 |
We do not need to add any causality because we do not 03:40:02.840 |
Need the model to be causal with respect to this prefix because we are not going to generate it 03:40:07.960 |
however, the only thing that we are going to generate is this thing called suffix which are the 03:40:13.560 |
Output tokens predicted by the model using the prompt textual prompt and the image 03:40:20.920 |
So the first token output by the model needs to attend all the previous keys, which are the image token 03:40:27.480 |
So these three image tokens plus the four tokens of the text prompt 03:40:32.760 |
Then the next token predicted by the model should be able to access again all the image tokens 03:40:39.000 |
So the first three tokens, then the four tokens of the textual prompt, plus the last token generated 03:40:46.760 |
By the model. Then, when we generate the next next token, it will need to access the 03:40:53.560 |
First three image tokens then the next four text tokens of the prompt 03:40:58.280 |
And the two tokens predicted by the model before so it is causal only in the generated text not in the prefix part 03:41:07.240 |
Which is different from normal language models. In normal language models, when we prefill, even the prompt 03:41:20.840 |
Itself is prefilled using the causal mask, because the prompt is just 03:41:25.160 |
A part of what the model would generate if it would start with only the first token 03:41:30.440 |
But this is not the case in PaliGemma. It's a choice that the PaliGemma team made 03:41:35.240 |
So it's not like the language model has to work in this way or there is any advantage or disadvantage 03:41:40.700 |
The only advantage if we want to say is that the information about the prompt 03:41:45.880 |
Is replicated in each of these tokens because each of these tokens basically 03:41:50.440 |
Includes information also about future tokens that are part of the prompt and this happened when they train the model 03:41:56.120 |
so when you train the model also you don't mask out the 03:42:03.320 |
Textual prompt you only mask out what you expect the model to generate 03:42:09.340 |
Using the image token and the textual prompt. So to rehearse 03:42:15.160 |
Let's go back to this image. What is the text prompt? So when we inference a 03:42:20.920 |
Visual language model, we provide an image as condition and then we provide some 03:42:27.080 |
Text prompt which is a description of what we want the language model to do with this image 03:42:32.280 |
For example tell us where is the photographer in this picture? 03:42:34.920 |
And then the model will generate some tokens as outputs telling us where the photographer in this case is 03:42:41.880 |
and what we do when we train this language model is that 03:42:47.800 |
We do not mask the tokens of the textual prompt 03:42:51.560 |
So when we ask the language model what to do with this image 03:42:54.360 |
We do not mask out during training and also during inference, of course because the model needs to work in the same way 03:42:59.560 |
But we mask out only what we expect the model to generate 03:43:03.800 |
So the causality is only in the generated tokens and it's a choice that you make with the language model 03:43:09.000 |
It's not necessary that it has to work this way, because in normal language models 03:43:15.720 |
There is no such not-masking-out of the prompt, because usually the prompt itself 03:43:20.120 |
You can consider it as something generated by the model, even if it's not 03:43:23.480 |
So this is more of a philosophical question than a technical one 03:43:28.200 |
But the reason is that it's a choice made by the PaliGemma authors. Also, in visual language models 03:43:32.920 |
Like PaliGemma, the task, so the textual prompt, is usually very short 03:43:37.800 |
It tells the model what to do with the image that it's being fed 03:43:40.760 |
so for example localize where is the cat in this image or 03:43:43.480 |
Extract all the numbers or tell me where is the photographer in this image, etc, etc 03:43:50.200 |
And also, usually, the generated output of the model is very short 03:43:53.960 |
So, at least, models like PaliGemma are not used for generating very long 03:43:59.320 |
Content but they can be of course fine-tuned to do it 03:44:04.520 |
So, let me delete this part. Otherwise it remains here forever 03:44:11.320 |
All right, so now we have seen how we generate the 03:44:14.200 |
The mask for the pre-filling. So, for the pre-filling 03:44:18.360 |
We do not mask out anything because we do not mask out the text prompt and we do not mask out the image prompt 03:44:24.520 |
The interesting part is that when we generate the text, we generate one token at a time with the KV cache 03:44:38.440 |
Because, let's go back to the PaliGemma picture here. So here 03:44:43.640 |
When you generate the first token, the first token needs to access all the image tokens and the text tokens 03:44:50.360 |
So we don't need to mask out anything 03:44:52.840 |
When we generate the next token as you can see it needs to access all the image tokens and all the text tokens 03:44:59.320 |
Plus the last generated token here. So we do not need to mask out anything then again for the next next token 03:45:05.320 |
We need to access all the previous tokens plus the two previously generated tokens 03:45:09.960 |
So we do not need to mask out anything because we are generating one token at a time 03:45:13.800 |
So it needs to access all the previous tokens plus the image tokens plus the textual prompt 03:45:18.920 |
So we never need to mask out anything. So you may be wondering why are we never masking out anything? 03:45:25.000 |
Because we are working with the KV cache, and with the KV cache 03:45:27.480 |
We only generate one single row of this matrix at a time 03:45:33.320 |
We always generate the last row and the last row is always the last token that needs to access all the previous tokens 03:45:38.920 |
So we never need to mask out anything. However, during training 03:45:44.920 |
When you train on some sequence, then you need to mask out, because the model will generate all the 03:45:48.920 |
Contextualized embeddings in parallel and you want each contextualized embedding to only be contextualized on the previous tokens 03:45:54.600 |
So you need to mask out. So during training we will have a causal mask, but during inference, which is our case 03:46:00.200 |
We don't have any causal mask, at least when working with the KV cache and at least 03:46:04.040 |
When working with models like PaliGemma. If you work with a normal language model, like LLaMA 03:46:09.880 |
For example, when you do the pre-filling you actually need to mask out the pre-filling part 03:46:14.200 |
But in the case of PaliGemma, because of the choices made by the PaliGemma team, we do not need to mask out anything 03:46:19.640 |
And this is why we do not need to mask out anything 03:46:22.840 |
In the future I plan to make another video on how to fine-tune this model that we have made 03:46:27.720 |
And we will see that we will need to introduce some kind of mask 03:46:31.080 |
And the mask will have to be generated exactly like shown in the PaliGemma paper. So let me check if my camera is still working 03:46:39.080 |
Sometimes I lose connection with my cam. So I need to check every once in a while. So 03:46:46.920 |
We have created this mask, which is filled with zeros, because 03:46:49.480 |
We would need to fill minus infinities into all the positions where we want to mask out something 03:46:55.000 |
But we never mask out anything. So we always make this tensor full of zeros 03:46:58.920 |
when we are pre-filling we generate a sequence by sequence mask, but when we are 03:47:04.760 |
Generating tokens, we only generate the last row of that matrix. So we have only one 03:47:11.080 |
Query; as you can see, we assert that the query length is equal to one 03:47:13.800 |
So we only have one query, and then we have as many keys as there are in the KV cache 03:47:19.720 |
We add the plus one to this KV cache count because, before using the KV cache, we add the current token 03:47:25.480 |
So the query token inside of the KVCache then we extract it before calculating the self-attention like we saw before 03:47:31.000 |
As you know, when we do the attention computation, we have one attention computation for each head 03:47:37.720 |
So we need to add the head dimension because there will be one attention matrix for each head 03:47:42.120 |
And that's why we add this head dimension here 03:47:50.200 |
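A minimal sketch of this mask creation, under the assumption that kv_cache.num_items() returns how many tokens are already cached and that batch_size, q_len, dtype and device are available (all names are illustrative):

    import torch

    if kv_cache is None or kv_cache.num_items() == 0:
        # Pre-filling: the prompt attends to itself with no masking, so the mask is all zeros
        causal_mask = torch.zeros((batch_size, q_len, q_len), dtype=dtype, device=device)
    else:
        # Token generation: one query attends to all cached tokens plus itself
        kv_len = kv_cache.num_items() + q_len
        causal_mask = torch.zeros((batch_size, q_len, kv_len), dtype=dtype, device=device)

    # Add the head dimension: there will be one attention matrix per head
    causal_mask = causal_mask.unsqueeze(1)  # (batch, 1, q_len, kv_len)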
We need to generate the positions of the tokens that will be used by the rotary positional encodings 03:47:56.380 |
So when we are working with the pre-filling part of the KVCache 03:48:01.240 |
It means that we have n tokens that are part of the prompt of the user which are the image tokens plus the text tokens 03:48:07.720 |
Then we need to generate enough positions to apply the rotary positional encodings 03:48:13.480 |
How many of them do we need? As many as there are tokens in the prompt 03:48:20.360 |
Which is indicated also by the number of ones in the attention mask generated by the PaliGemma processor code 03:48:28.840 |
It will give you the input IDs and another tensor of the same size as the input IDs with all ones 03:48:35.300 |
Indicating that we do not mask out anything and if you count the number of ones it also gives you how many tokens there are 03:48:41.140 |
In the input IDs, so that's what we are doing here 03:48:43.380 |
We generate enough positions. So when we are doing the pre-filling suppose that the pre-filling is made up of 256 image tokens 03:48:52.660 |
And then three tokens of the textual prompt. So what we will this will generate basically 0, 1, 2, blah, blah, blah 03:49:05.520 |
A sequence like this. This sequence will be then used to understand which 03:49:09.920 |
Rotary positional encoding we need to apply to each token 03:49:15.760 |
When we do token generation, we only have one single query to which we need to apply the positional encoding 03:49:27.040 |
So this will generate only one single position, which is the position corresponding to the last token 03:49:37.360 |
So when we do token generation basically we have some tokens that are already saved in the KV cache 03:49:41.840 |
And then we have one new token, which is the last predicted token, which we use as a query 03:49:46.080 |
To understand what is the position of this token 03:49:49.120 |
We also pass the attention mask. In the case of the attention mask 03:49:52.640 |
It will indicate that it's all made up of ones. How many ones? Well 03:49:57.200 |
Based on how many tokens there are in the KV cache 03:50:00.000 |
Plus one because we also have the new token that we need to add to the KV cache before doing the self attention 03:50:05.040 |
So what we are doing here is the same: we are counting how many ones there are in the attention mask 03:50:15.120 |
And this is how we generate the position IDs 03:50:21.280 |
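A minimal sketch of this logic, assuming attention_mask is the all-ones tensor described above and kv_cache.num_items() tells us whether we are pre-filling (names are assumptions):

    if kv_cache is None or kv_cache.num_items() == 0:
        # Pre-filling: one position per prompt token (image tokens + text tokens), i.e. 0, 1, 2, ...
        position_ids = attention_mask.cumsum(-1) - 1              # (batch, seq_len)
    else:
        # Token generation: a single position, the one of the newly added query token
        position_ids = attention_mask.sum(-1, keepdim=True) - 1   # (batch, 1)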
And then we return this stuff here, so let me return this stuff 03:50:29.840 |
So what does this method do? This method basically takes as input the image features 03:50:34.800 |
It takes as input the input IDs and the input embeddings 03:50:38.240 |
What are the input embeddings? They include the embeddings of the image placeholders, which we will not use 03:50:45.280 |
And then the image features. Our goal is to put all the image features in the right places in these input embeddings, based on where 03:50:52.240 |
These image placeholder positions are 03:50:59.200 |
Here then we create the attention mask, which is basically just made up of zeros. But 03:51:04.720 |
Do not confuse the zeros in the attention mask 03:51:07.520 |
We are creating here with what we are probably commonly used to see in the attention mask 03:51:15.120 |
So usually you are probably used to seeing the attention mask as a bunch of ones and zeros, where the zero indicates which position 03:51:21.440 |
Should be masked and the one indicates which position should not be masked 03:51:25.600 |
These ones and zeros are then converted into a series of minus infinities and zeros before being added to the attention scores 03:51:36.000 |
Instead of creating ones and zeros which are then converted into minus infinities and zeros 03:51:41.200 |
We are already creating the mask that can be directly added to the attention scores 03:51:45.280 |
So we are creating a bunch of zeros, which basically means that 03:51:51.280 |
It's like you are not masking out anything 03:51:53.440 |
If you want to mask out something then you need to add some minus infinities in this mask, but we never add any minus infinities 03:52:02.240 |
And this is our method that combines the image features with the text tokens 03:52:07.680 |
Our next goal is to create the structure of the PaliGemma model 03:52:11.220 |
Actually, we can create this PaliGemma multimodal projector. Yeah 03:52:15.280 |
All right. So let's create this PaliGemma multimodal projector. Let me put away this stuff here 03:52:21.840 |
We just copy it. It's very simple. I just I don't even need to copy first the constructor and then 03:52:28.400 |
So the PaliGemma multimodal projector is just the linear layer that converts the size of the image features 03:52:34.620 |
Extracted from the vision encoder into the same size as the embedding size that is used by the language model 03:52:41.900 |
So it's just a linear layer that converts the hidden size of the vision model into the projection dimension 03:52:55.420 |
And this projection dimension, as you can see here, is equal to the hidden size of the language model 03:53:04.460 |
So it's basically resizing the embeddings so that they can be concatenated with the text tokens 03:53:11.020 |
Let's go back here. So as you can see, we are just applying this linear layer 03:53:15.980 |
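A minimal sketch of such a projector module (class and attribute names are assumptions for illustration):

    import torch.nn as nn

    class PaliGemmaMultiModalProjector(nn.Module):
        def __init__(self, config):
            super().__init__()
            # From the vision hidden size to the projection dimension (= text hidden size)
            self.linear = nn.Linear(config.vision_config.hidden_size,
                                    config.vision_config.projection_dim, bias=True)

        def forward(self, image_features):
            # (batch, num_patches, vision_hidden) -> (batch, num_patches, projection_dim)
            return self.linear(image_features)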
Our next step is to code the language model itself. So the language model, the Gemma language model, is a transformer model 03:53:25.900 |
So we create this Gemma for causal language modeling 03:53:30.860 |
Which takes the configuration of the Gemma model as input, and the Gemma model itself, which we will create later 03:53:36.060 |
Basically in HuggingFace, whenever you see something called something-for-causal-language-modeling 03:53:42.860 |
It is a transformer model plus a language modeling head, which is the linear layer that projects each embedding into logits 03:53:52.300 |
So this Gemma model is basically the transformer model, and then GemmaForCausalLM is the Gemma model plus a linear layer 03:53:59.820 |
That's why we are reusing this instance plus a linear layer. So the forward method will be very simple 03:54:11.020 |
Weight tying: so we saw before that weight tying basically means that we share the weights of the embedding 03:54:17.180 |
Layer with the logits layer. So this is what we are doing 03:54:20.380 |
So when we tie weights, we just copy from the embeddings to the language modeling head 03:54:25.420 |
Which is the linear layer that converts the embedding into logits 03:54:29.920 |
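As a rough sketch (attribute names like lm_head and embed_tokens are assumptions), weight tying can be as simple as sharing the same weight tensor:

    def tie_weights(self):
        # The output projection reuses the same weight matrix as the input embedding layer
        self.lm_head.weight = self.model.embed_tokens.weight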
Then we have the forward method, which is also very simple, because it will not do anything except 03:54:36.480 |
Sending the stuff to the language model and then applying this 03:54:40.400 |
Language modeling head, which is the linear layer that converts the embeddings into logits 03:54:50.000 |
So the attention mask the position IDs the input embeddings the kvcache we send it to this language model, which we will implement later 03:54:56.960 |
The output of this language model will be a series of embeddings, but we do not want embeddings. We want logits. So 03:55:04.880 |
We take the outputs. We take the hidden states from these outputs, which are the series of embeddings 03:55:10.560 |
We apply the language modeling head, so the linear layer, and we make sure the result is in floating point 03:55:19.920 |
And that's it: we return the logits, and if the user specified the KV cache, we also return the updated KV cache. That's it 03:55:27.680 |
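A minimal sketch of such a forward method, assuming the underlying transformer body returns a dict-like output with "hidden_states" and an optional "kv_cache" (all names are assumptions):

    def forward(self, attention_mask, position_ids, inputs_embeds, kv_cache=None):
        # Run the transformer body (implemented later); it returns contextualized embeddings
        outputs = self.model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            kv_cache=kv_cache,
        )
        hidden_states = outputs["hidden_states"]
        logits = self.lm_head(hidden_states).float()  # project embeddings into logits

        return_data = {"logits": logits}
        if kv_cache is not None:
            return_data["kv_cache"] = outputs["kv_cache"]
        return return_data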
Because here there is no logic; the logic will be in the Gemma model 03:55:32.320 |
Yeah, so let's go implement the Gemma model. All right 03:55:37.120 |
So what is a language model? A language model is an embedding layer plus a series of transformer layers 03:55:44.000 |
And then we have the language modeling head. The language modeling head is already implemented here in GemmaForCausalLM 03:55:50.160 |
So we just need to create the other part which is the embedding layer and the list of transformer layers 03:55:55.440 |
Let's do that. So we create first the constructor, with the 03:56:04.000 |
Information that it needs: the vocabulary size. Why do we need the vocabulary size? Because we need to create the embeddings; how many embeddings? 03:56:12.480 |
Depending on the number of tokens in our vocabulary; each embedding vector will be of size hidden_size 03:56:19.600 |
This one indicates the position of the padding token inside of the vocabulary 03:56:23.060 |
And basically I think the embedding layer takes it as input so that it does not update the gradient for that token 03:56:37.440 |
These here are called Gemma decoder layers, so they are the transformer layers; how many of them do we have? 03:56:45.440 |
Depending on this parameter num_hidden_layers. And then we have a final normalization, which is an RMS normalization, which I will describe later 03:56:52.880 |
What is it and why it's different from a layer normalization? 03:56:56.020 |
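A minimal sketch of this constructor (class and attribute names are illustrative assumptions; GemmaDecoderLayer and GemmaRMSNorm are built later):

    import torch.nn as nn

    class GemmaModel(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.config = config
            # Token embeddings; padding_idx tells the layer not to update gradients for the pad token
            self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size,
                                             padding_idx=config.pad_token_id)
            # The stack of transformer (decoder) layers
            self.layers = nn.ModuleList(
                [GemmaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
            )
            # Final RMS normalization before the language modeling head
            self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        def get_input_embeddings(self):
            return self.embed_tokens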
We need to implement this method here get_input_embeddings, which is used by the language modeling head. So as you can see we use it 03:57:07.760 |
We use it here to extract the initial embeddings 03:57:10.960 |
From the language model, which are then combined with the image features, as we saw before here, and then sent to the language model 03:57:16.800 |
So the language model here is receiving not the input IDs, but it's receiving the embeddings already 03:57:21.840 |
So the image embeddings plus the text embeddings 03:57:24.420 |
Which are the same embeddings that we will receive here in the forward method of the Gemma model 03:57:33.280 |
Which is also very simple because we do not implement much logic here 03:57:39.600 |
So we receive the attention_mask and the position_ids, which are the positions that we will use for each token 03:57:45.200 |
How to apply the positional encoding to each token 03:57:48.800 |
We didn't talk about the positional encodings yet, because we apply the rotary positional encodings in this case, which are applied differently 03:57:57.200 |
So they are not applied at the beginning like we saw before with SigLip or with the vanilla transformer 03:58:02.320 |
But they are applied just before calculating the attention 03:58:06.160 |
We have the input embeddings which we saw before are the image features plus the text tokens 03:58:11.520 |
And in case we have the KV cache, also the instance of the KV cache, which we didn't implement yet 03:58:20.960 |
Let's do it. So the first thing that it does is 03:58:24.560 |
Taking the input and applying some kind of normalization, for the same reason we apply 03:58:31.020 |
Normalization also to the input of the image features 03:58:33.500 |
We want the kind of the magnitude of the numbers to remain the same even if the number of dimensions increases 03:58:38.560 |
then this language model is made up of a series of layers of 03:58:43.660 |
Transformer layers. So what we do is the output of one layer becomes the input of the next one 03:58:56.860 |
So we take the decoder layer we send it the first hidden state which is the input of this forward after it's been normalized 03:59:04.160 |
We send the attention mask, we send the positional encodings and the KV cache, and it will return something which is 03:59:10.380 |
Contextualized embeddings which become the input of the next layer 03:59:14.860 |
So we replace basically these hidden states with the output of the first layer so that it becomes the input of the next layer 03:59:24.380 |
The output of the last layer we send it to a normalization 03:59:28.240 |
Layer which is the rms normalization, which we didn't see yet, but we will talk shortly 03:59:35.740 |
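To make this concrete, here is a minimal sketch of what this GemmaModel looks like. It follows the structure just described; the exact config field names (vocab_size, hidden_size, num_hidden_layers, rms_norm_eps, pad_token_id) are assumptions, and GemmaDecoderLayer and GemmaRMSNorm are the classes we implement later in the video.

```python
import torch
from torch import nn

class GemmaModel(nn.Module):
    # Minimal sketch: embedding layer + stack of decoder layers + final RMS norm.
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(
            config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
        )
        self.layers = nn.ModuleList(
            [GemmaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def get_input_embeddings(self):
        return self.embed_tokens

    def forward(self, attention_mask, position_ids, inputs_embeds, kv_cache=None):
        # Scale the embeddings so their magnitude does not depend on the hidden size.
        hidden_states = inputs_embeds * (inputs_embeds.shape[-1] ** 0.5)
        for decoder_layer in self.layers:
            # The output of one layer becomes the input of the next one.
            hidden_states = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
                kv_cache=kv_cache,
            )
        return self.norm(hidden_states)
```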
So I want to actually redraw what we are doing so far. So we have arrived 03:59:53.340 |
What we are doing basically is this so we have created the 03:59:56.620 |
Embeddings before we have merged them with the image tokens and the text tokens 04:00:01.420 |
We did not apply any positional encodings because we are doing the rotary positional encodings 04:00:06.800 |
Which are applied exactly when we calculate the attention 04:00:10.560 |
So if we were to draw the Gemma architecture, it would be like this. So we have the embeddings 04:00:22.620 |
Then, if I remember, there is some kind of normalization we are 04:00:25.040 |
Doing, but it's not a normalization layer; we are just normalizing the embeddings 04:00:31.420 |
So it's not a layer, actually, so we do not have to draw it 04:00:34.300 |
Then we have a series of layers and we have n of them 04:00:37.900 |
Each of these layers is made up of a normalization 04:00:55.660 |
I think I made it too small, so let's make it a bit bigger 04:01:03.180 |
Then we take the output of this one and send it to another normalization, which is again an RMS normalization 04:01:12.640 |
The output of this one is sent again to another block 04:01:22.060 |
Then the output of the last layer will be sent to again another normalization, which is the rms normalization 04:01:28.640 |
Then we send it to a linear layer for the logits 04:01:32.480 |
A linear layer, and let me shift it down, and then we have the softmax. So far 04:01:43.820 |
So far what we have made is basically we are now creating this structure here, but without coding the single block 04:01:52.620 |
And the forward method, which will run the embeddings through each of these layers one after another and apply the final normalization 04:02:04.380 |
And then it will be sent to this linear layer 04:02:09.900 |
By GemmaForCausalLM, because as you can see, GemmaForCausalLM will take the output of this model 04:02:18.540 |
So everything except the linear layer, and then it will apply this linear layer called the language modeling head, which will convert it into logits 04:02:25.420 |
And after we will apply the softmax, but that is for sampling 04:02:29.020 |
So now we need to create this decoder layer. So what is this decoder layer? 04:02:32.940 |
This decoder layer is this stuff here. We need to code the normalization. We need to code the attention mechanism 04:02:38.940 |
We need to code the feed-forward network and of course all the skip connections. So let's do it 04:02:43.580 |
All right. The first thing that we can implement actually very easily is the rms normalization. So let's explore it 04:02:54.620 |
What we are doing is that we are normalizing each value using some statistics collected from each item itself in the batch 04:03:05.500 |
Imagine it's a batch of pictures and the first picture is that of a cat. In layer normalization 04:03:09.740 |
What we are doing is, for each dimension of this vector 04:03:12.620 |
We calculate two statistics using this vector, which are the mean and the standard deviation 04:03:18.480 |
And then we normalize each value in this vector using these two statistics. How do we normalize? Well, we recenter it around zero 04:03:27.260 |
Here it's not written, but I can show you the formula here 04:03:30.300 |
You basically subtract the mean that you calculated and you divide it by the standard deviation 04:03:35.760 |
And the layer normalization actually works fine 04:03:39.980 |
But recently in most language models, we are seeing another kind of normalization that is known as root mean square normalization 04:03:47.120 |
Basically what we do with this normalization is that each of these features in this 04:03:57.740 |
We are normalizing it in such a way that it becomes like it's coming out from a distribution 04:04:03.120 |
Gaussian distribution with a center of zero and a variance of one 04:04:08.620 |
What they claim in the root mean square normalization paper is that the success of 04:04:17.260 |
Layer normalization is not due to its recentering invariance, but to its rescaling invariance 04:04:26.860 |
In actually reducing this internal covariate shift, which is the reason we use normalization. That is, the model does not need to see values 04:04:36.700 |
Centered around zero; it just needs to see the values mostly concentrated around whatever mean they are centered upon 04:04:45.420 |
So the values of this cat, for example, they do not need to be all around zero 04:04:51.900 |
They could be all around 500 or all around minus 100 as long as they are more or less around 04:04:58.300 |
500 or more or less around minus 100 all of them 04:05:02.060 |
That's the meaning of reducing the variance to one 04:05:06.140 |
So we want most of the values to be around whatever mean it is 04:05:12.700 |
This is the claim made by this paper, and it's actually verified, because most language models right now 04:05:18.060 |
Do not suffer from internal covariate shift: they can be trained successfully, very fast, just like the layer-normalized ones 04:05:25.420 |
But by using this root mean square normalization. Why is it advantageous 04:05:31.660 |
Compared to layer normalization? Because instead of computing two statistics, the mean and the variance 04:05:39.100 |
We only need to compute one statistic, which is this root mean square statistic 04:05:44.380 |
Why do we not compute just the standard deviation like we do with layer normalization? Because to compute the standard deviation you need the mean 04:05:53.260 |
But we do not want to compute the mean, because we do not want to recenter the values 04:05:58.620 |
And because we don't compute the mean, we cannot compute the standard deviation 04:06:05.900 |
So we replace the standard deviation with another statistic that allows us to 04:06:11.100 |
Reduce the variance, which is this root mean square statistic 04:06:14.640 |
Which is calculated as follows. So we take each item in this vector 04:06:19.660 |
So this item, this item, this item, this item, this item, this item 04:06:22.540 |
We square each of these items, we sum them all up, and we calculate 04:06:29.120 |
The mean of this summation, so we divide by n basically 04:06:33.020 |
Then we take the square root, and this gives us the root mean 04:06:38.540 |
Square statistic for this vector. Then we take each item and divide it by this statistic 04:06:44.380 |
Multiplied by a learnable parameter called gamma, which is one for each feature 04:06:50.380 |
So basically with root mean square normalization, we are obtaining the same 04:06:59.580 |
I mean, it solves the same problem of the internal covariate shift as layer normalization, but by computing one less statistic 04:07:07.340 |
So we compute less statistics. So it is faster basically 04:07:12.940 |
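Written out, for a vector $x$ with $n$ features, the statistic and the normalization just described are:

$$\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}, \qquad \hat{x}_i = \frac{x_i}{\mathrm{RMS}(x)} \cdot \gamma_i$$

where $\gamma_i$ is the learnable gain for feature $i$ and $\epsilon$ is a small constant for numerical stability, which is discussed below.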
Okay. Yeah, so let's implement it. Let me put away this stuff 04:07:17.740 |
All right, so now we copy this class we put it here 04:07:35.180 |
It's very simple. Okay. So what we are doing with rms normalization is that okay 04:07:39.740 |
We are creating a weight parameter, which is a 04:07:41.820 |
Set of parameters, one for each feature in the vector to which we apply this root mean square normalization. How many 04:07:48.700 |
Dimensions will this vector have? Well, the same as the tokens, because we are going to normalize tokens 04:07:55.820 |
So this dim will be the hidden dimension of our language model 04:08:00.300 |
We compute this root mean square statistic as follows. So we calculate the square of each item 04:08:08.220 |
So what we are calculating here is basically this term here. So let me 04:08:16.700 |
Then we do one over the square root of this, which is this rsqrt 04:08:21.340 |
So we are not doing the square root; we are actually calculating 04:08:25.500 |
One over the square root of whatever is the argument of rsqrt, so this stuff here 04:08:31.260 |
And instead of dividing each item, we are multiplying by one over the square root, which is exactly like dividing by it 04:08:41.820 |
Why do we have this term here, plus self.eps, in the argument of the square root? 04:08:51.900 |
Well, because this r sqrt is one over the square root of 04:08:58.780 |
But if the computation of this statistic produces a number that is very close to zero in this division 04:09:05.500 |
We are basically dividing by zero, which will make the output of this division, this number here, very big. So 04:09:12.780 |
To avoid this division by zero, we add to the denominator of this division a very small number called eps 04:09:22.080 |
As you can see, it's a very small number to avoid this division by zero 04:09:25.200 |
And it's the same parameter that we also pass in the layer normalization as you can see here 04:09:29.600 |
We pass this parameter, which is a very small number to avoid this division by zero 04:09:33.280 |
So the forward method is basically just doing this normalization and then we multiply each of this number by this gamma parameter 04:09:41.120 |
Which is a learnable parameter as you can see 04:09:43.920 |
Here, so we have here we have this gamma parameter 04:09:53.840 |
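Here is a minimal sketch of this RMS normalization module, roughly mirroring what was just described; the (1 + weight) gain with the weight initialized to zero is how this Gemma-style implementation parameterizes the learnable gamma.

```python
import torch
from torch import nn

class GemmaRMSNorm(nn.Module):
    # Minimal sketch of RMS normalization with a learnable per-feature gain.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # the learnable "gamma", one per feature

    def _norm(self, x):
        # x * 1/sqrt(mean(x^2) + eps): same as dividing by the RMS statistic.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float())
        # Gemma-style gain: (1 + weight), with weight initialized to zero.
        output = output * (1.0 + self.weight.float())
        return output.type_as(x)
```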
Now we can move to the next part, which is the coding of this decoder layers 04:10:01.920 |
Let me check the Gemma model so we can create the decoder layer. So let's copy some code 04:10:13.440 |
All right, so the decoder layer as we saw before it's this stuff here 04:10:17.680 |
So we need to create something that manages all these blocks here 04:10:22.400 |
So something that takes as input a list of embeddings, applies a normalization, then applies the self-attention 04:10:31.200 |
Then the output is sent to another normalization, then to a feedforward layer block, then again another skip connection, then produces some output 04:10:38.240 |
So we will just create this simple block, which has the same structure as the encoder layer that 04:10:44.080 |
We have created in SigLIP. So it's the equivalent of 04:10:46.560 |
This block here, the encoder layer. It will be doing the same job 04:10:54.640 |
So what we are doing is we are saving some stuff 04:10:57.520 |
So the hidden size of the model then we are creating the attention 04:11:00.800 |
Block, which we will code later the multi-layer perceptron, which is the feedforward network block 04:11:06.240 |
The first normalization and the second normalization because in the decoder block we have two normalizations 04:11:10.900 |
So as you can see here, we have one normalization here and one here 04:11:14.640 |
So the forward method is very similar to the one we have coded for SigLIP 04:11:23.440 |
Input to this layer the attention mask, which will be sent to the attention mechanism the position 04:11:28.800 |
Ids which also will be sent to the attention mechanism because we are using the rotary positional encodings 04:11:33.920 |
And the KV cache, which also will be sent to the attention mechanism 04:11:36.660 |
So let's actually let me just copy it and then I explain it because it's the same as the encoder 04:11:42.960 |
So we take the input we apply the first normalization to this input which is 04:11:53.460 |
This hidden state; we send it to the self-attention block along with the attention mask, the positional encodings and the KV cache 04:12:00.320 |
And this will produce an output which will be then summed up with the skip connection here, which is this stuff here 04:12:06.080 |
So we take the output which is hidden states plus this residual which we saved before to create the skip connection 04:12:11.840 |
then we create another skip connection and we send the output of the 04:12:19.660 |
Self-attention to the second normalization, which is this stuff here this normalization 04:12:25.060 |
The output of the normalization is sent to the multi-layer perceptron, which is this one here 04:12:30.880 |
And then we take the output of the multi-layer perceptron 04:12:33.600 |
Which is the feed forward network plus the skip connection that we saved before which is this residual stuff here 04:12:38.960 |
And that's this plus sign here and the output is then returned and this is the decoder layer 04:12:45.280 |
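As a reference, here is a minimal sketch of this decoder layer; GemmaAttention and GemmaMLP are the blocks we code next, and the exact signatures are assumptions.

```python
from torch import nn

class GemmaDecoderLayer(nn.Module):
    # Minimal sketch: pre-norm attention block + pre-norm MLP block, each with a skip connection.
    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.self_attn = GemmaAttention(config, layer_idx)  # implemented later
        self.mlp = GemmaMLP(config)                          # implemented later
        self.input_layernorm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(self, hidden_states, attention_mask, position_ids, kv_cache=None):
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states, _ = self.self_attn(hidden_states, attention_mask, position_ids, kv_cache)
        hidden_states = residual + hidden_states  # first skip connection

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states  # second skip connection
        return hidden_states
```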
Now we need to code the multi-layer perceptron and the self-attention 04:12:49.620 |
Block. I believe the fastest thing to do is the multi-layer perceptron, so let's do that first 04:13:00.160 |
It's also very similar to the multi-layer perceptron that we have already coded for SigLIP 04:13:08.240 |
So the multi-layer perceptron here, which is also known as feed-forward network, is basically, as we saw before in SigLIP 04:13:14.560 |
Made of two linear layers: the first expands the embedding 04:13:20.000 |
Vector applies some non-linearity and then reduces it back to the original size and this is what is done here 04:13:27.520 |
But in this case, we also have another linear layer called the gate projection 04:13:32.580 |
Which is used by the activation function that this Gemma language model is using 04:13:37.600 |
We saw that different language models have different activation functions, which is based mostly on heuristics on how they work 04:13:45.520 |
So let's implement the forward method, which is very simple here and we will see why we need this gate projection 04:13:53.360 |
I changed the code to convert this very long line into a 04:14:00.000 |
Series of steps, so that you can see each single step being done independently 04:14:04.980 |
but let me describe it what we are doing here basically is 04:14:08.480 |
First we are applying the gate projection to the input to this feed forward network, which is a list of embeddings as we saw before 04:14:17.920 |
And the function that we are using is the GeLU function, which I believe is the same one that we are also using for SigLIP 04:14:33.600 |
So basically it's adding some learnable parameters before sending it to this activation function 04:14:39.600 |
We multiply the output of this activation function with the up projection 04:14:45.120 |
The up projection is basically the one that takes the embedding size from the original embedding to the intermediate size 04:14:53.120 |
And then the result of this multiplication, which is a vector 04:14:57.920 |
Which is a tensor of size batch size sequence length and the intermediate size is then reduced back to the original size by this 04:15:04.960 |
Down projection because with the up projection you are expanding and the down projection you are putting it back to the original size 04:15:11.440 |
So the down projection will take the intermediate size back into the hidden size, and this is the multi-layer perceptron of Gemma 04:15:17.200 |
It's slightly different than the other one because we have this gate projection 04:15:23.360 |
And it's the same kind of gate projection that we also have, if I remember correctly, in LLaMA, in which we have a similar gated activation 04:15:29.520 |
With its own gate projection. It's just parameters that are learnable before applying the non-linearity 04:15:35.220 |
We also said that the non-linearity is chosen based on heuristic on how they work well in particular case 04:15:41.280 |
But also on some properties that we want from them with respect to the gradient. So some 04:15:46.160 |
Activation functions allow the gradient to flow for negative values. Some others don't allow it, etc, etc 04:15:52.640 |
So it's all based on practical application: someone tried using it, saw that it works better, and then we all start using it 04:16:00.560 |
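A minimal sketch of this gated feed-forward block, assuming the config exposes hidden_size and intermediate_size and that the activation is the tanh-approximated GeLU mentioned above:

```python
from torch import nn

class GemmaMLP(nn.Module):
    # Minimal sketch of the gated feed-forward network (gate, up and down projections).
    def __init__(self, config):
        super().__init__()
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x):
        # y = down( gelu(gate(x)) * up(x) )
        gate = nn.functional.gelu(self.gate_proj(x), approximate="tanh")
        up = self.up_proj(x)                 # expand to the intermediate size
        return self.down_proj(gate * up)     # bring it back to the hidden size
```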
Okay, now we also have the multi-layer perceptron now comes the biggest part 04:16:05.600 |
And but not the hardest because we are already familiar with the attention mechanism 04:16:09.280 |
So we need to code the attention mechanism, which will comprise the self-attention, the use of the KV cache 04:16:14.960 |
The grouped query attention which is something new and the rotary positional encoding. So it will be a little bit of learning experience. So let's start 04:16:22.400 |
All right. So let's start coding the next part, which is the Gemma attention. So we start by creating the class 04:16:33.120 |
And I will do it slowly because this one has a lot of innovations 04:16:36.820 |
So let's start by creating the constructor, which is our usual constructor 04:16:41.540 |
It takes in the configuration of Gemma. We also take another parameter, which is the index of the layer in the 04:16:50.000 |
Transformer, because as you know, Gemma is a decoder- 04:16:54.080 |
Only model; it's made up of many layers, and each of these layers will have its own KV cache 04:17:02.480 |
So to know which KV cache to use because there is one cache for each layer. We need to also pass the layer index 04:17:09.600 |
To each layer so it knows where to put its key and values 04:17:18.080 |
Then the attention dropout, which we will not use; the hidden size, which is the size of the embedding vector of each token 04:17:29.040 |
The number of heads for the queries; the head dimension, which is how many dimensions each head will watch 04:17:41.680 |
Which is a part of the entire embedding of each token 04:17:45.200 |
And how many heads we have for the keys and values in the multi-head attention 04:17:52.320 |
And this is different from those for the query because we are going to talk about grouped query attention 04:17:57.280 |
So we can calculate how many groups we have in this grouped query attention, but later I will explain how it works 04:18:02.000 |
The maximum positional embeddings which are how many positions we can encode in the positional encoding using the rotary positional encoding 04:18:10.400 |
And what is the base frequency of the rotary positional encodings? 04:18:16.640 |
So first of all, we make sure that the hidden size is divisible by the number of heads because as you know 04:18:22.880 |
Each head has to watch a part of the embedding of the entire token 04:18:26.560 |
So it must be divisible by the number of heads 04:18:28.560 |
Then we create our projections which are the wq wk and wv projections that we saw in the multi-head attention 04:18:36.960 |
But in this case, we can see that we do not have the hidden size as the number of output features 04:18:47.200 |
Instead, the number of output features is calculated as the number of heads multiplied by the head dimension 04:18:52.320 |
Now, why is this different from the multi-head attention that we have implemented for SigLIP? 04:18:57.440 |
So if we go to look at SigLIP and we look at the attention 04:19:01.840 |
You can see that each of these wq, wk and wv matrices is 04:19:06.640 |
Hidden size by hidden size here. It's called the embedding dimension, but okay, it's the same thing 04:19:11.200 |
So it's the size of the entire token with the output features being also the same number of dimensions 04:19:22.160 |
If we look at what numHeads is: numHeads is the number of heads for the query, and 04:19:26.960 |
The number of heads for the query in grouped query attention is 04:19:32.320 |
Bigger than the number of heads for the keys and values; later 04:19:39.280 |
We will see why, but for now, let's concentrate on the dimensions. So in this case this wq matrix 04:19:44.720 |
Which is called q_proj, and which is the wq 04:19:48.800 |
Matrix in the multi-head attention has an output a number of output features. So suppose that the number of heads 04:19:55.440 |
So number of heads is equal to 8 and suppose that the hidden size is equal to 1024 04:20:07.040 |
Then wq is 1024 by 8 multiplied by the head dimension. But what is the head dimension? The head dimension is how many 04:20:15.820 |
Dimensions each head will watch, using the number of heads of the query as a reference, so 1024 divided by 8 04:20:28.300 |
So it's 8 multiplied by 128, so actually the wq matrix is 1024 by 1024 04:20:35.440 |
What changes in grouped query attention are the wk and wv projections. wk will have 04:20:43.480 |
1024 as the input features, because that's the hidden size, and the output features will be the number of heads for the keys and values multiplied by the head dimension 04:20:54.040 |
In the configuration we can see that the number of heads for the 04:20:58.440 |
Queries is 8 and the number of heads for the key and values is only one 04:21:04.600 |
So actually this is not the case of grouped query attention; it's multi-query attention. So 04:21:09.240 |
Let's say, okay, suppose that we have only one head here. So one multiplied by 128, which is equal to 128 04:21:18.820 |
And the same size is also for wv because as you can see the expression in wv is the same 04:21:25.480 |
it's the number of heads for the key value multiplied by the head dimension and then we have the output projection, which is a 04:21:33.640 |
Hidden size by hidden size because the number of heads multiplied by the head dimension 04:21:37.480 |
So here the number of heads is 8, which always references the number of heads of the queries 04:21:45.720 |
So as you can see, the difference with grouped query attention is that we have fewer heads for the keys and values 04:21:57.320 |
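To make the shapes concrete, here is a small sketch of the four projections with the example numbers used above (hidden size 1024, 8 query heads, 1 key/value head, head dimension 128); these are the illustration numbers from the video, not necessarily Gemma's real configuration.

```python
from torch import nn

hidden_size, num_heads, num_kv_heads = 1024, 8, 1
head_dim = hidden_size // num_heads  # 128

q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)     # 1024 -> 1024
k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)  # 1024 -> 128
v_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)  # 1024 -> 128
o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)     # 1024 -> 1024
```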
So each token is projected into a smaller embedding when it's used as keys and values. Let's see why; let me open a new 04:22:04.120 |
Page and let's switch to the ipad which is here 04:22:10.520 |
In normal multi-head attention, what we have is that each token is divided into multiple groups of dimensions 04:22:17.400 |
One dedicated to each head. Suppose that we have an initial token 04:22:21.800 |
Let me use a pen and let's use a smaller size. So imagine that we have a token with 1024 04:22:32.260 |
Dimensions in total. If we divide it into eight heads 04:22:36.340 |
We will have that each head will manage 128 dimensions of this token, so dimensions 1 to 128 04:22:52.480 |
Etc., etc., until the last one, which will manage the last 04:23:06.820 |
128 dimensions. So this is head number eight 04:23:20.900 |
When we do the product query multiplied by the transpose of the keys each of the query is 04:23:29.360 |
Multiplied so dot product with each of the keys, but only in the part 04:23:34.800 |
Dedicated to each head because each head is working independently 04:23:38.500 |
So suppose that this is our query. So this is our query. Let me write it with a different color. So 04:23:49.860 |
And this key also in the normal multi head attention. We have the same number of heads for the query and the keys 04:23:57.220 |
So suppose that we have the same number of heads also here so we can copy this stuff, I guess 04:24:12.640 |
So what will happen with normal multi-head attention is that each head will do the dot product of its 04:24:22.200 |
Part of the token. The head number one, for example, will do the dot product of the first 04:24:27.500 |
128 dimensions of the query with each of the keys because you need to think that we don't have one key. We have multiple keys 04:24:35.160 |
Because it's a matrix. The matrix is a sequence by sequence. So each head each query is attending to all the past keys 04:24:49.680 |
So key number one key number two and key number three and this is the query number one and we do it for all the 04:24:54.640 |
Queries so for each token each token will attend all the past tokens as keys 04:25:02.400 |
So what will happen is that we are doing a dot product 04:25:06.320 |
With the first head will do a dot product of the first 04:25:09.560 |
128 dimensions between the query and the key then again between this query and this key and then between 04:25:16.520 |
This query and this key in parallel the head number two will do the same stuff 04:25:22.200 |
so the head number two will take the next group of 04:25:25.560 |
128 dimensions, so the dimensions from 129 to 256, and will do the dot product with the corresponding dimensions of the keys 04:25:37.380 |
So it will do the dot product of this query with this key and then this query with this key 04:25:53.080 |
Now what happens is that and we do it for all the heads 04:25:58.560 |
The problem with multi-head attention, and this was described in the multi-query attention paper 04:26:06.720 |
So if you want, I can give you the reference to the paper 04:26:20.840 |
Basically, Noam Shazeer described what the problem with multi-head attention is, at least from a computation point of view 04:26:31.320 |
The problem is not in the number of computations that we are doing; the bottleneck of the computation is the 04:26:40.480 |
Data transfer that is happening in the GPU because of this multi-head attention, and for that we need to talk about how a GPU works 04:26:52.600 |
So: a GPU has a very big memory called the high bandwidth memory 04:26:59.400 |
Which is in the order of gigabytes or tens of gigabytes. I think the 04:27:04.880 |
A100 goes up to 80 gigabytes. Then we have some smaller memory called local memory 04:27:14.360 |
And this one is in the order of megabytes; I don't know if it's ten megabytes 04:27:19.040 |
I think it's in the tens of megabytes, so it's orders of magnitude smaller 04:27:32.080 |
The cores are many and they all work in parallel all of these cores 04:27:36.600 |
So when you do a matrix multiplication, what happens is this 04:27:40.120 |
You have the matrix that you are trying to multiply in the high bandwidth memory 04:27:44.820 |
The kernel that manages this matrix multiplication 04:27:50.120 |
Which is a CUDA kernel in case you are using an Nvidia 04:27:53.240 |
GPU will copy for example the first part of the matrix from the high bandwidth memory to the local memory and 04:28:00.920 |
Each core will work with a part of this big matrix to compute this matrix multiplication in parallel 04:28:09.040 |
So each one is will be working with a smaller part of this matrix to calculate this this part in parallel 04:28:14.680 |
it's much easier to visualize with the summation because for example if you are summing two matrices like this matrix and this matrix and 04:28:21.880 |
You get this matrix as output. What happens if you divide it into four parts is that 04:28:28.920 |
The result of this part of the matrix only depends on these numbers and these numbers 04:28:33.620 |
So the first core can work with these two parts, the second core 04:28:37.960 |
Can work with these two parts 04:28:41.480 |
sum them up to produce this one the third core can work with these two parts and 04:28:47.840 |
Resulting in this part and then the last core can work on this part which will result in this part of the matrix 04:28:54.800 |
So as you can see, the matrix summation can be done in parallel by multiple cores, each working with a part of the matrix 04:29:01.120 |
So what happens when we do multi-head attention? 04:29:08.960 |
Because the heads are working in parallel, the first head needs to copy the first 04:29:16.560 |
128 dimensions of each query to the local memory of the GPU, which will then be 04:29:23.720 |
Accessed by the cores to compute these dot products 04:29:26.880 |
Meanwhile, the second head, at the same time, needs to copy the second 04:29:33.100 |
128 dimensions of each token to the local memory, and 04:29:38.680 |
Then needs to also copy, for each query, the second 04:29:42.640 |
128 dimensions from the high bandwidth memory to the local memory, so that the cores can work with them 04:29:49.880 |
Now what happens in the multi query attention paper. So this paper here what they say is that 04:29:55.680 |
The bottleneck of the computation of the attention is not in how many dot products we are doing 04:30:02.960 |
But how much it how much time it takes to copy the memory from the high bandwidth 04:30:08.200 |
Bandwidth memory to the local memory so that the cores can work with it 04:30:12.160 |
Why because in the GPU we have a lot of cores that are very fast at computing computation 04:30:18.240 |
But the GPU is not so fast at copying stuff around so the memory copying is very slow compared to how much 04:30:25.040 |
Computations it can perform. For example, let's open the 04:30:31.680 |
It's here you can see that the A100 has okay 80 gigabyte of memory in the high bandwidth memory 04:30:46.160 |
And a certain number of teraflops, operations per second, if you are working with 32-bit precision 04:30:50.080 |
But as you can see the GPU memory bandwidth is much slower than the number of operations it can do 04:30:57.060 |
Because tera floating-point operations per second means 04:31:05.640 |
Trillions of operations per second, so thousands of giga-operations per second, while the memory bandwidth here is only around two thousand gigabytes per second 04:31:18.640 |
So basically in in a lot of computations that we do in the GPU 04:31:22.320 |
The bottleneck is not how much compute we are using but how much data transfer is happening for this compute and as a matter of fact 04:31:29.560 |
Flash attention basically exploits this difference in computation and memory transfer 04:31:36.760 |
To reduce the memory transfer and redo computations, because it's faster to redo the computations than to copy the results around 04:31:49.480 |
So basically, we are willing to sacrifice computation 04:31:55.160 |
To reduce the data transfer. This is what we do with flash attention 04:31:59.400 |
This is also one of the reason we use the gradient checkpointing 04:32:02.800 |
So gradient checkpointing basically means that during the backward pass we redo some 04:32:06.720 |
computations instead of saving them because if we save them then we need to recopy them from the high bandwidth memory to the local 04:32:12.380 |
Memory, so it's faster to redo them instead of copying them the already processed one 04:32:21.180 |
So the wall-clock time, which means the total time to compute the attention 04:32:26.080 |
Actually is bottlenecked not by the number of dot products that we are doing but how much data transfer happens 04:32:31.800 |
So how to reduce the data transfer that we do when we do the multi head attention 04:32:41.280 |
So what will happen is this: imagine we stop 04:32:50.240 |
Having multiple heads also for the keys and values, so we don't have this part anymore 04:32:54.400 |
We only have multiple heads for the 04:33:02.480 |
Queries; we only have multi-head for the queries 04:33:06.840 |
So we don't have multi-head for the keys, or we have fewer heads for the keys 04:33:11.000 |
Imagine that we are in the extreme case, in which we only have one head for the keys and values 04:33:16.080 |
But we have multi-head for the queries. What will happen is that the first core will copy the first 04:33:21.080 |
128 dimensions of the queries from the high bandwidth memory to the local memory, and also the single head of the keys 04:33:31.720 |
And it will perform the computation. Meanwhile, the second head also needs to do its computation, in parallel 04:33:39.200 |
So how can it do it? It needs to copy the next 128 dimensions of the query 04:33:50.740 |
But it does not need to copy the dimensions of the keys, because it can reuse the ones already copied for the keys 04:33:59.440 |
So the heads of the queries are sharing some heads for the keys, so that they don't need to copy 04:34:06.080 |
These dimensions again for different heads, but can share the already copied ones 04:34:12.480 |
So this is the extreme case of having only one head for the keys, but we can have a group of heads 04:34:22.560 |
Instead, suppose we have eight heads for the query and then we have four heads for the keys 04:34:27.740 |
so the head number one and two for example for the query will share this head here and 04:34:33.720 |
Then the head number three and four will share this head here 04:34:38.720 |
So the head number one and two of the query will share this head here, so that the total amount of transfer for the keys is reduced 04:34:47.080 |
Then the head number three and the head number four will share a different 04:34:53.720 |
Head of the keys. But it is shared: as you can see, every two heads of the query we are sharing one head of the keys 04:35:07.080 |
128 dimensions in total for both of these heads 04:35:10.440 |
This reduces data transfer which speeds up the computation of the attention 04:35:15.120 |
And this is the reason we have here in the computation of the attention the projection for the WK and WV 04:35:23.000 |
Have fewer parameters, because we are trying to compress these tokens; the number of output features is 04:35:32.720 |
Proportional to the number of heads that we need for this projection 04:35:37.360 |
So for the keys, for example, if we have only two heads for the keys 04:35:49.960 |
Then every four heads of the query will share one head for the keys 04:35:54.300 |
Imagine we have four heads for the keys and values; then this number here will be four 04:35:59.000 |
And what will happen is that every two heads of the query will be using one key/value head, and this output size will become 512 04:36:06.640 |
Every two head of the query will share one head of the keys. So the total data transfer is reduced 04:36:13.240 |
So we speed up the computation of the attention 04:36:15.920 |
Of course, you may be wondering but this should also reduce the quality of the model because we have less parameters 04:36:22.120 |
We have less expressive power for the keys and values and it's true 04:36:26.040 |
So if you look at the paper, they say that in the multi query attention 04:36:30.300 |
It reduces the quality of the model, but not much so it's something that we can afford to lose 04:36:39.080 |
Let's check the grouped query attention paper, which is this one 04:36:44.560 |
So in the multi query attention, you have one head for the keys and values 04:36:50.120 |
Which is shared for all the heads of the queries in the group query attention 04:36:54.480 |
We have a group of heads for the queries sharing one head of the key 04:37:01.280 |
So when you have multi-query attention, you have only one head here for the keys and values, shared by all the query heads 04:37:07.440 |
When you have grouped query attention, you have multiple heads 04:37:11.160 |
Of the queries sharing one head of the keys and values 04:37:19.260 |
They show that multi-query attention, which is only using one head for the keys and values, reduces the quality of the model quite a bit 04:37:24.720 |
a good compromise is between the full multi head attention and multi query attention is the group query attention which reduces 04:37:32.700 |
Slightly less the quality of the model, but still gives you this computational advantage of reducing the quantity of data transfer 04:37:42.440 |
Another advantage of grouped query attention is that you reduce the size of the KV cache, because as you remember 04:37:42.440 |
We have one KV cache for each layer, and in each of them we store every past token 04:37:47.480 |
So if we compress these tokens, the total amount of memory required for the KV cache is reduced, and 04:38:01.120 |
Actually, the KV cache is also one of the bottlenecks in today's language models 04:38:06.280 |
So we have these big language models that are like 70 billion parameters or whatever 04:38:12.940 |
But the problem in using them is often not even the GPU memory required just for storing the model 04:38:20.700 |
But the memory for storing this big KV cache, because you have to store each single token in each of the layers of the model 04:38:26.920 |
Which actually grows very fast if you have a lot of tokens 04:38:29.960 |
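As a rough back-of-the-envelope sketch of why fewer key/value heads shrink the KV cache: the model sizes below are made-up, illustrative numbers (not PaliGemma's actual configuration), and this counts a single batch element.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    # 2x because we store both keys and values for every layer.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

# Hypothetical 32-layer model, 8192-token context, head_dim = 128, fp16 values:
full_mha = kv_cache_bytes(32, 8192, num_kv_heads=32, head_dim=128)  # ~4.3 GB
gqa      = kv_cache_bytes(32, 8192, num_kv_heads=8,  head_dim=128)  # ~1.1 GB
```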
Okay. Now that we have seen how the group query attention works, we can proceed further 04:38:40.400 |
So the next part that we need is this beautiful thing called the rotary positional 04:38:45.560 |
Encodings that I will not explain right now. We I will explain them after 04:38:52.380 |
for now, we just consider them as a black box that adds some information encodes the information of 04:38:58.300 |
Position in the tokens and later we will see how it works 04:39:01.700 |
Let's implement the forward method. So the forward method is this one 04:39:07.140 |
So basically it takes the hidden states, which are the input to this attention layer 04:39:12.140 |
Which, inside the decoder layer, is the output of the first normalization 04:39:19.180 |
Then we have the attention mask, and the positions that we need to apply to each token, because we need to apply the positional 04:39:25.300 |
Encodings; and then the KV cache, in case we are using it. And now we will implement it 04:39:30.020 |
So the computation of the attention is the same as before 04:39:38.180 |
The first thing we do is we extract the batch size and 04:39:45.120 |
The length of the input sequence, because, as you remember, when we do token generation 04:39:50.220 |
During the prefilling the query length will be the whole input prompt 04:39:54.900 |
But then, during token generation, the query will only be one single token, because we want to 04:40:00.060 |
Generate only the last part of the attention matrix, so the last row; so we need only one query 04:40:05.580 |
But how can we have all the keys to attend to? Because we have something called the KV cache, which will store all the keys 04:40:11.580 |
So what we are computing here is the same as before 04:40:15.860 |
So we are converting the input sequence into query key and values and then we are splitting this 04:40:22.300 |
Embeddings into groups of dimensions based on how many heads we have for the query key and values 04:40:31.020 |
For the query, we will split it into numHeads number of groups 04:40:35.100 |
Each number or each group will have headDim number of dimensions and for the keys and values 04:40:41.420 |
We will have numKeyValueHeads number of groups and each group will have headDim number of dimensions to manage 04:40:48.620 |
Then we do this transposition so I can show you again. What does this transposition do? So let's do it 04:41:00.980 |
So the first part that we are doing here big up to the transposition is this one 04:41:06.740 |
So we are multiplying the input sequence with WQWK and WV and splitting these 04:41:15.780 |
So that each embedding is a group is a list of groups where each group is managing some dimensions 04:41:23.780 |
So now what we end up is basically a sequence of what? 04:41:28.100 |
Tokens, where each token is made up of groups, and each group is managing, for example, 128 dimensions 04:41:35.420 |
Then we use this transposition because we want to have at the first dimension the heads dimension 04:41:45.260 |
So instead of having a sequence of tokens where each token has groups of dimensions 04:41:51.300 |
We want a list of groups where each group is a head 04:41:55.420 |
Each head has some tokens how many equal to the sequence length and each token is a mini token 04:42:03.840 |
Which is the dimensions dedicated to that specific head. So the head number one will have 04:42:09.460 |
128 dimensions, the head number two will have the next group of 04:42:14.420 |
128 dimensions, etc., until the last one, which will have the last group of 128 dimensions 04:42:20.580 |
This allows us to compute the multi-head attention for this 04:42:26.180 |
Sequence, this sequence, this sequence and this sequence, all in parallel 04:42:31.620 |
Okay, and this is the meaning of this transposition 04:42:37.520 |
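A small sketch of what this view-and-transpose does to the shapes, reusing the example head counts from before:

```python
import torch

batch, seq_len, hidden_size = 1, 4, 1024
num_heads, num_kv_heads, head_dim = 8, 1, 128

query_states = torch.randn(batch, seq_len, num_heads * head_dim)
key_states = torch.randn(batch, seq_len, num_kv_heads * head_dim)

# [batch, seq_len, num_heads * head_dim] -> [batch, num_heads, seq_len, head_dim]
query_states = query_states.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
# [batch, seq_len, num_kv_heads * head_dim] -> [batch, num_kv_heads, seq_len, head_dim]
key_states = key_states.view(batch, seq_len, num_kv_heads, head_dim).transpose(1, 2)

print(query_states.shape)  # torch.Size([1, 8, 4, 128])
print(key_states.shape)    # torch.Size([1, 1, 4, 128])
```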
Transpose the next thing that we do is we apply the rotary positional encodings and now 04:42:44.020 |
We didn't talk about the rotary positional encodings and we will talk about later 04:42:48.540 |
But for now, you need to think that we are not changing the shape of these keys and queries and values 04:42:57.100 |
modifying them by adding some information that 04:43:00.540 |
Encodes their position and it will be done by this method called apply rotary positional embedding 04:43:10.060 |
just think that in the query and the keys we have encoded some information which will be leveraged by the attention mechanism to 04:43:18.020 |
Relate tokens to each other differently based on their position basically, but we will see that later. So 04:43:24.100 |
Suppose that we have already encoded the positional 04:43:27.200 |
Information. So now we need to as you remember when we do work with the KV cache 04:43:32.460 |
we pass only one single token as input to the layers of the 04:43:38.620 |
Transformer and this single token is added to the KV cache in the keys and the values cache of this 04:43:47.020 |
Particular layer. Then we retrieve the content of this KV cache, which includes the newly added token and all the previously saved tokens 04:43:56.940 |
And we use the output of this KV cache to calculate the attention. So let's implement this KV cache 04:44:02.980 |
so it's very simple because it's only one method to implement which basically will just take the 04:44:08.500 |
Single token that we are sending in which is this key states will add it to the key cache 04:44:13.940 |
will take this value states which is one single token add it to the value cache and then retrieve all the content of the cache as 04:44:20.860 |
Output so all the past token it has seen plus the current one 04:44:25.060 |
So let's implement it and we go to the beginning of the file 04:44:38.740 |
So we create the constructor. As you can see, it is a kind of buffer that includes one buffer for each layer of 04:44:44.940 |
The model, one for the keys and one for the values 04:44:48.980 |
We also have this helper method that tells us how many items the KV cache currently stores 04:44:56.780 |
So if this KV cache does not contain any item we say zero if it contains something then we return 04:45:03.060 |
The number of items it stores. As you remember, when we add something to the KV cache we are adding 04:45:10.100 |
This tensor here, which is the key value states and value states which are tensors of this shape 04:45:17.700 |
So batch size and number of heads sequence length and head dimension 04:45:21.540 |
Which means that the sequence length is the second last dimension. So that's why 04:45:27.700 |
We return the second last dimensions to retrieve the sequence lengths currently stored in the KV cache 04:45:33.060 |
We then implement the update method, which is also very simple, and I added some comments 04:45:41.900 |
So basically this will add the content of these key states and value states to the KV cache of this layer 04:45:49.620 |
And then it will return whatever is stored for this layer 04:45:53.820 |
So if we have never added anything to the KV cache of this layer, then we create it: we basically append these tensors 04:46:00.900 |
Because we have nothing else to concatenate them with 04:46:04.660 |
Otherwise, if we already have some tokens in the key cache and the value cache of this particular layer 04:46:11.540 |
Then we concatenate whatever is already present with the newly incoming token. Along which dimension? Along the sequence dimension, and the sequence dimension 04:46:19.620 |
We saw before is the dimension -2. That's why we concatenate them along the dimension -2 04:46:24.960 |
so after concatenating them we retrieve all the content of the 04:46:29.340 |
K and V cache and return it for the current layer and this is what is happening here 04:46:35.340 |
Here so we add this incoming key values and key states and value states to the KV cache 04:46:43.420 |
Then we retrieve them and we use them to compute the attention 04:46:46.900 |
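Putting it together, a minimal sketch of this KVCache class as just described:

```python
from typing import List, Tuple
import torch

class KVCache:
    # Minimal sketch: one key buffer and one value buffer per decoder layer.
    def __init__(self):
        self.key_cache: List[torch.Tensor] = []
        self.value_cache: List[torch.Tensor] = []

    def num_items(self) -> int:
        if len(self.key_cache) == 0:
            return 0
        # Shape is [batch, num_kv_heads, seq_len, head_dim], so seq_len is dimension -2.
        return self.key_cache[0].shape[-2]

    def update(self, key_states, value_states, layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        if len(self.key_cache) <= layer_idx:
            # First time we see this layer: just store the incoming tensors.
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:
            # Otherwise concatenate along the sequence dimension (-2).
            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```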
Now you need to remember that when we do use the KV cache 04:46:50.700 |
There are two phases when working with the model with the KV cache 04:46:54.700 |
There is one part called the prefilling, in which we have the prompt; the prompt in our case will be the image tokens plus 04:47:00.640 |
The user prompt, so what the user wants the model to do with this image 04:47:05.920 |
It will be a list of tokens. So this key states and this value states will be a list of tokens 04:47:12.220 |
So they will be all added to the cache for the first time because initially the cache will be empty and will be retrieved here 04:47:18.380 |
When we do token generation, we use the last token output by the model and 04:47:26.660 |
But we always retrieve all the content of the KV cache to compute the attention, because each query needs to attend to all the past tokens 04:47:35.620 |
It needs to attend all the past keys which are then used to compute the weighted sum using the values 04:47:42.760 |
Okay, what is the next part of the computation of the attention? Here 04:47:55.400 |
Now we need this method called repeat_kv, which basically will repeat the heads 04:48:03.560 |
Of the keys and values that are missing to match the heads of the query 04:48:11.880 |
Um, okay, let me explain it with the iPad because it's much easier to draw than to explain by words. So let's go here 04:48:23.980 |
Okay. So what happens with this repeat method is that we have the projection 04:48:30.160 |
Through WK and WV of the token that results in a smaller token 04:48:36.680 |
Which gives us some benefit from the KV cache point of view for example 04:48:40.500 |
But to compute the attention, the heads need to be shared 04:48:45.360 |
Each query head needs to share a key head with other query heads 04:48:53.760 |
The first two heads of the query need to share one head for the keys 04:48:57.920 |
Then the second two heads of the query need to share one head for the keys 04:49:05.360 |
We repeat this because we are working with the naive implementation of the attention, which does not really 04:49:12.040 |
Benefit from this optimization. So what we do is basically just repeat the missing heads, as 04:49:18.940 |
You can see here. So we take the heads that are missing and we just repeat them to match the heads of the query 04:49:31.580 |
So that it's like each query head has its own head also for the keys 04:49:37.480 |
This is because actually we are not creating a custom CUDA kernel for the computation of the attention 04:49:43.240 |
So we repeat it and we just pretend like the grouped query attention never happened 04:49:50.760 |
If you use a flash attention flash attention actually leverages the reduced number of heads of the keys and values to optimize the computation 04:50:00.560 |
So basically we are kind of reversing the effect of grouped query attention when calculating the attention because we don't have this 04:50:07.440 |
Custom CUDA kernel that can leverage this by not copying the missing heads 04:50:16.680 |
So let's implement this method as well: it will just repeat the heads that are missing for the keys and values 04:50:16.680 |
As you can see if we have a tensor and we know that this tensor has the following shape 04:50:32.920 |
So the batch the number of heads the sequence length and the head dimension 04:50:36.840 |
If we only need to repeat it once then we just return it because we don't have to repeat anything 04:50:41.720 |
Otherwise, we introduce a new dimension, which is how many times we want to repeat this number of heads, and then 04:50:49.180 |
We do this reshaping, which will basically repeat this number of heads that many times 04:50:57.040 |
Actually, the repetition is done by the expand method here. So we introduce a new dimension here 04:51:02.640 |
Which is the number of repetitions, and then we expand it. This expansion basically repeats 04:51:09.440 |
This content here for each of the n_rep heads 04:51:15.540 |
So basically we are repeating whatever comes after these two dimensions this number of times 04:51:22.680 |
and then we remove this helper dimension that we have created the nrep dimension that we only created to repeat the number of heads and 04:51:30.680 |
How do we do it? We must multiply the number of repetitions that we need with the number of key value heads 04:51:37.320 |
So at the output of this method the number of heads that you will have is the same as the number of heads of the query 04:51:45.920 |
So now these key states and value states will have the same number of heads as the query 04:51:51.400 |
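A minimal sketch of this repeat_kv helper, following the unsqueeze / expand / reshape steps just described:

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # [batch, num_kv_heads, seq_len, head_dim] -> [batch, num_kv_heads * n_rep, seq_len, head_dim]
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Add a repetition axis, expand it, then fold it into the heads dimension.
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)
```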
So now we can just compute the attention like we have always been doing so by doing the query 04:51:55.640 |
Multiplied by the transpose of the keys, divided by the square root of the head dimension, etc., etc. 04:52:05.880 |
So we compute the attention weights just with the standard formula: query multiplied by the transpose of the keys, divided by the square root of 04:52:12.520 |
The head dimension, which is the number of dimensions each head works with 04:52:25.360 |
Then we add the attention mask, which in our case will always be made of zeros, because we don't have any padding 04:52:30.440 |
so we don't need to mask anything and also during the prefilling we don't mask anything because 04:52:34.360 |
We always let the prompt, so the user text prompt, also attend to future tokens. Why? Because the PaliGemma authors 04:52:46.800 |
Decided that the user prompt, or the task prompt, does not need to be causal, because anyway 04:52:53.480 |
It will never be generated by the model; it will always be provided by the user 04:52:58.680 |
So we apply the softmax and then the dropout, but the dropout we never use, so this stuff here is very simple 04:53:08.480 |
The softmax is applied row by row; then we apply the dropout, but the dropout is always zero, and, as you know 04:53:13.380 |
The dropout is only applied during training, so just ignore it, like it's not there 04:53:17.960 |
Then the output of the multi head attention is multiplied by the value states 04:53:24.160 |
So these attention weights are multiplied by the value matrix, which will result in 04:53:35.920 |
an aggregation of previous tokens based on the 04:53:40.440 |
Score defined in the attention matrix. So if you want to visualize it again, I can show it to you again. So let's go here 04:53:47.640 |
When we do the multiplication with the V which is here 04:53:56.240 |
Let's say this one here is a contextualized token and that will include information about three tokens. I love pepperoni and 04:54:03.640 |
It will be a weighted sum of these three tokens 04:54:08.240 |
So I love pepperoni based on the following weights 04:54:11.440 |
So basically the token I will contribute to 20% of information the token love will contribute to 40% of information 04:54:18.640 |
The token pepperoni will contribute 40% of information and the last token will not contribute any information because it has been masked out 04:54:25.880 |
So this is what happens when you multiply the V that you are doing a weighted sum using the attention weights as weights 04:54:35.840 |
Then what else we need to do we need to check okay the output shape and that's fine I can do that so 04:54:51.360 |
So we transpose back to have again the sequence length as the second dimension then the num heads as the third dimension 04:55:01.000 |
We concatenate all the heads together, just like we saw before, so now each token is back to the hidden size 04:55:08.400 |
Dimension, where this hidden size is the concatenation of the output of each head 04:55:13.740 |
But if you just concatenate the output of these heads, then each embedding will just be an 04:55:21.800 |
Independent calculation of each head concatenated together 04:55:25.640 |
So we need some kind of mixing mechanism and this mixing mechanism is given by WO which will mix all these 04:55:32.880 |
Dimensions with each other, so that the result of each head is mixed with the others through this WO projection 04:55:42.400 |
This way, the resulting token from this multi-head attention is not just a concatenation of multiple independent heads 04:55:49.520 |
But something that is also mixing the results of these independent heads 04:55:54.600 |
And then we return the result of this multi-head attention 04:55:59.240 |
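A short sketch of this last step, reusing the example shapes from the earlier sketches (the tensor and projection names are assumptions):

```python
import torch
from torch import nn

batch, seq_len, num_heads, head_dim = 1, 4, 8, 128
o_proj = nn.Linear(num_heads * head_dim, num_heads * head_dim, bias=False)  # the WO projection

attn_output = torch.randn(batch, num_heads, seq_len, head_dim)
attn_output = attn_output.transpose(1, 2).contiguous()                # [batch, seq_len, num_heads, head_dim]
attn_output = attn_output.view(batch, seq_len, num_heads * head_dim)  # concatenate the heads
attn_output = o_proj(attn_output)                                     # WO mixes the independent head outputs together
```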
Now one thing that we have considered as a black box so far is the rotary positional encoding 04:56:06.800 |
we are encoding somehow the positional encodings in these queries and keys and then the 04:56:13.040 |
Multi head attention will leverage it now. It's time to expand on that and understand how it works. So let's do it 04:56:20.320 |
All right. So let's talk about positional encoding guys 04:56:23.800 |
so traditionally we are used to work with the 04:56:27.720 |
Positional encodings applied directly at the entrance of the transformer, which means that we take some embeddings 04:56:34.400 |
So we have our tokens, which indicate the position of each token in the vocabulary 04:56:40.180 |
We convert them into embeddings using the embedding layer, which is this stuff here, and then we add some 04:56:50.020 |
Vectors to these embeddings that encode the position information of each token, because otherwise the model has no 04:56:56.200 |
Notion of position. As you saw before, each head just does a dot product of two tokens, and 04:57:04.320 |
If the position information is not encoded in these two tokens that the dot product can only access the embeddings 04:57:10.840 |
So it does not have any notion of which token comes first and which comes later 04:57:16.180 |
So to encode this information, we basically traditionally we are used to add a positional encoding here to the embeddings of each 04:57:24.080 |
Token and so that the embeddings basically encode the information of the position in the original transformer paper. They proposed this 04:57:31.540 |
sinusoidal positional encodings which are also known as absolute positional encodings because they encode the absolute position in the 04:57:39.240 |
Inside each token. So the token number one will have some dimensions some vector that will encode the position number one 04:57:45.980 |
The token number five in the sentence will have the position number five added to it, etc, etc 04:57:51.060 |
What we use in most language models nowadays is the rotary positional encodings 04:57:57.580 |
Which are in the family of the relative positional encodings and they work as follows. So let's open the paper 04:58:03.420 |
They were introduced in this paper called "RoFormer: Enhanced Transformer with Rotary Position Embedding" 04:58:15.820 |
The idea of rotary positional encodings is that we do not add them directly to the embedding of each token 04:58:22.420 |
so that each token encodes the information of its position, but they 04:58:26.060 |
modify the attention mechanism in such a way that the attention mechanism takes into 04:58:32.100 |
Consideration the position of the tokens to relate them differently based on their position. Let's see how they did 04:58:42.580 |
We have this multi-head attention mechanism that uses the dot product to relate tokens to each other. Can we find an 04:58:52.780 |
Encoding of the embedding vectors of tokens such that 04:58:58.080 |
When we do the dot product, which is an inner product. So this sign here means the inner product 04:59:03.980 |
So can we find an encoding for the token called FQ for the query and FK for the keys? 04:59:11.380 |
that encodes the position information inside the embedding XM for the query and 04:59:17.940 |
XN for the keys such that when we do the dot product 04:59:24.580 |
this dot product, the output of this dot product 04:59:27.140 |
Only depends on the embedding of the first token the embedding of the second token and the relative distance between them 04:59:35.120 |
So that's why they are called relative positional encodings because they depend the dot product is modified 04:59:40.660 |
so the attention mechanism is modified such that the dot product should depend only on the 04:59:46.660 |
Embedding of the first token on the embedding of the second token and the relative distance between them 04:59:56.120 |
information inside of our embedding such that this dot product will depend only on the embedding of the first 05:00:03.740 |
embedding of the second and the relative distance 05:00:12.740 |
They proposed the following solution for the 2D case. So imagine we have an embedding vector made up of only two dimensions 05:00:20.740 |
How to encode the information of the position in this two-dimensional vector as follows 05:00:34.260 |
Rotation matrix. So if you have ever worked with the rotation matrix like when you do rotation of a vector in 2D space 05:00:41.720 |
you basically multiply the vector by this matrix here where the 05:00:45.640 |
Argument of the cosine and the sine is a multiple of an angle that defines by how much you want to rotate this vector 05:00:58.380 |
Multiply the two dimensions of this vector by this matrix here 05:01:03.180 |
Which is we will see what is it and then this matrix here, which is a rotation matrix 05:01:08.700 |
Then basically we are rotating this vector by some angle defined by this 05:01:18.100 |
This will encode the information so the output of this operation 05:01:23.660 |
So the output of this operation will be a 2D vector which will encode the information of the position 05:01:32.460 |
Such that when we do the dot product of two vectors encoded like this, this dot product is guaranteed to be 05:01:40.740 |
To be a function of the embedding of the first vector, embedding of the second vector and the relative distance that was 05:01:52.980 |
The difference of the distance that was encoded into them 05:01:56.160 |
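To make the 2D case concrete, here is a minimal sketch (values are purely illustrative) that rotates a toy query and key by their position angles and checks that the dot product stays the same whenever the relative distance is the same:

```python
import math
import torch

def rot2d(angle: float) -> torch.Tensor:
    # Standard 2D rotation matrix for the given angle
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta = 0.5                      # base angle (hypothetical value)
q = torch.tensor([1.2, -0.7])    # toy 2D query embedding
k = torch.tensor([0.3, 0.9])     # toy 2D key embedding

# Rotate the query by m*theta and the key by n*theta, then take the dot product.
# Shifting both positions by the same offset leaves the result unchanged,
# i.e. the score depends only on the relative distance m - n.
for m, n in [(2, 5), (12, 15), (102, 105)]:   # all pairs have m - n = -3
    score = torch.dot(rot2d(m * theta) @ q, rot2d(n * theta) @ k)
    print(m, n, round(score.item(), 6))        # same value for all three pairs
```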
But usually when we have an embedding we do not have a 2D vector, 05:02:04.540 |
we have a multi-dimensional vector, maybe 1000 dimensions or 2000 dimensions. 05:02:09.940 |
So they take the 2D case to the general case, and in the general case they say: okay, 05:02:21.820 |
instead of using this 2D rotation matrix, we need to have this big rotation matrix here for a 05:02:27.900 |
d-dimensional vector. So here is the d-dimensional vector. 05:02:31.980 |
If you look at this matrix here, as you can see it is a sparse matrix, 05:02:38.820 |
which means that it is mostly made up of zeros and only some elements are non-zero. 05:02:44.580 |
So if we encode the information using this transformation here, by using this matrix here, 05:02:50.860 |
we will be doing a computation that will result in the following property being verified, 05:02:56.380 |
which is that when we do the dot product, this dot product will only depend on the 05:03:01.140 |
embedding of the first token, the embedding of the second token and the relative distance of the two positions that were encoded into them. 05:03:09.980 |
But we will be doing a lot of unnecessary computations, because a lot of zeros will be 05:03:14.780 |
multiplied by other elements, which will result in zero. So we are doing a lot of useless work: 05:03:23.940 |
if most of the elements are zeros and only some of them are non-zero, 05:03:29.780 |
that means that you are doing a lot of computations uselessly, 05:03:32.620 |
because you already know in advance that they are zeros. 05:03:37.220 |
So is there a better way to compute this encoding mechanism to reduce these unnecessary 05:03:44.860 |
computations, knowing already that most of the elements are zeros and also where those zeros should be? 05:03:50.660 |
Well, yes, it is possible, and they propose another formulation, 05:03:59.540 |
which basically says that if you want to encode the position information inside your tensor, inside your embedding, 05:04:08.900 |
you take your d-dimensional vector, where d can be 1000, 2000, 05:04:14.940 |
whatever it is — suppose in our case it's 1024 — 05:04:18.180 |
and you multiply it element-wise (so this is an element-wise multiplication) by another vector constructed as follows, 05:04:26.580 |
where the first element is a cosine of m theta 1 and the second element is again cosine of m theta 1, etc., 05:04:33.460 |
where m is the position that you want to encode in this vector, and the theta 1, theta 2, ... are 05:04:40.540 |
computed using the following formula. So they show it: 05:04:47.500 |
theta i is equal to 10,000 to the power of minus 2i 05:04:52.020 |
divided by d, where i goes from 0 to d divided by 2, if I remember correctly. 05:04:57.620 |
They show it here. Yeah, in the paper i goes from 1 to d divided by 2. 05:05:07.500 |
So basically what we are doing is we are multiplying each dimension of this vector by a cosine, 05:05:13.140 |
where the argument of the cosine is a multiple of a base theta 05:05:19.380 |
multiplied by the position of the token that we want to encode into this token, plus 05:05:26.340 |
the dimensions of this vector, but rotated and with changed signs, 05:05:32.940 |
multiplied element-wise with the sine of the same arguments that we use for the cosine. 05:05:42.540 |
And when you do the dot product of two vectors encoded like this, 05:05:47.060 |
what will happen is that the dot product is guaranteed — 05:05:50.980 |
the number that comes out of this dot product — 05:05:54.460 |
to depend only on the embedding of the first vector, 05:05:58.340 |
so the information that was encoded before adding the positional encoding, the embedding of the second vector, 05:06:04.360 |
so the information that was encoded in that vector before adding the positional encoding, and the relative distance. 05:06:12.980 |
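Written out, this efficient element-wise form (reconstructed here from the spoken description; the exact notation may differ slightly from the paper) encodes a d-dimensional vector x at position m as:

```latex
R^{d}_{\Theta,m}\, x =
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix}
\odot
\begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix}
+
\begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix}
\odot
\begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix},
\qquad \theta_i = 10000^{-2(i-1)/d},\; i \in \{1, \dots, d/2\}
```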
Moreover, the rotary positional encodings also have a 05:06:17.260 |
decaying effect based on the distance between two tokens, 05:06:21.260 |
which means that the dot product — as we know, the dot product is converted into a score by the 05:06:27.980 |
softmax, so it tells us how intense the relationship between two tokens is, 05:06:33.020 |
so the bigger the dot product, the more that token will contribute to the output; 05:06:41.140 |
each of the attention scores tells us how much information that token will contribute to the output contextualized embedding — 05:06:48.900 |
so with the rotary positional encoding, this dot product is modified in such a way 05:06:56.500 |
that the dot product will be high when two tokens are close, and as they move apart, 05:07:03.740 |
so as the distance between the two tokens for which we are doing the dot product grows, 05:07:08.940 |
the dot product will decay, it will decrease in magnitude. 05:07:13.820 |
So the output number will be smaller and smaller based on the relative distance between the two tokens, 05:07:19.740 |
and in the paper they give an upper bound on this value based on the relative distance between two tokens. 05:07:26.380 |
So, to recap: to encode the positional information of a token using 05:07:32.500 |
rotary positional encodings, we need to do the following computation, where we take the vector of the token, 05:07:39.380 |
we multiply it by a special vector of cosines constructed like this, plus another 05:07:45.380 |
vector of the token itself, but with its dimensions changed in position. 05:07:51.260 |
So first we create a special vector where we put first the second dimension of the vector, but with its sign changed, then the first 05:08:00.820 |
dimension, then the fourth dimension with its sign changed, then the third dimension, etc. 05:08:06.860 |
And then we multiply it by a sine vector constructed as follows, using the theta values 05:08:13.940 |
calculated according to this formula here, this one here. And 05:08:19.820 |
each of these sines and cosines is basically 05:08:24.180 |
working with an argument that is a multiple of this base theta multiplied by the position that we want to encode into this token. 05:08:33.460 |
And if you want to visualize it, in the rotary positional encoding paper 05:08:38.940 |
they also explain what the meaning of this rotary positional encoding is. 05:08:42.860 |
Basically, as you can see from this matrix here, 05:08:46.580 |
each pair of dimensions is being rotated by the same angle. 05:08:50.300 |
So basically we have a token that is made up of many dimensions, 05:08:55.780 |
and each pair of dimensions is getting rotated like a 2D vector. 05:09:00.500 |
So each two dimensions are considered like a two-dimensional vector, 05:09:05.340 |
which is getting rotated by an angle that is a multiple of the base angle, 05:09:10.460 |
a multiple with respect to the position that you want to encode. 05:09:15.020 |
And this is the meaning of the rotary positional encoding. So the rotary positional encodings, to recap again, 05:09:21.820 |
modify the attention mechanism in such a way that the attention score that is generated is dependent on the 05:09:29.580 |
relative distance between two tokens, and they also prove in the paper that this attention score 05:09:34.940 |
decays as the distance between the tokens grows. 05:09:37.940 |
Okay, now that we have seen how it works, let's code it. 05:09:43.940 |
And actually, in the code that we are going to write you will see that 05:09:46.660 |
I am going to use the HuggingFace implementation of the rotary positional encodings. 05:09:51.240 |
We will see that the rotary positional encoding implemented in the HuggingFace library is slightly different from the paper, 05:10:04.580 |
but according to the authors it results in the same computation. 05:10:12.020 |
I will also share the blog post in which they explain why they do it this way. 05:10:17.060 |
So there is a slight difference, but the idea is the same: 05:10:20.340 |
it will result in a slightly different calculation, but the effect is the same. So let's do it. 05:10:24.980 |
All right, let's implement this rotary positional encoding 05:10:28.180 |
So the first thing we need to create is this Gemma rotary positional embedding class. 05:10:33.060 |
We can add it, I think, here; it's the same, no problem. 05:10:41.460 |
Okay, so then we are passing some parameters: dim is the head dimension, because the 05:10:48.820 |
rotary positional encodings modify the attention mechanism, 05:10:51.720 |
and the attention mechanism is performed independently for each attention head, 05:10:56.420 |
so each head will have its own positional encoding applied to the tokens. 05:11:01.540 |
So this dim is set to the head dimension, so the number of dimensions managed by each head in the multi-head attention. 05:11:09.060 |
Then we have the max positional embeddings, which tells us 05:11:11.700 |
what is the maximum number of positions we can encode; 05:11:17.540 |
it's set to 8000 in the Gemma configuration here — it's initialized to 2000, but it will actually be overwritten. 05:11:23.480 |
And then we have the base parameter theta, which is set to 10000, as in the original paper. 05:11:41.220 |
Here, as you can see, it's 10000 to the power of minus 2i divided by d. So this stuff here. 05:11:49.540 |
Then we have this inverse frequency. This inverse frequency is just the formula you can see here, 05:11:55.380 |
so 10000 to the power of minus 2i divided by d, where i goes from... it's written here. 05:12:08.580 |
And so the formula we are using to calculate it is actually, I think, this one here: 05:12:13.940 |
10000 to the power of minus 2i divided by d. 05:12:24.660 |
It's raised to the power of minus something, and when you have a negative power, it means one over the same thing with the positive power. 05:12:38.880 |
Let me write it. When you have x to the power of minus 3, 05:12:45.440 |
it means 1 over x to the power of 3. So that's why you have 1 over. 05:12:54.160 |
And what is this something that we are raising 10000 to? 05:12:57.840 |
It's a list of numbers that goes from 0 to the dimension divided by 2, which is the i 05:13:05.280 |
divided by d, where d is the number of dimensions 05:13:09.760 |
of the vector to which we will apply the rotary positional encoding, which according to this formula here goes up to 05:13:20.880 |
d divided by 2, and d is the number of dimensions of the vector to which we apply the rotary positional encodings; in our case 05:13:26.800 |
it's equal to the head dimension, because each head will have its positional encodings applied to it. 05:13:32.480 |
We use this arange to generate a list of numbers from 0 to 05:13:37.200 |
d divided by 2; basically it goes from 0 to dim, skipping every 2. 05:13:43.040 |
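As a quick sketch (variable names and values are illustrative), the inverse frequencies described here can be precomputed like this:

```python
import torch

dim, base = 256, 10000.0   # head dimension and base theta (illustrative values)
# theta_i = base^(-2i/d) for i = 0, 2, 4, ..., dim - 2, i.e. one value per pair of dimensions
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
# inv_freq has dim // 2 entries
```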
What else do we need to do here? I believe we need to... let me check. 05:13:51.120 |
Okay, so now we can implement the forward method of this class. 05:13:58.880 |
So to calculate the rotary positional encodings — let me go back to the paper and then explain the 05:14:06.720 |
forward method. To apply the rotary positional encodings we need the vector itself, 05:14:14.800 |
and then we need to multiply each dimension by some cosine, and each dimension, 05:14:21.220 |
rotated and with its sign changed, by some sines 05:14:26.300 |
computed as follows. So given some positions, for each position m we can compute the cosine and the sine that will be 05:14:34.300 |
needed to multiply these vectors, and this is what we do in the forward method here: 05:14:39.420 |
we extract the cosines and the sines that will be applied to each token, 05:14:44.140 |
depending on the positions of these tokens. So each token will have a different position, 05:14:50.700 |
and this m parameter indicates the position of the token. 05:14:54.860 |
So for each m we can compute the cosines and the sines, and this is what we do in the forward method here. 05:15:00.220 |
So we take the inverse frequency and we add another 05:15:03.820 |
dimension, which I believe is for the batch dimension. 05:15:12.060 |
Then we disable the autocast: autocast in torch is for mixed precision, 05:15:16.700 |
and I don't want to go too much into the details of this stuff, but 05:15:21.500 |
mixed precision is basically this: when you train a model, 05:15:25.500 |
you don't always have to work with 32-bit floating point numbers, because most modern GPUs 05:15:31.580 |
also support working with 16-bit numbers, 05:15:34.940 |
which makes computations faster and also reduces the memory used by these computations. Of course, you lose a little bit of precision, 05:15:42.400 |
but the precision that you need for some operations is not necessary for others. So with autocast, torch 05:16:00.280 |
will use the smaller precision for the numbers when computing certain operations, and higher precision, 05:16:06.600 |
so 32-bit, when computing other operations, such that we never lose much precision overall. 05:16:16.280 |
Here, for the rotary positional encodings, we probably want to retain the full precision, so we disable autocast. 05:16:31.960 |
Then we are basically multiplying each frequency by each position that we want to encode, because as you can see from the paper 05:16:40.760 |
we need to multiply this m by the base frequency. We already have the base frequencies in this buffer, 05:16:48.820 |
so we are multiplying them by each m. So we are computing the arguments of these cosines and sines. 05:17:00.120 |
Why do we then duplicate these arguments for the cosines and sines? Because we have them for dim divided by two, 05:17:05.640 |
so for half the vector, but we need them for the entire vector. 05:17:10.520 |
And we are concatenating here. Now, this is actually different from what is done in the paper: 05:17:18.600 |
in the paper, we need to repeat each argument twice, once for each of two successive dimensions, 05:17:25.240 |
so for each pair of dimensions we need the same argument. 05:17:28.200 |
What we are doing here with the concatenation instead is taking theta 1, then theta 2, then 05:17:35.400 |
theta 3, then theta 4, and then we are repeating theta 1, theta 2, theta 3, theta 4 again, instead of doing theta 1, theta 1, 05:17:42.200 |
theta 2, theta 2, theta 3, theta 3. So the overall number of 05:17:46.200 |
arguments that we produce is the same, 05:17:51.480 |
but instead of being, like in the paper, theta 1, theta 1, theta 2, theta 2, theta 3, theta 3, theta 4, theta 4, and so on, 05:17:59.000 |
we are actually doing theta 1, theta 2, theta 3, ... and then repeating them. 05:18:07.720 |
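A minimal sketch of this forward computation (continuing the inv_freq sketch above; seq_len and the variable names are assumptions):

```python
import torch

seq_len = 16                                                 # example sequence length
position_ids = torch.arange(seq_len, dtype=torch.float32)    # the m values, one per token
freqs = torch.outer(position_ids, inv_freq)                  # [seq_len, dim // 2], entries m * theta_i
emb = torch.cat((freqs, freqs), dim=-1)                      # concatenated layout: theta_1..theta_{d/2}, theta_1..theta_{d/2}
cos, sin = emb.cos(), emb.sin()                              # one row of cosines/sines per position
```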
Why are we doing this? It's a very long story, but basically it looks like when HuggingFace converted the 05:18:16.200 |
weights of the model, for example Llama, from the original pre-trained checkpoint into the HuggingFace format, 05:18:27.800 |
they permuted the query projection and the key projection, which act on the embedding of the token. 05:18:37.580 |
And then, to accommodate for these permuted dimensions, 05:18:42.940 |
they do a slightly different computation for the rotary positional encodings. 05:18:48.840 |
So the overall effect that results from this computation is the same as in the original paper, 05:18:54.600 |
but they are doing this double permutation because one permutation was already done when doing the 05:19:00.740 |
conversion of the weights from the original pre-trained model to the HuggingFace format. 05:19:13.300 |
There is an issue in the HuggingFace repository by a user who asked why the positional encodings are done differently than in the paper, and the 05:19:20.180 |
HuggingFace authors explained that 05:19:24.020 |
when they converted the weights from the original model to the HuggingFace model, 05:19:30.020 |
they permuted the dimensions of wq and wk, and wq and wk are the projection matrices that are used to compute 05:19:37.940 |
the queries and the keys. We apply the rotary positional encodings to the queries and the keys, so we need to 05:19:43.140 |
do another permutation to counteract the effect of the first permutation. That's why the 05:19:49.700 |
computation we are doing does not reflect exactly the paper. 05:19:53.380 |
Let's go forward. So we have created the arguments of the cosine and the sine; 05:20:02.820 |
now let's see what we are doing with these arguments. When you call the cosine function on a 05:20:07.700 |
tensor, it will calculate the cosine element-wise, using the 05:20:11.140 |
values of this tensor as arguments for the cosine, and we do the same for the sine. 05:20:19.860 |
These cosines and sines are basically the two terms in the paper that we need for 05:20:25.860 |
applying the rotary positional encoding to each vector, and we have computed the cosine and the sine for each 05:20:31.700 |
position that we have in our sequence, so for each m that we have in our sequence. 05:20:37.620 |
So let me delete this stuff. Otherwise it remains in my notes forever 05:20:41.700 |
Let's go forward now. We need to implement another method, called apply rotary positional embedding, 05:20:48.680 |
which we include here and which I also copied from HuggingFace. 05:20:54.580 |
What it will do, basically, is first add another dimension, which is the head dimension, to these cosines and sines that we pre-computed. 05:21:01.880 |
Where did we pre-compute them? Well, we computed them here: 05:21:05.540 |
as you can see, we extract the cosines and the sines using the rotary positional encoding class that we have created before. 05:21:11.060 |
The value states tensor passed here is not really used; it's just there to extract the data type of the resulting tensor. 05:21:20.340 |
The position IDs provide the m parameter for each of the arguments of the cosine and the sine. 05:21:24.420 |
So we compute the cosines and the sines, and then we use them to apply the rotary positional encoding to the queries and the keys, 05:21:29.540 |
which will result in the output queries and keys with the rotary positional encoding applied. So now we are implementing this method here, 05:21:40.100 |
multiplying the dimensions of the query vector with the cosines, which is this part of the formula — 05:21:47.380 |
so as you can see, the vector multiplied by the cosine — 05:21:51.140 |
and then the rotated vector, so with its dimensions changed and the signs changed, multiplied by the sine, 05:21:58.260 |
which is this part of the formula here. For that we need to implement this method here, rotate half, 05:22:03.700 |
which again is not equal to what is in the paper, because we need to permute the dimensions, since the 05:22:11.620 |
original vectors, so the q and k, are permuted by 05:22:15.700 |
this query projection and this key projection. 05:22:20.980 |
This rotate half method basically will take the first half of the 05:22:24.740 |
embedding, and then it will take the second half of the embedding with its sign changed, I believe, here, 05:22:32.820 |
and it will concatenate them. It's different from the paper, because in the paper we need to create 05:22:39.540 |
minus x2, then x1, minus x4, x3, and so on. But here what we are doing is, 05:22:48.900 |
let me check, imagine the token is made up of 1024 dimensions: we are doing minus x513, 05:22:54.520 |
minus x514, minus x515, and so on, followed by x1, x2, etc. 05:23:09.780 |
But because of the permutation that was done to the wq and wk projections, the overall result is equivalent. 05:23:18.840 |
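For reference, the HuggingFace-style helpers described here look roughly like this sketch (check the actual repository for the exact signatures):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Build (-x2, x1), where x1 is the first half and x2 the second half of the last dimension
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim: int = 1):
    # Add the head dimension so cos/sin broadcast over [batch, heads, seq_len, head_dim]
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```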
Okay, now we have also implemented the rotary positional encodings, which encode the position information 05:23:24.940 |
right before the attention, so that the attention mechanism will reflect this encoded information inside each token. 05:23:36.840 |
What else do we need to build here? I believe we have everything. So let me do a very quick check. 05:23:47.720 |
Guys, I think now we can proceed to the inference code. So we need to use these methods, 05:23:53.240 |
these classes that we have built, to actually run inference on something. Let's do it. 05:23:57.320 |
All right, guys, let's go to the inference code. So let's create a new file called inference 05:24:06.520 |
I have also prepared the test image that I will be using to 05:24:11.160 |
run inference with the language model. I will ask the language model 05:24:13.640 |
what this building is, and the language model should tell me the name of this building. 05:24:24.280 |
Let's start by writing some code. I will copy a large amount of code, 05:24:29.640 |
because there is not much machine learning here. 05:24:34.120 |
So basically I'm using a library called fire. Let's import stuff first. 05:24:43.720 |
I'm importing PIL for the image loading, torch, and fire. Fire is a library that allows you to 05:24:51.640 |
pass the command line arguments of a script as parameters to a function, 05:24:59.080 |
so it will automatically do the parsing of the command line parameters. 05:25:02.780 |
And what I need to pass on the command line is the model path, 05:25:07.720 |
so where the weights of the model are, the prompt that we will be using to run the model, 05:25:12.520 |
the image that we will be using as the condition for this prompt, 05:25:15.880 |
the max number of tokens to generate, the temperature that we want to apply (we will see it later), 05:25:22.520 |
the top p (we will see it later), the do sample flag if we don't want to use the greedy strategy, 05:25:26.520 |
and a flag if we don't want to use CUDA or MPS (in case you are on a MacBook), 05:25:31.180 |
which forces us to use the CPU as the device for the computation of the neural network. 05:25:37.880 |
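A minimal sketch of this entry point (parameter names and defaults here are illustrative assumptions, not the exact script):

```python
import fire
import torch

def main(
    model_path: str,
    prompt: str,
    image_file_path: str,
    max_tokens_to_generate: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.9,
    do_sample: bool = False,
    only_cpu: bool = False,
):
    # Pick the device: CUDA if available, MPS on Apple Silicon, otherwise CPU
    device = "cpu"
    if not only_cpu:
        if torch.cuda.is_available():
            device = "cuda"
        elif torch.backends.mps.is_available():
            device = "mps"
    print("Device in use:", device)
    # ... load the model and processor, then run the test inference (see the sketches below)

if __name__ == "__main__":
    fire.Fire(main)   # fire turns CLI flags like --model_path ... --prompt "..." into arguments of main()
```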
The first thing that this method will do is print which device we use. 05:25:44.600 |
Then, with the load hugging face model method that we will implement later, given the path and the device, we will load the model with the HuggingFace weights 05:25:54.440 |
by copying each tensor into the right position; this works because we kept the names the same as in the HuggingFace model. 05:26:04.120 |
Then we basically take the input and we process it using this PaliGemma processor, which takes as input the tokenizer, and 05:26:18.520 |
transforms it into the input for our Gemma model, which will then decode it. 05:26:23.480 |
And we will do all of this in the test inference method. 05:26:27.720 |
So for now, we are just creating the PaliGemma processor and the model itself using this load hugging face model function, which we will create later. 05:26:34.760 |
Actually, no, let's do it now. So let's create a new file called utils 05:26:39.100 |
And this utils file needs to have the following code 05:26:44.440 |
So it's importing some stuff and then it's loading the HuggingFace model. It's loading the tokenizer, for which, as I said, we will be using the 05:26:54.920 |
HuggingFace one, so we will not be coding the tokenizer. 05:26:57.500 |
But the weights of the model we can load ourselves, and if you look at the HuggingFace model, 05:27:03.560 |
if you go to the repository of the model, you will see that each model is a list of 05:27:09.160 |
safetensors files; each of these safetensors files is actually a dictionary that contains part of the weights of the model. 05:27:16.120 |
You can actually click on this icon here and it will show you what each of them contains. 05:27:21.800 |
As you can see this one contains the multi-modal projector weight and bias 05:27:25.640 |
This one contains the vision tower embeddings, encoder layers one, layer two, layer three, etc, etc for all the layers 05:27:34.920 |
The wq projection, wk projection, wv projection, the weights and the bias 05:27:40.200 |
The weights, the bias of the layer normalization, the weight of the layer normalization, etc, etc 05:27:46.600 |
and each file contains a dictionary that contains some part of the weights of the model. 05:27:53.240 |
So what I'm doing here is I find all the safetensors files and then I load each of them into a dictionary, 05:27:59.020 |
And then I use this dictionary to load the state dict of our neural network 05:28:03.960 |
I also create the model using the config.json file that is present in the repository of the HuggingFace 05:28:12.200 |
model; every HuggingFace model has this config.json. 05:28:17.160 |
So we create the configuration that is used to create our model using this configuration file 05:28:22.840 |
and then I call tie weights which will copy the weights of the 05:28:27.000 |
Embedding layer to the language modeling head which is the linear layer that projects the embeddings into logits 05:28:33.800 |
And then we return the model and the tokenizer. So here there is no machine learning: I'm just loading the 05:28:39.720 |
weights of the model from the safetensors files, creating the 05:28:46.920 |
model using the configuration saved in config.json, 05:28:49.900 |
and then loading this state dict, which means that I am loading the weights into our class, 05:28:56.120 |
into this model class, and then I'm tying the weights and returning the model and the tokenizer. 05:29:02.920 |
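A minimal sketch of this loading function (the model and config class names are assumptions based on the classes built earlier in the video; only the safetensors and tokenizer calls are standard library APIs):

```python
import glob
import json
import os

from safetensors import safe_open
from transformers import AutoTokenizer

def load_hf_model(model_path: str, device: str):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Collect every tensor from all *.safetensors shards into one big state dict
    tensors = {}
    for file in glob.glob(os.path.join(model_path, "*.safetensors")):
        with safe_open(file, framework="pt", device="cpu") as f:
            for key in f.keys():
                tensors[key] = f.get_tensor(key)

    # Build the model from config.json, load the weights, and tie the embeddings
    with open(os.path.join(model_path, "config.json"), "r") as f:
        config = PaliGemmaConfig(**json.load(f))                    # assumed config class from the video
    model = PaliGemmaForConditionalGeneration(config).to(device)    # assumed model class from the video
    model.load_state_dict(tensors, strict=False)
    model.tie_weights()
    return model, tokenizer
```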
So now we can launch the inference. So we have the model and the tokenizer. We have created the processor 05:29:08.200 |
So we have initialized it then we need to launch the inference 05:29:11.160 |
Let's see how the inference works. So let's go back to here 05:29:16.360 |
This test inference method is also not so hard, but we need to do some preprocessing first. 05:29:25.460 |
So what we are doing is, first of all, we take this prompt, 05:29:32.280 |
which is text, and we pass it to the processor, and the processor will give us, 05:29:37.560 |
as you can see from processing_paligemma, the pixel values and the input IDs. 05:29:48.200 |
We get these values from the processor, so we need to create this function, which is also a simple helper function: 05:29:59.180 |
we load the image and we create the prompt, because the processor expects as input the text 05:30:05.640 |
as a list and the image as a list, even if it only works with a list of size one; 05:30:11.800 |
it takes the output of the processor, which is the input IDs, the attention mask and the 05:30:16.120 |
pixel values of the image, and then it moves each of them to the right device. 05:30:20.680 |
Move to device is also a simple function that moves each tensor to the device specified as a parameter 05:30:29.960 |
and then returns it. So now we have the input IDs, we have the attention mask, we have the pixel values. 05:30:40.040 |
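A minimal sketch of these two helpers (the processor's call signature is an assumption based on the PaliGemma processor built earlier in the video):

```python
from PIL import Image

def move_inputs_to_device(model_inputs: dict, device: str) -> dict:
    # Move every tensor returned by the processor to the chosen device
    return {k: v.to(device) for k, v in model_inputs.items()}

def get_model_inputs(processor, prompt: str, image_file_path: str, device: str) -> dict:
    image = Image.open(image_file_path)
    # The processor expects lists, even if we only ever pass one prompt and one image
    model_inputs = processor(text=[prompt], images=[image])
    return move_inputs_to_device(model_inputs, device)
```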
And then, based on how many tokens we need to generate — oh, I already removed the label — 05:30:46.920 |
based on how many tokens we want to generate, we launch the inference. 05:30:51.640 |
At the beginning these input IDs only include the prompt, 05:30:56.040 |
so the image tokens and the text tokens, without of course any output tokens, because we need to generate the output. 05:31:03.240 |
So what we are doing at the first iteration of this for loop is the prefilling: 05:31:08.440 |
the KV cache is empty, the input IDs contain the image 05:31:13.880 |
placeholders and the text tokens, the pixel values contain the image 05:31:19.080 |
loaded as a numpy array, and then there is the attention mask, which is just a list of ones because we are never working with padding here. 05:31:29.240 |
So the PaliGemma model, which is this one here, will merge the image features that we are passing: 05:31:36.520 |
it will run these pixel values through the image encoder, which will return some image features, and 05:31:46.040 |
we replace the image placeholder tokens with the image features extracted from the image encoder. So now we have a list of embeddings 05:31:53.660 |
where the first embeddings are the image embeddings, followed by the text embeddings, 05:31:58.060 |
and then we send it to the language model for decoding. So let's go back to the inference. 05:32:03.480 |
So the first iteration of this for loop is the prefilling 05:32:06.380 |
which means that the query, key and values have the same sequence length and they contain the tokens of the prompt. 05:32:13.080 |
The output of the prefilling is a list of logits, one for each position, 05:32:20.780 |
but we take only the last logit to predict the next token. 05:32:25.000 |
That's why we take out the logits and keep only the last one here: 05:32:29.160 |
this is the sequence dimension, and we take the last item in this sequence dimension. 05:32:36.040 |
So now let's go to the iPad actually because I want to explain how top p works 05:32:42.040 |
So let's go. Let me check if this is working. Yeah still working 05:32:55.160 |
Let's open a new page. So when you generate logits, 05:32:59.320 |
they basically correspond to a kind of distribution after you apply the softmax. 05:33:05.260 |
The logits form a vector — let me draw it here, it's a vector 05:33:11.240 |
where the number of dimensions is equal to the vocabulary size. 05:33:23.480 |
Each number corresponds to one token in the vocabulary, and it is an indication by the model of what it thinks should be the next token. 05:33:32.120 |
What can we do to understand what the next token is? 05:33:36.680 |
We need to apply the softmax, which will convert each of these numbers into a 05:33:42.920 |
probability score, so something that sums up to one and is always non-negative, 05:33:49.000 |
and we could take, for example, the highest one to understand what the next token is. 05:33:56.520 |
That is the greedy strategy; or we can use another sampling method. So this is a list of numbers, right, one for each position in the vocabulary. 05:34:02.700 |
For example, for the token "hello" the model could give some score, for the token "pizza" 05:34:09.880 |
or, I don't know, "car", it will give another score, etc. 05:34:16.760 |
With top p we do sampling, which means that we sort all of these numbers that we get — 05:34:21.720 |
so all of these numbers that we get, we sort them in decreasing order — 05:34:37.180 |
and we keep only the tokens with the highest score. So with top p, what we are doing with a top p of 0.9 is the following. 05:34:44.060 |
Suppose that to the token "hello" the model has assigned a probability 05:34:48.940 |
of, let's say, 0.2; this one is 0.5 and this one is 0.1. 05:34:53.580 |
Then we have some other token, let's say 0.05, and then another token that is 0.1. 05:34:59.660 |
Again, I don't know if these sum up to one, but okay; and then some other token and some other token. 05:35:05.260 |
We sort them in decreasing order, which means that we sort them like this: we take the 0.5 token, then "hello" with 0.2, 05:35:26.940 |
then 0.1, then something else that is 0.1, then something else that is 0.05, etc. 05:35:33.660 |
With a top p of, let's say... 0.9 is a little bit too much 05:35:39.740 |
for this example, so let's say a top p of 0.7: 05:35:53.200 |
we keep only the tokens such that their cumulative score reaches this value. 05:35:58.780 |
So we will take basically all the tokens such that, when we 05:36:03.820 |
sum up their probability scores, they sum up to this amount, and then we sample from them. 05:36:13.740 |
For example, with 0.7 we will consider only these two tokens. 05:36:20.780 |
To sample from them, we then rescale these numbers such that, again, they sum up to one. 05:36:26.540 |
So suppose that after normalizing again, these values change: 05:36:32.700 |
this will become, let's say, 0.75 and this will become 0.25. 05:36:41.600 |
So basically what will happen is that 75% of the time we will choose this token and 25% of the time 05:36:48.700 |
we will choose this token. This is the meaning of top p: among all the tokens, we sort them and keep the ones 05:36:56.780 |
whose cumulative probability score reaches this top p, 05:37:02.060 |
and then we sample from them just as if they were a distribution by themselves; 05:37:06.960 |
before sampling we need to normalize them again, because they need to be a proper distribution. 05:37:13.580 |
So this is what we do with top p; with greedy, instead, we just take the highest one 05:37:18.860 |
and that's it. But with top p we are actually only keeping the most likely tokens 05:37:26.940 |
to sample from, because for some of them the model is basically saying: don't use this token, because the probability score assigned to it 05:37:34.060 |
is very, very low, so why should we even consider it? That's why we use top p: 05:37:37.740 |
we only consider the most likely tokens chosen by the model, 05:37:41.340 |
so we don't introduce too much noise in the generation process. 05:37:50.540 |
So what we are doing here is sampling with top p if we decided to sample; 05:37:54.540 |
otherwise, we just take the one with the highest probability score, which is the greedy strategy, if we don't want to sample. 05:37:59.740 |
There is also this thing called temperature. So what is temperature? Temperature basically means that we divide 05:38:06.160 |
— as you can see here, we divide the logits by the temperature before applying the softmax. 05:38:17.020 |
Basically, what happens is that before we apply the softmax these numbers are just raw scores: 05:38:24.620 |
for example, this may be 10, this may be 7, this may be 5, this may be 2, this may be 1. 05:38:33.980 |
When we apply the — sorry, when we apply the temperature, we are 05:38:41.820 |
making the differences between them a little smaller. 05:38:45.580 |
So basically, if the model is giving us the following distribution, 05:38:49.980 |
telling us that this token is likely, but this one is very 05:38:53.100 |
much more likely, and this is less likely, and this is less likely, etc., 05:38:58.460 |
what we are trying to do with the temperature is reduce the gap between these peaks, 05:39:08.860 |
so that we are more likely to choose more diverse tokens. 05:39:12.300 |
Because with the temperature, what will happen is that "hello", instead of being chosen 25% of the time, 05:39:18.700 |
will be chosen, let's say, 33% of the time, and this will become 05:39:22.140 |
0.66. So basically we are introducing some noise in the choice that we make. 05:39:36.380 |
I know it's a little difficult to visualize, but basically with the temperature 05:39:40.060 |
we are trying to make it more likely to choose more diverse tokens, because we are reducing the gaps between the probability scores. 05:39:49.580 |
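As a minimal sketch (variable names such as outputs and temperature are assumptions): the logits of the last position are rescaled by the temperature before the softmax. Dividing by a temperature above 1 flattens the distribution (more diverse choices), while a temperature below 1 sharpens it towards the most likely token.

```python
next_token_logits = outputs["logits"][:, -1, :]                 # keep only the last position
probs = torch.softmax(next_token_logits / temperature, dim=-1)  # temperature-scaled distribution
```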
And then we do the sampling with top p, which 05:39:55.900 |
does what we saw before: we sort in descending order and then we sample from the distribution. 05:40:01.980 |
So actually, let's do it — let's do it one step at a time. 05:40:07.260 |
Sample top p, we can put it here. As you can see, we are sorting in descending order, 05:40:14.300 |
we are calculating the cumulative sum, we are only keeping the tokens whose cumulative sum is within the p parameter, 05:40:25.340 |
then we normalize again so that they sum up to one, because we have removed some tokens from this distribution, 05:40:32.860 |
and then we sample from this distribution using multinomial, and then we take the chosen token. 05:40:40.300 |
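A sketch of what such a helper could look like (this mirrors the common Llama-style implementation of top-p sampling; it is not necessarily the exact code in the video):

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    # Sort probabilities in descending order and compute the cumulative sum
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # Zero out tokens that lie outside the nucleus of cumulative mass p
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    # Renormalize so the kept tokens sum to one again, then sample
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)
    # Map back from sorted positions to the original vocabulary indices
    return torch.gather(probs_idx, -1, next_token)
```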
So we have applied the top p, so now we know what the next token is. We 05:40:46.300 |
take this token and we add it to this generated tokens array. 05:40:51.340 |
If the next token corresponds to the stop token, which is the end-of-sentence token, then we stop the generation. 05:41:00.300 |
And then we update these input IDs, as you can see, for the next iteration, because 05:41:08.860 |
at each inference step we use as query only the last predicted token. 05:41:16.300 |
So this is what we are doing here. At the second iteration of this for loop, 05:41:19.900 |
our input IDs will become one single token. 05:41:23.660 |
So at the first iteration we are doing the prefilling, and the input IDs are all the tokens of the prompt, 05:41:30.140 |
so the image tokens and the text tokens describing what we want to do with this image; 05:41:34.300 |
at the second iteration these input IDs will only be one token. 05:41:39.260 |
How can the model work with only one token? Because the model always has access to all the previous 05:41:45.500 |
keys and values, since they have been saved in the KV cache. So when we calculate the attention, the model will add this 05:41:52.540 |
single token to the KV cache, retrieve whatever is inside the KV cache, and use it to calculate the attention. 05:42:02.060 |
We keep increasing the attention mask by adding a one, because we want to attend to all the past tokens in the KV cache. 05:42:13.580 |
Usually you are used to thinking of padding as something that is present on the right, 05:42:16.860 |
but actually padding can also be done on the left. 05:42:19.020 |
Because on the left we don't have any padding tokens, 05:42:22.540 |
the attention mask is always made up of ones; and in my implementation I am never working with padding anyway. 05:42:29.580 |
We generate these tokens and we concatenate them together, because we save them into an array, 05:42:33.980 |
so we need to build a tensor, which is then sent to the tokenizer for decoding, and then we print the result. 05:42:42.140 |
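A minimal sketch of this generation loop (the KVCache class and the model's call signature are assumptions based on the classes built earlier in the video; sample_top_p is the helper sketched above):

```python
kv_cache = KVCache()
stop_token = processor.tokenizer.eos_token_id
generated_tokens = []

for _ in range(max_tokens_to_generate):
    outputs = model(
        input_ids=input_ids,              # full prompt at step 0 (prefill), one token afterwards
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        kv_cache=kv_cache,
    )
    kv_cache = outputs["kv_cache"]
    next_token_logits = outputs["logits"][:, -1, :]

    if do_sample:
        probs = torch.softmax(next_token_logits / temperature, dim=-1)
        next_token = sample_top_p(probs, top_p)
    else:
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)   # greedy strategy

    generated_tokens.append(next_token.squeeze(0))
    if next_token.item() == stop_token:
        break

    # From now on the query is only the token we just generated;
    # the KV cache already holds the keys/values of everything before it.
    input_ids = next_token
    attention_mask = torch.cat(
        [attention_mask, torch.ones((1, 1), device=input_ids.device)], dim=-1
    )

decoded = processor.tokenizer.decode(torch.cat(generated_tokens, dim=-1), skip_special_tokens=True)
print(prompt + decoded)
```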
And now we can finally run the generation, so the inference. I will copy the script that I have already prepared. 05:42:52.140 |
I have already saved the weights of the model, 05:42:54.620 |
so if you want to run this code, you need to download the repository of this model and clone it locally; 05:43:05.900 |
in the model path you give the path to where you saved it. You give the prompt — my prompt is "this building is", and the model should tell me 05:43:11.580 |
what this building is — and the image file is this building here; it's a building in Xi'an, China. 05:43:17.260 |
And then we use this temperature and this top p, and we do not sample, because 05:43:23.020 |
I want the greedy strategy, and I also want to use CUDA. We run the script like this. So now let's run it. 05:43:31.740 |
I think yeah should be no problem. So launch inference. Let's see 05:43:38.780 |
All right guys, so after I launched the inference my computer actually went a little crazy, 05:43:48.780 |
and then it worked — I don't know why, but my CUDA sometimes doesn't work and it blocks my whole computer. 05:43:54.460 |
So if you run the inference using the code that we have made, it should give this output: 05:44:00.220 |
So this building is the oldest clock tower in the world 05:44:03.180 |
I don't know if it's actually the oldest clock tower in the world, but this building is called the Zhonglou, 05:44:08.220 |
so it's the clock tower of Xi'an, a very famous building, and it looks like the output is correct. 05:44:13.900 |
So thank you guys for watching this video. I know it has been a very very long journey 05:44:18.860 |
I had to do a lot of explanations, and I sometimes had to improvise while explaining, 05:44:24.780 |
so it is possible that there may be some 05:44:27.660 |
imprecisions in my way of explaining, because I don't have a transcript that I'm reading 05:44:32.060 |
for all of the things that I have talked about; 05:44:36.220 |
I just look at the code and try to come up with the right words to explain it, 05:44:41.420 |
and of course you cannot always find the right words immediately — 05:44:45.420 |
maybe you need to look at it for at least a minute to get the right words. 05:44:50.220 |
Hopefully at least 90% of the content is 05:44:52.780 |
correct, and the other 10% may have some noise. 05:44:56.220 |
I will try to clarify the things that I have not explained correctly in the comments or in the description of the video. 05:45:02.060 |
Thank you guys for watching this video. So please share it with your friends and 05:45:07.020 |
Like it if you like it and subscribe to my channel 05:45:11.260 |
A lot of people have asked me what is the best way to contribute economically, 05:45:15.600 |
to support me, but thankfully, thank God, I don't need any economic support for now. 05:45:22.780 |
If I ever needed it, I would be the first one to ask. 05:45:25.740 |
So if you want to help someone economically, there are many people in the world that you can help: 05:45:29.740 |
there are people in war areas, in Palestine, in Ukraine — you can help them economically. 05:45:34.880 |
But for me, I just need you guys to follow me and to share my video. This is the best way to help me out 05:45:40.620 |
Also, I work at a company called Writer, and my team is currently hiring. 05:45:44.620 |
We are looking for amazing researchers, and you can find out more about the open positions. 05:45:50.940 |
We train our own models. We have plenty of gpus 05:45:54.060 |
So if you are a researcher dealing with language models, or any other area of machine learning, feel free to 05:46:00.220 |
send your resume. So thank you guys and have a nice day.