Coding a Transformer from scratch in PyTorch, with full explanation, training and inference.
Chapters
0:00 Introduction
1:20 Input Embeddings
4:56 Positional Encodings
13:30 Layer Normalization
18:12 Feed Forward
21:43 Multi-Head Attention
42:41 Residual Connection
44:50 Encoder
51:52 Decoder
59:20 Linear Layer
1:01:25 Transformer
1:17:00 Task overview
1:18:42 Tokenizer
1:31:35 Dataset
1:55:25 Training loop
2:20:05 Validation loop
2:41:30 Attention visualization
00:00:00.000 |
Hello guys, welcome to another episode about the transformer. In this episode we will be building the transformer from scratch using PyTorch 00:00:07.880 |
so coding it from zero. We will be building the model and we will also build the code for training it for inferencing and for visualizing the attention scores 00:00:16.780 |
stick with me because it's gonna be a long video but I assure you that by the end of the video you will have a deep knowledge of the transformer model 00:00:24.800 |
not only from a conceptual point of view but also from a practical point of view 00:00:29.120 |
we will be building a translation model which means that our model will be able to translate from one language to another 00:00:36.300 |
I chose a dataset called Opus Books; it is a collection of sentences taken from famous books 00:00:43.580 |
I chose English to Italian because I'm Italian, so I can tell if the translation is good or not 00:00:51.520 |
but I will show you at which point you can change the language, so you can test the same model with the language of your choice 00:00:58.800 |
let's get started! Let's open the IDE of our choice, in my case I really love Visual Studio Code 00:01:05.100 |
and let's create our first file which is the model of the transformer 00:01:10.100 |
okay, let's go have a look at the transformer model first so we know which part we are going to build first 00:01:21.100 |
the first part that we will be building is the input embeddings 00:01:24.600 |
as you can see, the input embeddings take the input and convert it into an embedding 00:01:30.600 |
as you remember from my previous video, the input embeddings allow us to convert the original sentence into a vector of 512 dimensions 00:01:40.100 |
for example in this sentence "your cat is a lovely cat" first we convert the sentence into a list of input IDs 00:01:48.300 |
that is numbers that correspond to the position of each word inside the vocabulary 00:01:53.900 |
and then each of these numbers corresponds to an embedding, which is a vector of size 512 00:02:02.200 |
the first thing we need to do is to import Torch 00:02:15.500 |
this is the constructor we will need to tell him what is the dimension of the model 00:02:28.500 |
so the dimension of the vector in the paper this is called D model 00:02:33.500 |
and we also need to tell him what is the vocabulary size 00:02:38.500 |
so how many words there are in the vocabulary 00:03:00.300 |
save these two values and now we can create the actual embedding 00:03:04.300 |
actually PyTorch already provides us with a layer that does exactly what we want to do 00:03:10.300 |
that is given a number it will provide you with the same vector every time 00:03:17.300 |
it's just a mapping between numbers and a vector of size 512 00:03:25.300 |
so this is done by the embedding layer, nn.Embedding 00:03:35.300 |
let me check why my autocomplete is not working 00:03:46.300 |
okay so now let's implement the forward method 00:03:54.300 |
what we do in the embedding is that we just use the embedding layer provided by PyTorch to do this mapping 00:04:06.300 |
now actually there is a little detail that is written on the paper 00:04:09.300 |
that is let's have a look at the paper actually 00:04:11.300 |
let's go here and if we check the embedding and softmax 00:04:14.300 |
we will see that in this sentence in the embedding layer 00:04:17.300 |
we multiply the weights of the embedding by square root of D model 00:04:21.300 |
so what the authors do they take the embedding given by this embedding layer 00:04:28.300 |
which I remind you is just a dictionary kind of layer 00:04:31.300 |
that just maps numbers to the same vector every time 00:04:38.300 |
so we just multiply this by math.sqrt of D model 00:04:56.300 |
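Here is a minimal sketch of this module put together (class and attribute names are my own choice, following the paper's notation):

```python
import math
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # nn.Embedding is a lookup table: the same id always maps to the same (learned) vector
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # (batch, seq_len) -> (batch, seq_len, d_model), scaled by sqrt(d_model) as in the paper
        return self.embedding(x) * math.sqrt(self.d_model)
```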
the next module we are going to build is the positional encoding 00:04:59.300 |
let's have also a look at what are the positional encoding very fast 00:05:03.300 |
so we saw before that our original sentence gets mapped to a list of vectors 00:05:14.300 |
now we want to convey to the model the information about the position of each word inside the sentence 00:05:21.300 |
and this is done by adding another vector of the same size as the embedding 00:05:27.300 |
that includes some special values given by a formula that I will show later 00:05:31.300 |
that tells the model that this particular word occupies this position in the sentence 00:05:36.300 |
so we will create these vectors called the position embedding 00:05:43.300 |
okay let's define the class positional encoding 00:05:55.300 |
okay what we need to give to the constructor is for sure the D model 00:06:00.300 |
because this is the size of the vector that the positional encoding should be 00:06:04.300 |
and the sequence length this is the maximum length of the sentence 00:06:09.300 |
and because we need to create one vector for each position 00:06:37.300 |
okay let's actually build a positional encoding 00:06:41.300 |
okay first of all the positional encoding is a 00:06:44.300 |
we will build a matrix of shape sequence length to D model 00:06:49.300 |
because we need vectors of D model size so 512 00:06:57.300 |
because the maximum length of the sentence is sequence length 00:07:15.300 |
okay before we create the matrix and we know how to create the matrix 00:07:20.300 |
let's have a look at the formula used to create the positional encoding 00:07:23.300 |
so let's go have a look at the formula used to create the positional encoding 00:07:30.300 |
and let's have a look at how to build the vectors 00:07:33.300 |
so as you remember we have a sentence let's say in this case we have three words 00:07:36.300 |
we use these two formulas taken from the paper 00:07:43.300 |
and one for each possible position so up to sequence length 00:07:48.300 |
and in the even positions we apply the first formula 00:07:52.300 |
in the odd positions of the vector we apply the second formula 00:07:56.300 |
in this case I will actually simplify the calculation 00:08:00.300 |
because I saw online it has been simplified also 00:08:03.300 |
so we will do a slightly modified calculation using log space 00:08:10.300 |
so when you apply the exponential and then the log of something inside the exponential 00:08:15.300 |
the result is the same number but it's more numerically stable 00:08:22.300 |
that will represent the position of the word inside the sentence 00:08:26.300 |
and this vector can go from 0 to sequence length -1 00:08:53.300 |
so actually we are creating a tensor of shape sequence length to 1 00:09:10.300 |
okay now we create the denominator of the formula 00:09:35.300 |
and these are the two terms we see inside the formula 00:09:40.300 |
so the first tensor that we build that's called position 00:09:44.300 |
it's this pos here, and the second tensor that we build is the denominator here 00:09:48.300 |
but we calculated it in log space for numerical stability 00:09:52.300 |
the value actually will be slightly different but the result will be the same 00:09:56.300 |
the model will learn this positional encoding 00:09:58.300 |
don't worry if you don't fully understand this part 00:10:01.300 |
it's just very special let's say functions that convey this positional information to the model 00:10:07.300 |
and if you watched my previous video you will also understand why 00:10:11.300 |
now we apply this position and this denominator inside the sine and the cosine 00:10:15.300 |
as you remember the sine is only used for the even positions 00:10:32.300 |
so every position will have the sine but only 00:10:37.300 |
so every word will have the sine but only the even dimensions 00:10:41.300 |
so starting from 0 up to the end and going forward by 2 means 00:10:46.300 |
every from 0 then the number 2 then the number 4 etc etc 00:11:06.300 |
in this case we start from 1 and go forward by 2 00:11:18.300 |
and then we need to add the batch dimension to this tensor 00:11:22.300 |
so that we can apply it to the whole sentences 00:11:27.300 |
because now the shape is sequence length by d_model 00:11:31.300 |
so what we do is we add a new dimension to this PE 00:11:41.300 |
so it will become a tensor of shape 1 by sequence length by d_model 00:11:48.300 |
and finally we can register this tensor in the buffer of this module 00:12:02.300 |
so basically when you have a tensor that you want to keep inside the module 00:12:09.300 |
but you want it to be saved when you save the file of the module 00:12:15.300 |
this way the tensor will be saved in the file along with the state of the module 00:12:28.300 |
we need to add this positional encoding to every word inside the sentence 00:12:37.300 |
plus the positional encoding for this particular sentence 00:12:51.300 |
and we also tell the module that we don't want to learn this positional encoding 00:13:00.300 |
they are not learned along the training process 00:13:06.300 |
this will make this particular tensor not learned 00:13:16.300 |
and that's it, this is the positional encoding 00:13:20.300 |
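A sketch of the whole module as just described; the dropout applied to the sum follows the paper's regularization of embeddings plus positional encodings, and the exact variable names are my own:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        pe = torch.zeros(seq_len, d_model)                                   # (seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
        # the denominator term, computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)                         # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                         # odd dimensions
        pe = pe.unsqueeze(0)                                                 # (1, seq_len, d_model)
        self.register_buffer('pe', pe)                                       # saved with the module, but not a parameter

    def forward(self, x):
        # add the (fixed, non-learned) positional encoding to every word of the sentence
        x = x + self.pe[:, :x.shape[1], :]
        return self.dropout(x)
```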
first we will build the encoder part of the transformer 00:13:26.300 |
and we still have to build the multi-head attention, the add & norm and the feed forward 00:13:31.300 |
and actually there is another component, the residual (skip) connection, which connects all these sublayers 00:13:47.300 |
layer normalization basically means that if you have a batch of n items 00:13:57.300 |
and each sentence is made up of many words, each represented by numbers 00:14:03.300 |
and layer normalization means that for each item in this batch we calculate a mean and a variance 00:14:09.300 |
independently from the other items of the batch 00:14:12.300 |
and then we calculate the new values for each of them using their own mean and their own variance 00:14:18.300 |
in the layer normalization usually we also introduce some parameters 00:14:30.300 |
one is multiplicative, so it's multiplied by each of these x 00:14:33.300 |
and one is additive, so it's added to each one of these x 00:14:38.300 |
because we want the model to have the possibility to amplify these values 00:14:46.300 |
so the model will learn to multiply this gamma by these values 00:14:51.300 |
in such a way to amplify the values that it wants to be amplified 00:14:54.300 |
ok, let's go to build the code for this layer 00:15:09.300 |
in this case we don't need any parameter except for one 00:15:21.300 |
which is a very small number that you need to give to the model 00:15:24.300 |
and I will also show you why we need this number 00:15:33.300 |
ok, this epsilon is needed because if we look at the slide 00:15:37.300 |
we have this epsilon here in the denominator of this formula here 00:15:45.300 |
divided by the square root of sigma square plus epsilon 00:16:01.300 |
as we know that the CPU or the GPU can only represent numbers 00:16:08.300 |
so we don't want very big numbers or very small numbers 00:16:11.300 |
so usually for numerical stability we use this epsilon 00:17:04.300 |
we need to calculate the mean and the standard deviation 00:17:27.300 |
usually the mean cancels the dimension to which it is applied 00:17:33.300 |
and then we just apply the formula that we saw on the slide 00:18:09.300 |
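A sketch of this layer with the multiplicative (gamma/alpha) and additive (beta/bias) parameters and the epsilon in the denominator; note that adding epsilon to the standard deviation instead of putting it inside the square root is a small simplification with the same stabilizing effect:

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                               # keeps the denominator away from zero
        self.alpha = nn.Parameter(torch.ones(1))     # multiplicative parameter (gamma)
        self.bias = nn.Parameter(torch.zeros(1))     # additive parameter (beta)

    def forward(self, x):
        # mean and std over the last dimension, computed per item independently of the batch
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias
```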
okay let's go have a look at the next layer we are going to build 00:18:12.300 |
the next layer we are going to build is the feed forward 00:18:16.300 |
and the feed forward is basically a fully connected layer 00:18:21.300 |
that the model uses both in the encoder and in the decoder 00:18:28.300 |
what are the details of this feed forward layer 00:18:32.300 |
the feed forward layer is basically two matrices 00:18:39.300 |
one after another with a relu in between and with a bias 00:18:43.300 |
we can do this in PyTorch using a linear layer 00:18:58.300 |
in the paper we can also see the dimensions of these matrices 00:19:24.300 |
and in the constructor we need to define these two values 00:20:04.300 |
and then we define the second matrix w2 and b2 00:20:29.300 |
because actually as you can see here bias is by default it's true 00:20:34.300 |
so it's already defining a bias matrix for us 00:20:51.300 |
the input is a tensor with dimensions (batch, sequence length, d_model) 00:21:00.300 |
which we convert into another tensor of (batch, sequence length, d_ff) 00:21:07.300 |
because if we apply this linear it will convert the d model into dff 00:21:11.300 |
and then we apply the linear 2 which will convert it back to d model 00:21:42.300 |
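A sketch of this block: two linear layers (both with bias, since bias=True is the default) with a ReLU in between; the dropout between them is an assumption consistent with the rest of the model:

```python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)     # W1 and b1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)     # W2 and b2

    def forward(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
```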
our next block is the most important and most interesting one 00:21:48.300 |
we saw briefly in the last video how the multi-head attention works 00:21:54.300 |
so I will open now the slide again to show to rehearse how it actually works 00:22:03.300 |
as you remember in the encoder we have the multi-head attention 00:22:06.300 |
that takes the input of the encoder and uses it three times 00:22:11.300 |
one time it's called query, one time it's called key and one time it's called values 00:22:16.300 |
you can also think it like a duplication of the input three times 00:22:20.300 |
or you can just say that it's the same input applied three times 00:22:23.300 |
and the multi-head attention basically works like this 00:22:26.300 |
we have our input sequence which is sequence length by d model 00:22:35.300 |
which are exactly the same as the input in this case 00:22:40.300 |
you see that in the decoder it's a slightly different 00:22:42.300 |
and then we multiply this by matrices called W_q, W_k and W_v 00:22:49.300 |
and this results in a new matrix of dimension sequence by d model 00:22:54.300 |
we then split these matrices into h matrices, smaller matrices 00:22:59.300 |
why h? because it's the number of head we want for this multi-head attention 00:23:03.300 |
and we split these matrices along the embedding dimension 00:23:08.300 |
which means that each head we will have access to the full sentence 00:23:12.300 |
but a different part of the embedding of each word 00:23:16.300 |
we apply the attention to each of these smaller matrices using this formula 00:23:21.300 |
which will give us smaller matrices as a result 00:23:26.300 |
so we concatenate them back just like the paper says 00:23:33.300 |
and finally we multiply it by W_o to get the multi-head attention output 00:23:38.300 |
which again is a matrix that has the same dimension as the input matrix 00:23:43.300 |
as you can see, the output of the multi-head attention is also (sequence length, d_model) 00:23:48.300 |
in this slide actually I didn't show the batch dimension 00:23:55.300 |
we don't work only with one sentence but with multiple sentences 00:23:58.300 |
so we need to think that we have another dimension here which is the batch 00:24:03.300 |
okay let's go to code this multi-head attention 00:24:09.300 |
so we can see in detail everything how it's done 00:24:13.300 |
but I really wanted you to have an overview again of how it works 00:24:33.300 |
and what we need to give to this multi-head attention as parameter 00:24:37.300 |
for sure the d model of the model which is in our case 512 00:24:42.300 |
the number of heads which we call h just like in the paper 00:24:46.300 |
so h indicates the number of heads we want and then the dropout value 00:25:00.300 |
as you can see we need to divide this embedding vector into h heads 00:25:05.300 |
which means that this d model should be divisible by h 00:25:08.300 |
otherwise we cannot divide equally the same vector 00:25:13.300 |
representing the embedding into equal matrices for each head 00:25:17.300 |
so we make sure that d model is divisible by h basically 00:25:35.300 |
if we watch again my slide we can see that the value d model divided by h is called dk 00:25:41.300 |
as we can see here if we divide the d model by h heads 00:25:50.300 |
and to be aligned with the nomenclature used in the paper 00:26:08.300 |
okay let's also define the matrices by which we will multiply the query the key and the values 00:26:21.300 |
this again is a linear so from d model to d model 00:26:25.300 |
why from d model to d model because as you can see from my slides 00:26:33.300 |
so that the output will be (sequence length, d_model) 00:27:03.300 |
finally we also have the output matrix, which is called W_o here 00:27:21.300 |
because this head is actually the result this head comes from this multiplication 00:27:36.300 |
so our W_o is also a matrix that is d_model by d_model 00:28:04.300 |
and let's see how the multi head attention works in detail during the coding process 00:28:20.300 |
the mask is basically if we want some words to not interact with other words 00:28:28.300 |
and we saw in my previous video but now let's go back to those slides 00:28:33.300 |
as you remember when we calculate the attention 00:29:00.300 |
and if we don't want some words to interact with other words 00:29:15.300 |
because as you remember the softmax on the numerator has e to the power of x 00:29:28.300 |
so basically we hide the attention for those two words 00:30:31.300 |
which has the same dimension as the initial matrix 00:31:20.300 |
we want to split it into two smaller dimensions 00:32:51.300 |
we do the same thing for the key and the value 00:34:06.300 |
let's create a function to calculate the attention 00:34:50.300 |
it's the last dimension of the query, key, and the value 00:35:02.300 |
so that you can understand how we will use it 00:35:21.300 |
so we give it the query, the key, the value, the mask 00:35:36.300 |
that is the query multiplied by the transpose of the key 00:35:52.300 |
so this @ sign means matrix multiplication in PyTorch 00:36:06.300 |
the last two dimensions are sequence length by d_k 00:36:24.300 |
so we want to hide some interaction between words 00:36:30.300 |
so the softmax will take care of the values that we replaced 00:37:04.300 |
replace all the values for which this statement is true 00:37:18.300 |
later we will see also how we will build the mask 00:37:23.300 |
that these are all the values that we don't want 00:37:39.300 |
because they are just filler words to reach the sequence length 00:37:49.300 |
which is a very big number in the negative range 00:39:22.300 |
given by the model for that particular interaction 00:40:19.300 |
concat, just like the formula says from the paper 00:40:35.300 |
before we transformed the matrix into sequence length 00:40:39.300 |
we had the sequence length as the third dimension 00:40:48.300 |
we want the sequence length to be in the second position 00:42:55.300 |
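Putting the whole block together, here is a sketch of the multi-head attention as just described: project with W_q, W_k, W_v, split into h heads along the embedding dimension, apply the scaled dot-product attention with the optional mask, concatenate the heads back and project with W_o (class and variable names are my own, following the paper):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)   # Wq
        self.w_k = nn.Linear(d_model, d_model)   # Wk
        self.w_v = nn.Linear(d_model, d_model)   # Wv
        self.w_o = nn.Linear(d_model, d_model)   # Wo
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        # (batch, h, seq_len, d_k) @ (batch, h, d_k, seq_len) -> (batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # hide the interactions we don't want (padding, future words) before the softmax
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        query, key, value = self.w_q(q), self.w_k(k), self.w_v(v)   # (batch, seq_len, d_model)
        # split d_model into h heads of size d_k: (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        # concatenate the heads back: (batch, h, seq_len, d_k) -> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
        return self.w_o(x)
```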
then the output of this is sent to the addNorm 00:45:11.300 |
and the output of the last one is sent to the decoder 00:47:45.300 |
to check the slide so we can understand what we 00:48:37.300 |
self-attention, because the role of the query 00:48:53.300 |
in the decoder it's different because we have 00:51:33.300 |
that are the skip connection, the skip connection 00:52:11.300 |
the positional encodings, we can use the same 00:52:27.300 |
multihead attention with another skip connection 00:54:55.300 |
also the residual connection, in this case we have 00:55:19.300 |
the forward method which is very similar to the 00:56:01.300 |
two masks, one is the one coming from the encoder 00:56:23.300 |
and just like before we calculate the self-attention 00:56:39.300 |
because this is the self-attention block of the 00:56:47.300 |
combine, we need to calculate the cross-attention 00:57:53.300 |
also in this case we will provide with many layers 01:01:57.300 |
and a target embedding, because we are dealing with 01:06:27.300 |
in this case we are talking about translation 01:07:21.300 |
what is the source sequence length and the target sequence length 01:07:43.300 |
that is dealing with two very different languages 01:07:51.300 |
are much higher or much lower than the other ones 01:08:05.300 |
because we want to keep the same values as the paper 01:11:23.300 |
and finally we tell him how much is the dropout 01:12:01.860 |
We also have the cross attention for the decoder block. 01:12:16.880 |
We also have the feedforward, just like the encoder. 01:12:34.280 |
Then we define the decoder block itself, which is decoder block, cross attention and finally 01:13:06.600 |
We now can create the encoder and the decoder. 01:13:24.760 |
We give him all his blocks, which are n and then also the decoder. 01:13:37.560 |
And we create the projection layer, which will convert d_model into the vocabulary size. 01:13:50.920 |
Of course the target, because we want to take from the source language to the target language. 01:13:55.460 |
So we want to project our output into the target vocabulary. 01:14:13.560 |
An encoder, a decoder, source embedding, target embedding, then source positional encoding, 01:14:29.520 |
target positional encoding, and finally the projection layer. 01:14:39.640 |
Now we can just initialize the parameters using the Xavier uniform. 01:14:45.640 |
This is a way to initialize the parameters to make the training faster so they don't 01:14:56.920 |
I saw many implementations using Xavier, so I think it's a quite good start for the model 01:15:26.440 |
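A sketch of that initialization, applied at the end of the build function to every parameter with more than one dimension (a common convention, assumed here; transformer is the instance being assembled):

```python
import torch.nn as nn

# initialize the weight matrices so the training converges faster
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
```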
And now that we have built the model, we will go further to use it. 01:15:30.040 |
So we will first have a look at the dataset, then we will build the training loop. 01:15:37.800 |
After the training loop, we will also build the inferencing part and the code for visualizing 01:15:46.240 |
So hold on and take some coffee, take some tea, because it's going to be a little long, 01:15:54.960 |
Now that we have built the code for the model, our next step is to build the training code. 01:16:01.060 |
But before we do that, let's recheck the code, because we may have some typos. 01:16:18.280 |
So there is a typo in how we wrote "feedForward" here. 01:16:23.920 |
And so the same problem is also present in every reference to "feedForward". 01:16:29.520 |
And also here, when we are building the decoder block. 01:16:34.360 |
And the other problem is that here, when we build the decoder block, we just wrote "nn.module". 01:16:42.680 |
And then the "feedForward" should be also fixed here and here in the buildTransformer 01:16:49.000 |
Now, I can delete the old one, so we don't need it anymore. 01:17:04.840 |
But before we build the training code, we have to look at the data. 01:17:10.780 |
So as I said before, we are dealing with a translation task. 01:17:14.120 |
And I have chosen this dataset called "opus_books", which we can find on HuggingFace. 01:17:19.620 |
And we will also use the library from HuggingFace to download this dataset for us. 01:17:24.280 |
And this is the only library we will be using beside PyTorch. 01:17:28.200 |
Because of course we cannot reinvent the dataset by ourselves, so we will use this dataset. 01:17:33.880 |
And we will also use the HuggingFace tokenizer library to transform this text into vocabulary. 01:17:42.200 |
Because our goal is to build the transformer, so not to reinvent the wheel about everything. 01:17:48.040 |
So we will be only focusing on building and training the transformer. 01:17:52.800 |
And in my particular case, I will be using the subset "English to Italian", but we will 01:17:57.440 |
build the code in such a way that you can choose the language and the code will act 01:18:03.680 |
If we look at the data, we can see that each data item is a pair of sentences in English 01:18:12.480 |
For example, there was no possibility of taking a walk that day, which in Italian means "In 01:18:20.000 |
So we will train our transformer to translate from the source language, which is English, 01:18:32.240 |
So first we will make the code to download this dataset and to create the tokenizer. 01:18:40.080 |
Let's go back to the slides to just have a brief overview of what we are going to do 01:18:45.440 |
The tokenizer is what comes before the input embeddings. 01:18:51.440 |
So for example, "Your cat is a lovely cat", but this sentence will come from our dataset. 01:18:56.280 |
The goal of the tokenizer is to create this token. 01:18:59.680 |
So split this sentence into single words, which has many strategies. 01:19:04.480 |
As you can see here, we have a sentence, which is "Your cat is a lovely cat". 01:19:09.180 |
And the goal of the tokenizer is to split this sentence into single words, which can 01:19:16.320 |
There is the BPE tokenizer, there is the word-level tokenizer, there is the sub-word-level, word-part 01:19:24.220 |
The one we will be using is the simplest one called the word-level tokenizer. 01:19:27.980 |
So the word-level tokenizer basically will split this sentence, let's say by space. 01:19:32.420 |
So each space defines the boundary of a word, and so into the single words, and each word 01:19:41.420 |
So this is the job of the tokenizer, to build the vocabulary of these numbers and to map 01:19:51.100 |
When we build the tokenizer, we can also create special tokens, which we will use for the 01:19:56.620 |
For example, the tokens called padding, the token called the start-of-sentence, end-of-sentence, 01:20:02.540 |
which are necessary for training the transformer. 01:20:07.540 |
So let's build first the code for building the tokenizer and to download the dataset. 01:20:27.380 |
And we also, because we are using a library from HuggingFace, we also need to import these 01:20:34.940 |
We will be using the datasets library, which you can install using pip. 01:20:50.060 |
And we will also be using the tokenizers library also from HuggingFace, which you can install 01:21:02.600 |
We also need to import the tokenizer model that we need, so we will use the word-level tokenizer. 01:21:19.800 |
And there is also the trainers, so the tokenizer, the class that will train the tokenizer. 01:21:27.520 |
So that will create the vocabulary given the list of sentences. 01:21:40.800 |
And we will split the word according to the white space. 01:21:47.460 |
So I will build first the methods to create the tokenizer, and I will describe each parameter. 01:21:55.240 |
For now, you will not have the bigger picture, but later when we combine all these methods 01:22:01.840 |
So let's first make the method that builds the tokenizer. 01:22:13.380 |
And this method takes the configuration, which is the configuration of our model. 01:22:18.760 |
The dataset and the language for which we are going to build the tokenizer. 01:22:25.920 |
We define the tokenizer path, so the file where we will save this tokenizer. 01:22:46.800 |
First of all, this path is coming from the pathlib, so from pathlib. 01:22:53.580 |
This is a library that allows you to create absolute path given relative paths. 01:22:58.420 |
And we pretend that we have a configuration called the tokenizer file, which is the path 01:23:05.360 |
And this path is formattable using the language. 01:23:08.280 |
So for example, we can have something like this, for example, something like this. 01:23:23.360 |
And this will be, given the language, it will create a tokenizer English or tokenizer Italian, 01:23:32.140 |
So if the tokenizer doesn't exist, we create it. 01:23:45.200 |
I took all this code actually from HuggingFace. 01:23:49.580 |
I just took it from the quick tour of their tokenizers library. 01:23:54.500 |
And it's really easy to use it, and saves you a lot of time. 01:23:58.680 |
Because building a tokenizer from scratch would really be reinventing the wheel. 01:24:11.060 |
And we will also introduce the unknown word, unknown. 01:24:16.800 |
If our tokenizer sees a word that it doesn't recognize in its vocabulary, it will replace 01:24:24.120 |
It will map it to the number corresponding to this word, unknown. 01:24:31.580 |
The pre-tokenizer means basically that we split by whitespace. 01:24:36.320 |
And then we train, we build the trainer to train our tokenizer. 01:25:24.340 |
So it will split words using the whitespace and using the single words. 01:25:31.340 |
One is unknown, which means that if you cannot find that particular word in the vocabulary, 01:25:39.700 |
It will also have the padding, which we will use to train the transformer, the start of 01:25:45.180 |
sentence and the end of sentence special tokens. 01:25:47.820 |
Min frequency means that for a word to appear in our vocabulary, it has to have at least that number of occurrences. 01:26:04.020 |
We use this method, which means we build first a method that gives all the sentences from 01:26:38.220 |
Okay, so let's build also this method called getAllSentence so that we can iterate through 01:26:59.140 |
the data set to get all the sentences corresponding to the particular language for which we are 01:27:17.200 |
As you remember, each item in the data set, it's a pair of sentences, one in English, 01:27:22.680 |
We just want to extract one particular language. 01:27:35.400 |
And from this pair, we extract only the one language that we want. 01:27:43.520 |
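Here is a sketch of these two helpers, the sentence generator and the tokenizer builder, using the HuggingFace tokenizers classes mentioned above; the function names, the config key tokenizer_file and the min_frequency value of 2 are my own assumptions:

```python
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

def get_all_sentences(ds, lang):
    # each item is a pair of translations; yield only the language we want
    for item in ds:
        yield item['translation'][lang]

def get_or_build_tokenizer(config, ds, lang):
    # e.g. config['tokenizer_file'] = "tokenizer_{0}.json" -> tokenizer_en.json, tokenizer_it.json
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not tokenizer_path.exists():
        tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))        # unknown words map to [UNK]
        tokenizer.pre_tokenizer = Whitespace()                     # split by whitespace
        trainer = WordLevelTrainer(
            special_tokens=['[UNK]', '[PAD]', '[SOS]', '[EOS]'],   # unknown, padding, start/end of sentence
            min_frequency=2)                                       # keep words appearing at least twice
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer
```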
Now let's write the code to load the data set and then to build the tokenizer. 01:27:49.360 |
We will call this method getDataset and which also takes the configuration of the model, 01:28:04.800 |
Okay, HuggingFace allows us to download its data sets very easily. 01:28:12.680 |
We just need to tell him what is the name of the data set. 01:28:17.920 |
And then tell him what is the subset we want. 01:28:20.920 |
We want the subset that is English to Italian, but we want to also make it configurable for 01:28:41.040 |
We will have two parameters in the configuration. 01:28:44.140 |
One is called languageSource and one is called languageTarget. 01:28:57.040 |
Later we can also define what split we want of this data set. 01:29:00.960 |
In our case, there is only the training split in the original data set from HuggingFace, 01:29:07.640 |
but we will split by ourself into the validation and the training data. 01:29:27.080 |
This is the raw data set and we also have the target. 01:29:46.240 |
Okay, now, because we only have the training split from HuggingFace, we can split it by 01:29:52.760 |
by ourself into a training and the validation. 01:29:55.520 |
We keep 90% of the data for training and 10% for validation. 01:30:45.880 |
The method random_split, a method from PyTorch, allows us to split a data set 01:31:02.520 |
So in this case, it means split this data set into this two smaller data set, one of 01:31:23.280 |
Let's also import the ones that we will need later: DataLoader and random_split. 01:31:39.100 |
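A sketch of the loading-and-splitting part just described; the config keys lang_src and lang_tgt stand in for the "language source" and "language target" parameters mentioned above and are my own naming:

```python
from datasets import load_dataset
from torch.utils.data import random_split

def get_ds(config):
    # only a 'train' split exists for opus_books; the subset is e.g. "en-it"
    ds_raw = load_dataset('opus_books', f"{config['lang_src']}-{config['lang_tgt']}", split='train')

    # one tokenizer per language
    tokenizer_src = get_or_build_tokenizer(config, ds_raw, config['lang_src'])
    tokenizer_tgt = get_or_build_tokenizer(config, ds_raw, config['lang_tgt'])

    # keep 90% of the data for training and 10% for validation
    train_ds_size = int(0.9 * len(ds_raw))
    val_ds_size = len(ds_raw) - train_ds_size
    train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])
    # ... next these are wrapped into the bilingual dataset and the data loaders
    return train_ds_raw, val_ds_raw, tokenizer_src, tokenizer_tgt
```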
The data set that our model will use to access the tensors directly, because now we just 01:31:44.560 |
created the tokenizer and we just loaded the data, but we need to create the tensors that 01:31:54.080 |
Let's call it bilingual data set and for that we create a new file. 01:32:31.440 |
We will call the data set, we will call it bilingual data set. 01:32:41.620 |
Okay as usual we define the constructor and in this constructor we need to give him the 01:32:49.920 |
data set downloaded from HuggingFace, the tokenizer of the source language, the tokenizer 01:32:55.880 |
of the target language, the source language, the name of the source language, the name 01:33:00.920 |
of the target language and the sequence length that we will use. 01:33:37.320 |
We can also save the tokens, the particular tokens that we will use to create the tensors 01:33:46.840 |
So we need the start of sentence, end of sentence and the padding token. 01:33:50.640 |
So how do we convert the token start of sentence into a number, into the input ID? 01:33:57.840 |
There is a special method of the tokenizer to do that, so let's do it. 01:34:02.920 |
So this is the start of sentence token, we want to build it into a tensor. 01:34:10.680 |
This tensor will contain only one number which is given by, we can use this tokenizer from 01:34:17.480 |
the source or the target, it doesn't matter because they both contain these particular 01:34:24.640 |
This is the method to convert the token into a number, so start of sentence and the type 01:34:33.540 |
of this token, of this tensor is, we want it long because the vocabulary can be more 01:34:46.600 |
than 32-bit long, the vocabulary size, so we usually use the long 64-bit. 01:34:55.160 |
And we do the same for the end of sentence and the padding token. 01:35:23.100 |
We also need to define the length method of this dataset, which tells the length of the 01:35:33.260 |
dataset itself, so basically just the length of the dataset from hugging face, and then 01:35:55.540 |
First of all we will extract the original pair from the hugging face dataset, then we 01:36:38.500 |
And finally we convert each text into tokens, and then into input IDs. 01:36:48.220 |
We will first, the tokenizer will first split the sentence into single words, and then will 01:36:54.020 |
map each word into its corresponding number in the vocabulary, and it will do it in one 01:36:58.700 |
pass only; this is done by the encode method, and .ids gives us the input IDs, so the numbers 01:37:18.600 |
corresponding to each word in the original sentence, and it will be given as an array. 01:37:36.260 |
Now as you remember, we also need to pad the sentence to reach the sequence length. 01:37:43.420 |
This is really important because we want our model to always work, I mean the model always 01:37:49.220 |
works with a fixed length, sequence length, but we don't have enough words in every sentence, 01:37:54.960 |
so we use the padding token, so this PAD here, as the padding token to fill the sentence 01:38:04.020 |
So we calculate how many padding tokens we need to add for the encoder side and for the 01:38:07.940 |
decoder side, which is basically how many we need to reach the sequence length. 01:38:24.260 |
So we already have this amount of tokens, we need to reach this one, but we will add 01:38:28.780 |
also the start of sentence token and the end of sentence token to the encoder side, so 01:38:50.900 |
If you remember my previous video, when we do the training, we add only the start of 01:38:57.420 |
sentence token to the decoder side, and then in the label we only add the end of sentence 01:39:04.660 |
So in this case we only need to add one token, special token to the sentence. 01:39:09.160 |
We also make sure that this sequence length that we have chosen is enough to represent 01:39:15.500 |
all the sentences in our dataset, and if we chose too small one, we want to raise an exception. 01:39:25.500 |
So basically this number of padding tokens should never become negative. 01:39:42.340 |
Okay, now let's build the two tensors for the encoder input and for the decoder input, 01:40:02.820 |
So one sentence will be sent to the input of the encoder, one sentence will be sent 01:40:08.000 |
to the input of the decoder, and one sentence is the one that we expect as the output of 01:40:28.360 |
We concatenate tensors with torch.cat; okay, for the encoder input we concatenate the following. 01:40:35.460 |
First is the start of sentence token, then the tokens of the source text, then the end 01:40:56.660 |
of sentence token, and then enough padding tokens to reach the sequence length. 01:41:05.260 |
We already calculated how many padding tokens we need to add to this sentence, so let's 01:41:36.160 |
And this is the encoder input, so let me write some comment here. 01:41:50.780 |
Then we build the decoder input, which is also a concatenation of tokens. 01:42:01.680 |
In this case we don't have the end of sentence token, we just have the start of sentence. 01:42:26.640 |
And finally we add enough padding tokens to reach the sequence length. 01:42:33.560 |
We already calculated how many we need, just use this value now. 01:43:00.120 |
In the label we only add the end of sentence token. 01:43:16.400 |
Because we need the same number of padding tokens as for the decoder input. 01:43:25.480 |
Just for debugging, let's double check that we actually reach the sequence length. 01:43:37.160 |
Ok, now that we have made this check, let me also write some comments here. 01:44:01.240 |
Here we are only adding SOS to the decoder input. 01:44:09.680 |
And here is add EOS to the label, what we expect as output from the decoder. 01:44:25.280 |
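Here is a sketch of those three concatenations inside __getitem__; enc_input_tokens / dec_input_tokens are the token id lists produced by the tokenizers, the two padding counts were computed just above, and self.pad_token_id (an integer id) plus the SOS/EOS one-element tensors are assumed to have been prepared in the constructor:

```python
import torch

# encoder input: SOS + source tokens + EOS + padding
encoder_input = torch.cat([
    self.sos_token,
    torch.tensor(enc_input_tokens, dtype=torch.int64),
    self.eos_token,
    torch.tensor([self.pad_token_id] * enc_num_padding_tokens, dtype=torch.int64),
])

# decoder input: SOS + target tokens + padding (no EOS here)
decoder_input = torch.cat([
    self.sos_token,
    torch.tensor(dec_input_tokens, dtype=torch.int64),
    torch.tensor([self.pad_token_id] * dec_num_padding_tokens, dtype=torch.int64),
])

# label: target tokens + EOS + padding (what we expect the decoder to produce)
label = torch.cat([
    torch.tensor(dec_input_tokens, dtype=torch.int64),
    self.eos_token,
    torch.tensor([self.pad_token_id] * dec_num_padding_tokens, dtype=torch.int64),
])

# double check that we really reached the sequence length
assert encoder_input.size(0) == self.seq_len
assert decoder_input.size(0) == self.seq_len
assert label.size(0) == self.seq_len
```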
Now we can return all these tensors so that our training can use them. 01:44:32.360 |
We return a dictionary comprised of encoder input. 01:44:46.880 |
Then we have the decoder input, which is also just a sequence length number of tokens. 01:45:11.240 |
As you remember, we are increasing the size of the encoder input sentence by adding padding 01:45:20.180 |
But we don't want these padding tokens to participate in the self-attention. 01:45:24.820 |
So what we need is to build a mask that says that we don't want these tokens to be seen 01:45:38.400 |
We just say that all the tokens that are not padding are okay. 01:45:44.120 |
All the tokens that are padding are not okay. 01:45:50.680 |
We also unsqueeze to add the sequence dimension and also to add the batch dimension later. 01:46:03.160 |
So this is 1, 1 sequence length, because this will be used in the self-attention mechanism. 01:46:12.200 |
However, for the decoder, we need a special mask that is a causal mask, which means that 01:46:19.640 |
each word can only look at the previous word and each word can only look at non-padding 01:46:28.240 |
So we don't want, again, we don't want the padding tokens to participate in the self-attention. 01:46:32.560 |
We only want real words to participate in this. 01:46:35.680 |
And we also don't want each word to watch at words that come after it, but only that 01:46:45.160 |
So I will use a method here called causal mask that will build it. 01:46:50.600 |
So now I just call it to show you how it's used, and then we will proceed to build it. 01:47:00.760 |
So in this case, we don't want the padding tokens, and we add the necessary dimensions. 01:47:09.480 |
And also we do a Boolean AND with causal mask, which is a method that we will build right 01:47:22.360 |
And this causal mask needs to build a matrix of size sequence length to sequence length. 01:47:27.240 |
What is sequence length is basically the size of our decoder input. 01:47:39.000 |
So this is (1, sequence length), combined with an AND with (1, sequence length, 01:47:50.640 |
sequence length), and this can be broadcasted. 01:48:02.480 |
Causal mask basically means that we want, let's go back to the slides actually, as you 01:48:07.280 |
remember from the slides, we want each word in the decoder to only watch words that come 01:48:13.800 |
So what we want is to make all these values above this diagonal that represents the multiplication, 01:48:20.360 |
this matrix represents the multiplication of the queries by the keys in the self-attention 01:48:28.640 |
So "your" cannot watch the words "cat is a lovely cat". 01:48:33.360 |
It can only watch itself, but this word here, for example, this word lovely can watch everything 01:48:40.040 |
So from your up to lovely itself, but not the word cat that comes after it. 01:48:45.540 |
So what we do is we want all these values here to be masked out. 01:48:51.440 |
So which also means that we want all the values above this diagonal to be masked out. 01:48:56.880 |
And there is a very practical method in PyTorch to do it. 01:49:04.700 |
So the mask is basically torch.triu, which means give me every value that is above the 01:49:15.440 |
So we want a matrix made of all ones. 01:49:23.180 |
And this method will return every value above the diagonal and everything else will become 01:49:30.880 |
So we use diagonal=1, and for the type, we want it to be integer. 01:49:40.000 |
And what we do is return mask is equal to zero. 01:49:43.460 |
So this will return all the values above the diagonal and everything below the diagonal 01:49:51.760 |
So we say, okay, everything that is zero will become true with this expression and 01:49:56.040 |
everything that is not zero will become false. 01:50:03.540 |
So this mask will be 1 by sequence length by sequence length, which is exactly what we need. 01:50:26.000 |
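A sketch of the causal_mask helper and of the two masks as just described: the encoder mask hides only padding and has shape (1, 1, seq_len), while the decoder mask combines the padding mask with the causal mask through a Boolean AND (pad_token_id is the same assumed attribute as before):

```python
import torch

def causal_mask(size):
    # ones above the diagonal, then invert: True on and below the diagonal (allowed positions)
    mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
    return mask == 0   # (1, size, size)

# inside __getitem__, returned together with the tensors above:
encoder_mask = (encoder_input != self.pad_token_id).unsqueeze(0).unsqueeze(0).int()   # (1, 1, seq_len)
decoder_mask = (decoder_input != self.pad_token_id).unsqueeze(0).int() \
               & causal_mask(decoder_input.size(0))                                   # (1, seq_len, seq_len) after broadcasting
```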
The label is also of size sequence length, and then we have the source text just for visualization; we can send it 01:50:44.380 |
Now let's go back to our training method to continue writing the training loop. 01:50:50.620 |
So now that we have the data set, we can create it. 01:50:55.120 |
We can create two data sets, one for training, one for validation, and then we send it to 01:51:01.120 |
a data loader and finally to our training loop. 01:51:18.280 |
We also import the causal mask, which we will need later. 01:51:43.480 |
What is our source language? it's in the configuration. 01:52:09.480 |
But the only difference is that we use this one now and the rest is same. 01:52:15.960 |
We also, just for choosing the max sequence length, we also want to watch what is the 01:52:21.200 |
maximum length of each sentence in the source and the target for each of the two splits 01:52:28.760 |
So that if we choose a very small sequence length, we will know. 01:52:46.840 |
Basically what we do, I load each sentence from each language, from the source and the 01:52:52.640 |
I convert into IDs using the tokenizer and I check the length. 01:52:56.340 |
If the length is, let's say 180, we can choose 200 as sequence length, because it will cover 01:53:02.640 |
all the possible sentences that we have in this data set. 01:53:06.360 |
If it's, let's say 500, we can use 510 or something like this, because we also need 01:53:11.160 |
to add the start of sentence and the end of sentence tokens to these sentences. 01:53:39.420 |
This is the source IDs, then let's create also the target IDs, and this is the language 01:53:50.240 |
And then we just say the source maximum length is the maximum of itself 01:54:02.640 |
and the length of the current sentence; the target is the same, with the target IDs. 01:54:11.760 |
Then we print these two values, we also do it for the target. 01:54:30.440 |
Now we can proceed to create the data loaders. 01:54:42.760 |
We define the batch size according to our configuration, which we still didn't define, 01:54:47.160 |
but you can already guess what are its values. 01:55:07.500 |
For the validation, I will use a batch size of one, because I want to process each sentence 01:55:17.640 |
And this method returns the data loader of the training, the data loader of the validation, 01:55:24.180 |
the tokenizer of the source language and the tokenizer of the target language. 01:55:34.960 |
So let's define a new method called getModel, which will, according to our configuration, 01:55:40.880 |
our vocabulary size, build the model, the transformer model. 01:55:52.260 |
So the model is, we didn't import the model, so let's import it. 01:56:12.280 |
The source vocabulary size and the target vocabulary size. 01:56:25.200 |
And we have the sequence length of the source language and the sequence length of the target 01:56:37.080 |
And then we have the d_model, which is the size of the embedding. 01:56:43.800 |
We can keep all the rest, the default, as in the paper. 01:56:49.640 |
If the model is too big for your GPU to be trained on, you can try to reduce the number 01:56:56.640 |
Of course, it will impact the performance of the model. 01:57:00.640 |
But I think given the dataset, which is not so big and not so complicated, it should not 01:57:06.260 |
be a big problem because we are not building a huge dataset anyway. 01:57:10.640 |
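A sketch of this helper, assuming the build_transformer function from the model file takes the two vocabulary sizes, the two sequence lengths and d_model, with every other hyperparameter left at the paper's defaults:

```python
def get_model(config, vocab_src_len, vocab_tgt_len):
    model = build_transformer(vocab_src_len, vocab_tgt_len,
                              config['seq_len'], config['seq_len'],
                              d_model=config['d_model'])
    return model
```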
OK, now that we have the model, we can start building the training loop. 01:57:15.800 |
But before we build the training loop, let me also define this configuration, because it 01:58:19.960 |
keeps coming and I think it's better to define the structure now. 01:57:26.400 |
So let's create a new file called config.py in which we define two methods. 01:57:32.660 |
One is called getConfig and one is to get the path where we will save the weights 01:57:55.240 |
You can choose something bigger if your computer allows it. 01:57:58.440 |
The number of epochs for which we will be training, I would say 20 is enough. 01:58:03.840 |
The learning rate, I am using 10 to the power of -4. 01:58:16.880 |
It's possible to change the learning rate during training. 01:58:21.320 |
It's quite common to give a very high learning rate and then reduce it gradually with every 01:58:27.360 |
We will not be using it because it will just complicate the code a little more and this 01:58:33.840 |
The goal of this video is to teach how the transformer works. 01:58:41.080 |
I have already checked the sequence length that we need for this particular dataset from 01:58:46.760 |
English to Italian, which is 350 is more than enough. 01:58:50.760 |
And the D model that we will be using is the default of 512. 01:59:07.640 |
We will save the model into the folder called weights. 01:59:16.760 |
And the base file name of the model will be tmodel, so transformer model. 01:59:25.480 |
I also built the code to preload the model in case we want to restart the training after 01:59:47.560 |
So tokenizer_en and tokenizer_it, according to the language. 01:59:52.720 |
And this is the experiment name for TensorBoard on which we will save the losses while training. 02:00:09.080 |
Now let's define another method that allows us to find the part where we need to save 02:00:19.520 |
Why I'm creating such a complicated structure is because I will provide also notebooks to 02:00:29.720 |
So we just need to change these parameters to make it work on Google Colab and save the 02:00:36.560 |
I have already created actually this code and it will be provided on GitHub and I will 02:01:02.240 |
Okay, the file is built according to model base name, then the epoch.pt. 02:01:47.560 |
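A sketch of config.py with the values mentioned in this section; the exact key names are my own choice:

```python
from pathlib import Path

def get_config():
    return {
        "batch_size": 8,
        "num_epochs": 20,
        "lr": 1e-4,
        "seq_len": 350,
        "d_model": 512,
        "lang_src": "en",
        "lang_tgt": "it",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": None,                          # set to an epoch number (as a string) to resume training
        "tokenizer_file": "tokenizer_{0}.json",   # becomes tokenizer_en.json / tokenizer_it.json
        "experiment_name": "runs/tmodel",         # TensorBoard folder for the losses
    }

def get_weights_file_path(config, epoch: str):
    # e.g. weights/tmodel_09.pt
    return str(Path('.') / config['model_folder'] / f"{config['model_basename']}{epoch}.pt")
```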
Okay, now let's go back to our training loop. 02:01:52.880 |
Okay, we can build the training loop now finally. 02:01:59.960 |
Okay, first we need to define which device on which we will put all the tensors. 02:02:44.960 |
We make sure that the weights folder is created. 02:03:28.040 |
To get the vocabulary size, there is a method called get_vocab_size. 02:03:37.720 |
And I think we don't have any other parameter. 02:03:41.160 |
And finally, we transfer the model to our device. 02:03:47.520 |
We also start TensorBoard. TensorBoard allows us to visualize the loss, the graphs, the charts. 02:04:48.000 |
Okay, since we also have the configuration that allow us to resume the training in case 02:05:01.000 |
the model crashes or something crashes, let's implement that one. 02:05:05.480 |
And that will allow us to restore the state of the model and the state of the optimizer. 02:05:31.160 |
Let's import this method we defined in the data set. 02:06:57.480 |
Okay, the loss function we will be using is the cross entropy loss. 02:07:07.880 |
We need to tell him what is the ignore index. 02:07:10.160 |
So we want him to ignore the padding token basically. 02:07:14.080 |
We don't want the padding token to contribute to the loss. 02:07:38.160 |
Label smoothing basically allows our model to be less confident about its decision. 02:07:45.040 |
So how to say, imagine our model is telling us to choose the word number three and with 02:07:53.600 |
So what we will do with label smoothing is take a little percentage of that probability 02:07:57.320 |
and distribute to the other tokens so that our model becomes less sure of its choices. 02:08:04.280 |
So kind of less over fit and this actually improves the accuracy of the model. 02:08:12.020 |
So we will use a label smoothing of 0.1, which means from every highest probability token 02:08:19.240 |
we take 0.1 of its probability mass and distribute it to the other tokens. 02:08:28.240 |
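In code, the loss just described might look like this:

```python
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(
    ignore_index=tokenizer_src.token_to_id('[PAD]'),   # the padding token must not contribute to the loss
    label_smoothing=0.1,                               # spread a little probability mass over the other tokens
).to(device)
```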
Okay let's build finally the training loop, we tell the model to train. 02:08:47.320 |
I build a batch iterator for the data loader using tqdm, which will show a very nice progress 02:09:34.280 |
Okay finally we get the tensors, the encoder input. 02:09:54.880 |
The decoder input is batch of decoder input and we also move it to our device, batch to 02:10:44.240 |
Because in the one case we are only telling him to hide only the padding tokens, in the 02:10:51.240 |
other case we are also telling him to hide all these subsequent words, for each word 02:10:57.120 |
to hide all the subsequent words to mask them out. 02:11:02.440 |
Okay now we run the tensors through the transformer. 02:11:10.880 |
So first we calculate the output of the encoder and we encode using what the encoder input 02:11:24.360 |
Then we calculate the decoder output using the encoder output, the source, the 02:11:36.920 |
mask of the encoder, then the decoder input and the decoder mask. 02:11:46.800 |
Okay as we know this the result of this so the output of the model.encode will be a batch 02:11:59.800 |
Also the output of the decoder will be batch sequence length d model. 02:12:08.120 |
But we want to map it back to the vocabulary so we need the projection. 02:12:19.960 |
And this will produce a B so batch sequence length and target vocabulary size. 02:12:29.440 |
Okay now that we have the output of the model we want to compare it with our label. 02:12:34.360 |
So first let's extract the label from the batch. 02:12:45.180 |
So what is the label? It's B, so batch, by sequence length, in which each position tells us, so the 02:12:52.700 |
label, for each batch and sequence position, so for each dimension, tells us what is the 02:13:00.140 |
position in the vocabulary of that particular word, and we want these two to be comparable, 02:13:08.660 |
so to compute the loss we first transform the projection output; I show you now: projection output .view 02:13:28.620 |
Okay, what does this do? This basically transforms, I show you here, this size into this size: 02:13:40.600 |
(B multiplied by sequence length, target vocabulary size). 02:13:49.140 |
Okay because we want to compare it with this. 02:13:52.600 |
This is how the cross entropy wants the tensors to be. 02:14:04.660 |
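A sketch of that comparison in code:

```python
# proj_output: (B, seq_len, tgt_vocab_size), label: (B, seq_len)
# flatten both so cross entropy compares (B * seq_len, tgt_vocab_size) against (B * seq_len)
loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
```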
Okay, now that we have calculated the loss, we can update our progress bar, this one, with 02:14:39.560 |
This is this will show the loss on our progress bar. 02:15:07.260 |
Okay now we can back propagate the loss so loss.backward and finally we update the weights 02:15:17.620 |
of the model so that is the job of the optimizer and finally we can zero out the grad and we 02:15:28.040 |
move the global step by one the global step is being used mostly for TensorBoard to keep 02:15:32.820 |
track of the loss we can save the model every epoch okay model file name which we get from 02:15:45.700 |
our special methods this one we tell him the configuration we have and the name of the 02:15:54.140 |
file which is the epoch but with zeros in front and we save our model. 02:16:06.860 |
It is very good idea when we want to be able to resume the training to also save not only 02:16:12.740 |
the state of the model but also the state of the optimizer because the optimizer also 02:16:18.340 |
keep tracks of some statistics one for each weight to understand how to move each weight 02:16:24.820 |
independently and usually actually I saw that the optimizer dictionary is quite big so even 02:16:35.860 |
if it's big if you want your training to be resumable you need to save it otherwise the 02:16:40.800 |
optimizer will always start from zero and we'll have to figure out from zero even if 02:16:45.980 |
you start from a previous epoch, how to move each weight. So every time we save some snapshot, 02:16:58.660 |
we save the state of the model, which is all the weights of the model; we also want to save the optimizer 02:17:09.020 |
state; let's also save the global step; and we want to save all of this into the file 02:17:28.240 |
name, so the model file name, and that's it. Now let's build the code to run this, so if __name__ 02:17:44.280 |
I really find the warnings frustrating so I want to filter them out because I have some 02:17:50.260 |
a lot of libraries especially CUDA I already know what's the content and so I don't want 02:17:57.040 |
to visualize them every time but for sure for you guys I suggest watching them at least 02:18:02.680 |
once to understand if there is any big problem otherwise they're just complaining from CUDA 02:18:22.440 |
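A compact sketch of what was just described: the end-of-epoch checkpoint that also stores the optimizer state and the global step, and the entry point that filters the warnings and starts training (train_model and get_weights_file_path are the names assumed in this walkthrough):

```python
import warnings
import torch

# at the end of every epoch, save a resumable snapshot
model_filename = get_weights_file_path(config, f"{epoch:02d}")
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'global_step': global_step,
}, model_filename)

if __name__ == '__main__':
    warnings.filterwarnings('ignore')   # silence the (mostly CUDA) warnings discussed above
    config = get_config()
    train_model(config)
```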
okay let's try to run this code and see if everything is working fine we should what 02:18:29.760 |
we expect is that the code should download the data set the first time then it should 02:18:35.240 |
create the tokenizer and save it into its file and it should also start training the 02:18:42.960 |
model for 30 epochs of course it will never finish but let's do it let me check again 02:18:48.760 |
the configuration tokenizer okay let's run it 02:19:13.520 |
okay it's building the tokenizer and we have some problem here sequence length okay finally 02:19:20.040 |
the model is training I show you recap you guys what I had mistaken first of all the 02:19:26.920 |
sequence length was written incorrectly there was a capital L here and also in the data 02:19:32.720 |
set I forgot to save it here and here I had it also written capitalized so L was capital 02:19:41.120 |
and now the training is going on and as you can see the training is quite fast or at least 02:19:48.440 |
on my computer actually not so fast but because I chose a batch size of 8 I could try to increase 02:19:55.680 |
it and it's happening on CUDA the loss is decreasing and the weights will be saved here 02:20:03.080 |
so if we reach the end of the epoch it will create the first weight here so let's wait 02:20:07.720 |
until the end of the epoch and see if the weight is actually created before actually 02:20:12.800 |
finishing the training of the model let's do another thing we also would like to visualize 02:20:19.000 |
the output of the model while we are training and this is called validation so we want to 02:20:23.960 |
check how our model is evolving while it is getting trained so what we want to build is 02:20:31.160 |
a validation loop which will allow us to evaluate the model which also means that we want to 02:20:36.940 |
inference from this model and check some sample sentences and see if how they get translated 02:20:43.520 |
so let's start building the validation loop the first thing we do is we build a new method 02:20:48.280 |
called run validation and this method will accept some parameters that we will use for 02:21:01.920 |
now I just write all of them and later I explain how they will be used 02:21:05.500 |
so, we have a new method called run validation and this method will accept some parameters, we will use them later. 02:21:38.500 |
okay the first thing we do to run the validation is we put our model into evaluation mode so 02:21:46.260 |
we do model.eval and this means that this tells PyTorch that we are going to evaluate 02:21:52.100 |
our model and then what we will do we will inference two sentences and see what is the 02:22:30.740 |
so with torch.no_grad() we are disabling the gradient calculation for every tensor 02:22:47.400 |
that we will run inside this with block and this is exactly what we want we just want 02:22:52.100 |
to inference from the model we don't want to train it during this loop so let's get 02:22:58.700 |
a batch from the validation data set because we want to inference only two so we keep a 02:23:05.380 |
count of how many we have already processed and we get the input from this current batch 02:23:13.340 |
I want to remind you that for the validation ds we only have a batch size of 1 [typing] 02:23:28.120 |
this is the encoder input and we can also get the encoder mask 02:23:43.100 |
let's just verify that the size of the batch is actually 1 [typing] 02:24:01.040 |
and now let's go to the interesting part so as you remember when we calculate the when 02:24:09.360 |
we want to inference the model we need to calculate the encoder output only once and 02:24:14.440 |
reuse it for every token that the model will output from the decoder so let's create another 02:24:20.600 |
function that will run the greedy decoding on our model and we will see that it will 02:24:26.520 |
run the encoder only once so let's call this function greedy decode [typing] 02:24:54.080 |
okay let's create some tokens that we will need so the SOS token which is the start of 02:25:01.140 |
sentence; we can get it from either tokenizer, it doesn't matter if it's the source or the 02:25:19.660 |
target; and the EOS token. Okay, and then what we do is we pre-compute the encoder output and reuse 02:25:35.740 |
it for every token we get from the decoder so 02:25:42.180 |
we just give the source and the source mask which is the encoder input and the encoder 02:25:57.340 |
mask we can also call it encoder input and encoder mask then we get the then we okay 02:26:06.800 |
how do we do the inferencing? The first thing we do is give to the decoder the start 02:26:11.820 |
of sentence token, so that the decoder will output the first token of 02:26:18.060 |
the translated sentence. Then, just like we saw in my slides, at every iteration 02:26:24.540 |
we append the previously predicted token to the decoder input, so that the decoder can output the 02:26:31.660 |
next token. Then we take that next token, we append it again to the input of the 02:26:36.620 |
decoder, and we get the successive token. So let's build the decoder input for the first 02:26:42.780 |
iteration, which is only the start of sentence token: 02:27:06.780 |
we fill this one-token tensor with the start of sentence token, 02:27:17.280 |
and it has the same type as the encoder input. Okay, now we will keep asking the decoder 02:27:26.740 |
to output the next token until we reach either the end of sentence token or the max length 02:27:32.420 |
we have defined here. So we can do a while True, and then our first stopping condition 02:27:39.260 |
is if the decoder input, which grows at every step because the output becomes the input of the next step, becomes larger than max_len. 02:27:58.140 |
Why do we have two dimensions here? One is for the batch and one is for the tokens of the decoder input. 02:28:24.820 |
We can use our function causal_mask to say that we don't want the input to watch future tokens, 02:28:39.780 |
and we don't need the other mask, because here we don't have any padding tokens, as you can see. 02:29:07.820 |
We reuse the output of the encoder for every iteration of the loop, we reuse the source 02:29:15.260 |
mask, so the mask of the encoder input, then we give the decoder input along with 02:29:21.220 |
its mask, the decoder mask, and then we get the next token. 02:29:31.860 |
So we get the probabilities of the next token using the projection layer, 02:29:39.860 |
but we only want the projection of the last token, so the next token after the last one we have. 02:29:59.460 |
Then we take the token with the maximum probability: this is the greedy search. 02:30:18.620 |
And then we take this token and we append it back to the decoder input, because it will become part of the input of the next iteration. 02:30:33.700 |
So we take the decoder input and we append the next token, which means we create another tensor. 02:31:08.900 |
Yeah, that should be correct. Okay, if the next token, so if the next word or token, is equal 02:31:25.700 |
to the end of sentence token, then we also stop the loop. 02:31:31.260 |
And this is our greedy search. Now we can just return the output, and the output is basically 02:31:37.660 |
the decoder input, because every time we are appending the next token to it, and we remove the batch dimension. 02:31:47.660 |
And that's our greedy decoding. Now we can use it in the validation 02:31:54.060 |
function, so we can finally get the model output, which is equal to greedy_decode, to which we give the model, the encoder input and mask, the tokenizers, the max length and the device. 02:32:19.620 |
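Putting the pieces together, this is a minimal sketch of the greedy decoding we just described. The methods model.encode / model.decode / model.project and the causal_mask helper are the ones we wrote earlier in the video, but their exact signatures here are assumptions, so adapt them to your own code:

```python
import torch

def causal_mask(size):
    # Same idea as the helper from the dataset file: True where a position is
    # allowed to attend (itself and the past), False for future positions.
    mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
    return mask == 0

def greedy_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device):
    sos_idx = tokenizer_tgt.token_to_id('[SOS]')
    eos_idx = tokenizer_tgt.token_to_id('[EOS]')

    # Pre-compute the encoder output once and reuse it at every decoding step
    encoder_output = model.encode(source, source_mask)
    # The decoder input for the first iteration is only the SOS token
    decoder_input = torch.empty(1, 1).fill_(sos_idx).type_as(source).to(device)

    while True:
        # First stopping condition: the decoder input has reached max_len
        if decoder_input.size(1) == max_len:
            break

        # Causal mask so the decoder cannot look at future tokens (no padding here)
        decoder_mask = causal_mask(decoder_input.size(1)).type_as(source_mask).to(device)

        # Reuse the encoder output and decode one more step
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)

        # Project only the last position and take the most probable token (greedy search)
        prob = model.project(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        decoder_input = torch.cat(
            [decoder_input,
             torch.empty(1, 1).type_as(source).fill_(next_word.item()).to(device)],
            dim=1,
        )

        # Second stopping condition: the model produced the EOS token
        if next_word == eos_idx:
            break

    return decoder_input.squeeze(0)  # remove the batch dimension
```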
And then we want to compare this model output with what we expected, so with the label. 02:32:26.900 |
Let's collect all of these: the input we gave to the model (the source text), what the 02:32:32.820 |
model output (the predicted translation), and what we expected as output (the target text). We 02:32:38.500 |
save all of this in lists, and then at the end of the loop we will print them on the console. 02:32:57.780 |
To get the text of the output of the model we need to use the tokenizer again to convert 02:33:18.500 |
the tokens back into text, and we use of course the target tokenizer, because this is the target language. 02:33:40.460 |
Okay, and now we save all of this into the respective lists. 02:34:13.820 |
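This part of the loop could look roughly like the fragment below; the batch keys 'src_text' and 'tgt_text' and the list names are assumptions based on how we built the dataset:

```python
# Inside the validation loop, after greedy_decode has produced model_out:
source_text = batch['src_text'][0]  # raw source sentence (batch size is 1)
target_text = batch['tgt_text'][0]  # expected translation (the label)
# Convert the predicted token ids back into text with the target tokenizer
model_out_text = tokenizer_tgt.decode(model_out.detach().cpu().numpy())

source_texts.append(source_text)
expected.append(target_text)
predicted.append(model_out_text)

# Print through the function provided by tqdm so we don't break the progress bar
print_msg('-' * 80)
print_msg(f'SOURCE:    {source_text}')
print_msg(f'TARGET:    {target_text}')
print_msg(f'PREDICTED: {model_out_text}')
```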
Why are we using this function called print_msg and not 02:34:18.540 |
just the print of Python? Because in the main training 02:34:23.940 |
loop we are using tqdm, which is our really nice looking progress bar, but it is not suggested 02:34:32.260 |
to print directly on the console when this progress bar is running. To print on the 02:34:38.460 |
console there is a print method provided by tqdm (tqdm.write), and we will give this method 02:34:45.740 |
to this function so that the output does not interfere with the progress bar printing. 02:35:05.860 |
Okay, so we print one message for the source, one for the target and one for the prediction. 02:35:34.300 |
And if we have already processed num_examples sentences, then we just break. So why have we 02:35:48.260 |
created these lists? Actually, we can also send all of this to TensorBoard: 02:35:58.700 |
for example, if we have TensorBoard enabled we can send all of this to the TensorBoard writer, 02:36:04.280 |
and to do that we actually need another library that allows us to calculate some metrics. I 02:36:10.900 |
think we can skip this part, but if you are really interested, in the code I published 02:36:17.620 |
on GitHub you will find that I use a library called torchmetrics, which allows us to 02:36:24.020 |
calculate the character error rate, the BLEU metric, which is really useful for translation 02:36:31.820 |
tasks, and the word error rate. So if you are really interested you can find the code on GitHub. 02:36:39.420 |
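For reference, a rough sketch of how those metrics could be logged with torchmetrics and the TensorBoard writer; the `predicted`/`expected` lists and the `writer`/`global_step` variables are the ones from our loop, and the exact metric setup in my GitHub code may differ slightly:

```python
import torchmetrics

if writer:
    # Character error rate between the predicted and the expected sentences
    cer = torchmetrics.CharErrorRate()(predicted, expected)
    writer.add_scalar('validation cer', cer, global_step)

    # Word error rate
    wer = torchmetrics.WordErrorRate()(predicted, expected)
    writer.add_scalar('validation wer', wer, global_step)

    # BLEU score (expects a list of reference translations per prediction)
    bleu = torchmetrics.BLEUScore()(predicted, [[e] for e in expected])
    writer.add_scalar('validation BLEU', bleu, global_step)

    writer.flush()
```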
But for our demonstration I think it's not necessary, and actually we can also 02:36:47.220 |
remove this, given that we are not doing this part. Okay, so now that we have our run_validation 02:36:55.140 |
method we can just call it. What I usually do is run the validation every few steps, 02:37:05.700 |
but because we want to see it as soon as possible, what we will do is first run it at 02:37:13.880 |
every iteration. We also put the model.train() call inside this loop, so that every time after 02:37:22.100 |
we run the validation the model goes back into its training mode. So now we can just call run_validation 02:37:29.500 |
and give it all the parameters that it needs: we give it the model, the validation data, the tokenizers, the max length and the device. 02:37:57.140 |
For printing messages, are we printing any message? We are, so let's create a lambda that just 02:38:09.020 |
writes the message with tqdm, then we need to give the global step 02:38:25.060 |
and the writer, which we will not use. Okay, now I think we can run the training again. 02:38:58.660 |
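The call we just described looks roughly like this (a sketch: `batch_iterator` is assumed to be the tqdm object wrapping the training dataloader, and the other names are the ones assumed for our training loop):

```python
# For now, called at every iteration inside the training loop:
run_validation(
    model, val_dataloader, tokenizer_src, tokenizer_tgt, config['seq_len'], device,
    lambda msg: batch_iterator.write(msg),  # print without breaking the tqdm bar
    global_step, writer,
)
model.train()  # put the model back into training mode after validating
```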
All right, it looks like it is working. The model is running the validation 02:39:05.780 |
at every step, which is not desirable at all, but at least we know that the greedy search 02:39:10.820 |
is working, or at least it looks like it is working. The model is not predicting 02:39:17.060 |
anything useful: actually it's just predicting a bunch of commas, because it hasn't been trained 02:39:23.980 |
at all. But if we train the model, after a few epochs the 02:39:29.940 |
model should become better and better. So let's stop this training and let's 02:39:36.980 |
put the validation call back where it belongs, so at the end of every epoch, and the model.train() we 02:39:43.880 |
can keep here, no problem. Okay, I will now fast forward to a model that has 02:39:52.860 |
been pre-trained. I pre-trained it for a few hours so that we can inference it and 02:39:58.980 |
visualize the attention. I have copied the pre-trained weights that I pre-calculated, 02:40:05.660 |
and I also created this notebook reusing the functions that we have defined before in the 02:40:10.900 |
train file. The code is very simple: I just copied and pasted the code from the train 02:40:15.980 |
file, I just load the model and run the validation, the same method that we just wrote, and then 02:40:22.460 |
I ran the validation on the pre-trained model. Let's run it again, for example, and as you can see 02:40:28.860 |
the model is inferencing 10 example sentences and the result is not bad. I mean, we can see 02:40:35.020 |
that "Levin smiled" becomes "Levin sorrise", and the prediction "Levin sorrise" is matching; most of them are matching, actually. 02:40:40.420 |
We could also say that it's nearly overfit on this particular data, but this is the power 02:40:47.740 |
of the transformer: I didn't train it for many days, I just trained it for a few hours, if 02:40:53.020 |
I remember correctly, and the results are really, really good. And now let's make 02:40:58.760 |
the notebook that we will use to visualize the attention of this pre-trained model. Given 02:41:06.080 |
the file that we built before, so train.py, you can also train your own model choosing 02:41:10.820 |
the language of your choice. I highly recommend that you change the language and 02:41:15.060 |
try to see how the model is performing, and try to diagnose why the model is performing 02:41:21.240 |
badly, if it's performing badly, or, if it's performing well, try to understand how you can improve 02:41:26.620 |
it further. So let's try to visualize the attention: let's create a new notebook, let's call 02:41:33.660 |
it, let's say, attention_visualization. Okay, so the first thing we do is import all the libraries we need. 02:42:14.060 |
I will also be using a library called Altair. It's a visualization library for charts, 02:42:25.180 |
nothing related to deep learning, it's just for visualization, and in 02:42:30.420 |
particular the visualization function I actually found online, it's not written by me, just 02:42:34.660 |
like most visualization functions you can easily find on the internet if you want 02:42:38.240 |
to build a chart or a histogram, etc. So I am using this library mostly 02:42:43.920 |
because I copied the code from the internet to visualize the attention, but all the rest is my own 02:42:48.700 |
code. So let's import all of this, and of course you will have to 02:43:16.780 |
install this particular library when you run the code on your computer. Let's also define 02:43:22.280 |
the device, for which we can just copy the code from the train file, 02:43:38.760 |
and then we load the model, which we can also copy from there, like this. Okay, let's paste it here, 02:43:51.640 |
and this one becomes the vocabulary of the source and the vocabulary of the target. 02:44:18.320 |
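As a rough sketch, the setup cell of the notebook could look like this; the helpers get_model, get_ds, get_config and the checkpoint path/layout are assumptions based on the files we wrote, so adapt the names to your own code:

```python
import torch
from config import get_config          # assumed config helper from our project
from train import get_model, get_ds    # assumed helpers from our train.py

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
config = get_config()

# Rebuild the dataloaders and tokenizers, then the model with the two vocabulary sizes
train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)

# Load the pre-trained weights (path and checkpoint layout are assumptions)
state = torch.load('weights/tmodel_latest.pt', map_location=device)
model.load_state_dict(state['model_state_dict'])
model.eval()
```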
Okay, now let's make a function to load a batch. 02:44:22.560 |
So we have a function called load_batch, which is a function that loads a batch, and 02:45:27.960 |
then I convert the token ids back into tokens using the tokenizer. 02:47:10.520 |
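Here is a minimal sketch of what such a function could look like, reusing greedy_decode and the names from the setup cell above; load_batch itself, the dataloader variable and the batch keys are assumptions:

```python
from train import greedy_decode  # assumed to be importable from our train.py

def load_batch():
    # Take one batch (batch size 1) from the validation dataloader
    batch = next(iter(val_dataloader))
    encoder_input = batch['encoder_input'].to(device)
    encoder_mask = batch['encoder_mask'].to(device)
    decoder_input = batch['decoder_input'].to(device)

    with torch.no_grad():
        model_out = greedy_decode(model, encoder_input, encoder_mask,
                                  tokenizer_src, tokenizer_tgt, config['seq_len'], device)

    # Convert the ids back into human-readable tokens for the chart labels
    encoder_input_tokens = [tokenizer_src.id_to_token(int(i)) for i in encoder_input[0].cpu().numpy()]
    decoder_input_tokens = [tokenizer_tgt.id_to_token(int(i)) for i in decoder_input[0].cpu().numpy()]
    return batch, encoder_input_tokens, decoder_input_tokens
```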
Okay, now I will build the necessary functions to visualize the attention. This code 02:47:27.520 |
is nothing interesting from a learning point of view, 02:47:36.240 |
so I will copy it, because it's quite long to write, 02:47:39.040 |
and the salient parts I will explain, of course. 02:47:53.120 |
For example, we have the attention in three positions: the encoder self-attention, the decoder self-attention and the cross-attention. 02:48:09.000 |
How do we get the information about the attention? 02:48:14.800 |
We choose which layer and which head we want to get the attention from. 02:48:36.960 |
In the multi-head attention block we not only return the output to the next layer, we also saved the attention scores, 02:48:51.200 |
so now we can just retrieve them and visualize them. 02:50:10.680 |
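A sketch of how the stored scores can be retrieved, assuming the attribute names we used in the model file (layers, self_attention_block, cross_attention_block, attention_scores); these names are assumptions here, so check them against your own model.py:

```python
def get_attn_map(attn_type: str, layer: int, head: int):
    # Pick the multi-head attention block of the requested layer...
    if attn_type == 'encoder':
        block = model.encoder.layers[layer].self_attention_block
    elif attn_type == 'decoder':
        block = model.decoder.layers[layer].self_attention_block
    elif attn_type == 'encoder-decoder':
        block = model.decoder.layers[layer].cross_attention_block
    else:
        raise ValueError(f'Unknown attention type: {attn_type}')
    # ...and read the scores it stored during the forward pass.
    # Shape: (batch, heads, seq_len, seq_len); the batch size is 1 here.
    return block.attention_scores[0, head].data
```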
And then we visualize what the source and the target are. 02:51:12.440 |
We also look for the first occurrence of the padding token: 02:51:15.000 |
because this is the batch taken from the dataset, 02:51:17.320 |
which is already the tensor built for training, it is padded up to the sequence length, 02:51:23.000 |
so the position of the first padding token gives us the number of actual tokens in our sentence. 02:51:48.680 |
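A tiny sketch of that idea, assuming the '[PAD]' token name we used when building the tokenizer:

```python
# Index of the first [PAD] token = number of real tokens in the padded sentence
pad_id = tokenizer_src.token_to_id('[PAD]')
sentence_len = int((batch['encoder_input'][0] == pad_id).nonzero()[0].item())
```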
this function was wrong, so now it should work. 02:51:54.120 |
Okay, this sentence is too small, let's get a longer one. 02:52:00.920 |
You cannot remain as you are, especially you. 02:52:08.920 |
Okay, let's print the attention for some of the layers; 02:52:15.960 |
if you remember, the parameter N is equal to six, 02:52:28.320 |
and the heads are zero, one, two, three, four, five, six, and seven. 02:52:31.860 |
Okay, let's first visualize the encoder self-attention. 02:53:07.040 |
So it's the same sentence that is attending to itself. 02:53:11.160 |
So we need to provide the input sentence of the encoder for both the rows and the columns. 02:53:21.120 |
Okay, let's say we want to visualize no more than 20 tokens. 02:53:38.260 |
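The chart function I copied is roughly of this shape: a heatmap of the attention scores with the tokens on the two axes. This is a hedged sketch using Altair; the exact encoding and styling in my notebook differ:

```python
import altair as alt
import pandas as pd

def attention_heatmap(attn, row_tokens, col_tokens, max_len=20):
    # Flatten the (seq_len, seq_len) score matrix into a long dataframe for Altair
    records = []
    for r in range(min(max_len, len(row_tokens))):
        for c in range(min(max_len, len(col_tokens))):
            records.append({
                'row': f'{r}: {row_tokens[r]}',
                'col': f'{c}: {col_tokens[c]}',
                'score': float(attn[r, c]),
            })
    df = pd.DataFrame(records)
    return (alt.Chart(df)
              .mark_rect()
              .encode(x='col:O', y='row:O', color='score:Q',
                      tooltip=['row', 'col', 'score'])
              .properties(width=400, height=400))
```

For the encoder self-attention we would call it with the encoder tokens on both axes, for example attention_heatmap(get_attn_map('encoder', 0, 0), encoder_input_tokens, encoder_input_tokens).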
We expect the values along the diagonal to be high, 02:53:42.280 |
because it's the dot product of each token with itself. 02:53:50.160 |
For example, we can look at how the start of sentence token behaves, 02:53:54.840 |
at least for head zero and layer zero; 02:54:02.880 |
other heads, though, do learn some very small mappings. 02:54:10.840 |
By hovering over the cells we can see the actual values of the self-attention. 02:54:16.360 |
For example, we can see the attention is very strong here, 02:54:19.480 |
so the words "especially" and "specially" are related. 02:54:32.520 |
Because each head will watch different aspects of each token, 02:54:47.360 |
we also hope that they learn different kinds of mappings; 02:54:57.360 |
for each head we also have different WQ, WK, and WV matrices, 02:55:01.880 |
so they should also learn different relationships. 02:55:06.560 |
Now we may also want to visualize the decoder self-attention. 02:55:13.960 |
Let me just copy the code and change the parameters; 02:55:36.100 |
the tokens that will be on the rows and the columns change: 02:55:43.160 |
they become decoder input tokens and decoder input tokens, 02:55:50.600 |
because we are using the decoder self-attention. 02:56:07.720 |
The one I find most interesting is the cross attention. 02:56:13.320 |
Okay, let me just copy the code and run it again. 02:56:32.520 |
So here on the rows we will show the encoder input tokens 02:56:36.960 |
and on the columns we will show the decoder input tokens. 02:56:43.280 |
Okay, this is more or less how the interaction between the two sentences looks. 02:56:56.960 |
So this is where we find the cross-attention, calculated 02:57:01.960 |
using the keys and the values coming from the encoder, while the queries come from the decoder. 02:57:09.600 |
So this is actually where the translation task happens. 02:57:24.880 |
So I invite you guys to run the code by yourself. 02:57:31.160 |
The best way to learn is to write the code along with me, following the video. 02:57:35.040 |
You can pause the video, you can write the code by yourself. 02:57:39.800 |
Okay, let me give you some practical examples. 02:57:42.400 |
For example, when I'm writing the model code, you can watch how I write the code 02:57:48.440 |
for one particular layer, then stop the video and try to write it by yourself. 02:57:59.920 |
And if, after one or two minutes, 02:58:02.440 |
you really cannot figure out what the problem is, resume the video and check the solution. 02:58:08.200 |
Some things, of course, you cannot come up with by yourself; 02:58:14.200 |
the model code is basically just an application of formulas, 02:58:22.320 |
but you need to see how all the layers are interacting with each other. 02:58:30.440 |
The training part actually is quite standard; 02:58:38.840 |
the interesting part is how we calculate the loss. 02:58:59.760 |
And I hope in the next videos to make more examples 02:59:04.680 |
of transformers and other models that I am familiar with. 02:59:13.320 |
Please let me know in the comments if there is something that you don't understand or you want me to explain better. 02:59:16.960 |
I will also, for sure, follow the comments section.