
Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.


Chapters

0:00 Introduction
1:20 Input Embeddings
4:56 Positional Encodings
13:30 Layer Normalization
18:12 Feed Forward
21:43 Multi-Head Attention
42:41 Residual Connection
44:50 Encoder
51:52 Decoder
59:20 Linear Layer
61:25 Transformer
77:00 Task overview
78:42 Tokenizer
91:35 Dataset
115:25 Training loop
140:05 Validation loop
161:30 Attention visualization

Transcript

Hello guys, welcome to another episode about the transformer. In this episode we will be building the transformer from scratch using PyTorch, so coding it from zero. We will be building the model, and we will also build the code for training it, for inferencing and for visualizing the attention scores. Stick with me, because it's gonna be a long video, but I assure you that by the end of the video you will have a deep knowledge of the transformer model, not only from a conceptual point of view but also from a practical point of view. We will be building a translation model, which means that our model will be able to translate from one language to another. I chose a dataset called Opus Books, which is made of sentences taken from famous books. I chose English to Italian because I'm Italian, so I can understand and tell if the translation is good or not, but I will show you at which point you can change the language, so you can test the same model with the language of your choice. Let's get started!

Let's open the IDE of our choice, in my case I really love Visual Studio Code, and let's create our first file, which is the model of the transformer. Okay, let's go have a look at the transformer model first, so we know which part we are going to build first, and then we will build each part one by one. The first part that we will be building is the input embeddings. As you can see, the input embeddings take the input and convert it into an embedding. What is the input embedding?

As you remember from my previous video, the input embeddings allow us to convert the original sentence into a vector of 512 dimensions. For example, in this sentence "your cat is a lovely cat", first we convert the sentence into a list of input IDs, that is, numbers that correspond to the position of each word inside the vocabulary, and then each of these numbers corresponds to an embedding, which is a vector of size 512. So let's build this layer first. The first thing we need to do is to import torch, and then we need to create our class. This is the constructor: we need to tell it what is the dimension of the model, so the dimension of the vector, which in the paper is called d_model, and we also need to tell it what is the vocabulary size, so how many words there are in the vocabulary. (typing) We save these two values, and now we can create the actual embedding. Actually, PyTorch already provides us with a layer that does exactly what we want, that is, given a number it will give you the same vector every time, and this is exactly what an embedding does: it's just a mapping between numbers and vectors of size 512; 512 in our case is d_model. So this is done by the embedding layer, nn.Embedding, with vocab_size and d_model. Let me check why my autocomplete is not working. (typing) Okay, so now let's implement the forward method. (typing) What we do in the embedding is that we just use the embedding layer provided by PyTorch to do this mapping, so return self.embedding(x). Now, actually, there is a little detail written in the paper. Let's have a look at the paper; actually let's go here, and if we check "Embeddings and Softmax" we will see that in the embedding layers we multiply the weights of the embedding by the square root of d_model. So what the authors do is: they take the embedding given by this embedding layer, which I remind you is just a dictionary kind of layer that maps numbers to the same vector every time, and this vector is learned by the model, and we just multiply it by math.sqrt(d_model). (typing) You also need to import math. Okay, now the input embeddings are ready. Let's go to the next module. The next module we are going to build is the positional encoding. Let's also have a very quick look at what the positional encodings are. So we saw before that our original sentence gets mapped to a list of vectors by the embeddings layer, and these are our embeddings. Now we want to convey to the model the information about the position of each word inside the sentence, and this is done by adding another vector of the same size as the embedding, so of size 512, that includes some special values given by a formula that I will show later, which tells the model that this particular word occupies this position in the sentence. So we will create these vectors, called the positional embeddings, and we will add them to the embeddings. Okay, let's go do it. Let's define the class PositionalEncoding (typing) and we define the constructor. Okay, what we need to give to the constructor is for sure the d_model, because this is the size of the vector that the positional encoding should be, and the sequence length, which is the maximum length of the sentence, because we need to create one vector for each position, and we also need to give the dropout; dropout is to make the model overfit less. (typing) Okay, let's actually build the positional encoding. First of all, for the positional encoding we will build a matrix of shape (seq_len, d_model).
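Before we fill in this matrix, here is a minimal sketch of the input embeddings layer we just finished, based purely on the description above (the class and variable names are my own choice and may differ slightly from the code written on screen):

```python
import math

import torch.nn as nn


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # nn.Embedding maps each token id to a learned vector of size d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # (batch, seq_len) --> (batch, seq_len, d_model)
        # As in the paper ("Embeddings and Softmax"), scale by sqrt(d_model)
        return self.embedding(x) * math.sqrt(self.d_model)
```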

Why (seq_len, d_model)? Because we need vectors of d_model size, so 512, but we need seq_len of them, because the maximum length of the sentence is seq_len. So let's do it. (typing) Okay, before we create the matrix, let's have a look at the formula used to create the positional encoding. This is the slide from my previous video, so let's have a look at how to build these vectors. As you remember, we have a sentence, let's say in this case with three words. We use these two formulas taken from the paper: we create a vector of size 512 for each possible position, so up to seq_len, and in the even positions of the vector we apply the first formula, in the odd positions we apply the second formula. In this case I will actually simplify the calculation, because I saw online it has been simplified as well, so we will do a slightly modified calculation in log space. This is for numerical stability: when you apply the exponential and then the log of something inside the exponential, the result is the same number, but it's more numerically stable. So first we create a vector called position that will represent the position of the word inside the sentence, and this vector can go from 0 to seq_len - 1, so actually we are creating a tensor of shape (seq_len, 1). Okay, now we create the denominator of the formula, and these are the two terms we see inside the formula. Let's go back to the slide: the first tensor that we build, called position, is this "pos" here, and the second tensor that we build is the denominator here, but we calculate it in log space for numerical stability. The value actually will be slightly different, but the result will be the same: the model will learn this positional encoding. Don't worry if you don't fully understand this part, these are just very special, let's say, functions that convey this positional information to the model, and if you watched my previous video you will also understand why. Now we apply this position and this denominator to the sine and the cosine. As you remember, the sine is only used for the even positions and the cosine only for the odd positions, so we will apply it twice. Let's do it. So every word will have the sine, but only in the even dimensions, starting from 0 up to the end and going forward by 2, which means the number 0, then the number 2, then the number 4, etc.: position multiplied by div_term. Then we do the same for the cosine; in this case we start from 1 and go forward by 2, which means 1, 3, 5, etc. And then we need to add the batch dimension to this tensor, so that we can apply it to whole batches of sentences, because now the shape is (seq_len, d_model) but we will have a batch of sentences. So what we do is we add a new dimension to this pe, and this is done using unsqueeze in the first position, so it will become a tensor of shape (1, seq_len, d_model). And finally we can register this tensor in the buffer of this module. So what is the buffer of the module? Let's first do it: register_buffer. Basically, when you have a tensor that you want to keep inside the module, not as a learned parameter, but you want it to be saved when you save the file of the module, you should register it as a buffer. This way the tensor will be saved in the file along with the state of the module. Then we do the forward method. As you remember from before, we need to add this positional encoding to every word inside the sentence, so let's do it: we just do x equal to x plus the positional encoding for this particular sentence, and we also tell the module that we don't want to learn this positional encoding, because it is fixed, it will always be the same, it is not learned along the training process, so we just call requires_grad_(False) on it. This will make this particular tensor not learned. And then we apply the dropout, and that's it, this is the positional encoding.
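Here is a sketch of the positional encoding module as just described; again it is only a sketch that follows the description above, not necessarily the exact code from the video:

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        # Matrix of shape (seq_len, d_model)
        pe = torch.zeros(seq_len, d_model)
        # (seq_len, 1): positions 0 .. seq_len - 1
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        # Denominator of the formula, computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # (1, seq_len, d_model), to allow a batch dimension
        # Saved with the module's state, but not a learned parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)
```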

Let's have a look at the next module. First we will build the encoder part of the transformer, which is this left side here, and we still have to build the multi-head attention, the add & norm and the feed forward, and actually there is another layer which connects the skip connections to all these sublayers. So let's start with the easiest one, layer normalization, which is this add & norm. As you remember from my previous video, let's have a quick briefing on layer normalization. Layer normalization basically means that if you have a batch of n items, in this case only 3, each item will have some features. Let's say that these are actually sentences, and each sentence is made up of many words with their numbers. So these are our 3 items, and layer normalization means that for each item in this batch we calculate a mean and a variance independently from the other items of the batch, and then we calculate the new values for each of them using their own mean and their own variance. In layer normalization we usually also introduce some parameters called gamma and beta; some people call them alpha and beta, some people call them alpha and bias, okay, it doesn't matter. One is multiplicative, so it's multiplied by each of these x, and one is additive, so it's added to each one of these x. Why? Because we want the model to have the possibility to amplify these values when it needs them to be amplified, so the model will learn to multiply this gamma by these values in such a way as to amplify the values that it wants to be amplified. Okay, let's go build the code for this layer. Let's define the layer normalization class and the constructor as usual. In this case we don't need any parameter except for one that I will show you now, which is epsilon; usually EPS stands for epsilon, which is a very small number that you need to give to the model, and I will also show you why we need this number. In this case we use 10 to the power of -6. Let's save it. Okay, this epsilon is needed because, if we look at the slide, we have this epsilon here in the denominator of this formula: x hat equals (x_j minus mu) divided by the square root of (sigma squared plus epsilon). Why do we need this epsilon?

Because imagine this denominator: if sigma happens to be 0 or very close to 0, this new x will become very big, which is undesirable, as we know that the CPU or the GPU can only represent numbers up to a certain precision and scale, so we don't want very big numbers or very small numbers. So usually, for numerical stability, we use this epsilon, also to avoid a division by 0. Let's go forward. Now let's introduce the two parameters that we will use for the layer normalization: one is called alpha, which will be multiplied, and one is bias, which will be added. Usually the additive one is called bias, it's always added, and the alpha is the one that is multiplied. In this case we will use nn.Parameter; this makes the parameter learnable. And we also define the bias. This one, I want to remind you, is multiplied, and this one is added. Let's define the forward. Okay, as you remember, we need to calculate the mean and the standard deviation (or the variance). For both of these we calculate over the last dimension, so everything after the batch, and we keep the dimension: this keepdim parameter is needed because usually the mean cancels the dimension to which it is applied, but we want to keep it. And then we just apply the formula that we saw on the slide: alpha multiplied by what?

x minus its mean, divided by the standard deviation plus self.eps, everything added to the bias, and this is our layer normalization. Okay, let's go have a look at the next layer we are going to build. The next layer we are going to build is the feed forward, you can see it here, and the feed forward is basically a fully connected layer that the model uses both in the encoder and in the decoder. Let's first have a look at the paper to see what the details of this feed forward layer are. In the paper, the feed forward layer is basically two matrices, one W1 and one W2, that are multiplied by this x one after another, with a ReLU in between and with a bias. We can do this in PyTorch using linear layers, in which we define the first one to be the matrix with W1 and B1 and the second one to be W2 and B2, and in between we apply a ReLU. In the paper we can also see the dimensions of these matrices: the first one is basically d_model to d_ff and the second one is from d_ff to d_model, where d_ff is 2048 and d_model is 512. Let's go build it: class FeedForwardBlock. Also in this case we build the constructor, and in the constructor we need to define these two values that we saw in the paper, so d_model and d_ff, and also in this case a dropout. We define the first matrix, so W1 and B1, to be a linear layer from d_model to d_ff, then we define the dropout, and then we define the second matrix, W2 and B2. So let me write the comments here: this is W1 and B1, from d_model to d_ff, and this is W2 and B2, from d_ff to d_model. Why do we have B2?

Because, actually, as you can see here, bias is True by default, so it's already defining a bias for us. Okay, let's define the forward method. In this case what we are going to do is: we have an input sentence, which is a tensor of dimensions (batch, seq_len, d_model); first we convert it using linear_1 into another tensor of shape (batch, seq_len, d_ff), because if we apply this linear it will convert the d_model into d_ff, and then we apply linear_2, which will convert it back to d_model; we apply the dropout in between. And this is our feed forward block. Let's go have a look at the next block.
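Before that, here are minimal sketches of the layer normalization and the feed-forward block described above (names and the eps default are assumptions based on the description):

```python
import torch
import torch.nn as nn


class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1))  # multiplicative
        self.bias = nn.Parameter(torch.zeros(1))  # additive

    def forward(self, x):
        # Mean and std over the last dimension (the features), keeping the dimension
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias


class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)  # W1 and B1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)  # W2 and B2

    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, d_ff) --> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
```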

Our next block is the most important and most interesting one: the multi-head attention. We saw briefly in the last video how the multi-head attention works, so I will open the slide again to rehearse how it actually works, and then we will do it practically by coding. As you remember, in the encoder we have the multi-head attention that takes the input of the encoder and uses it three times: one time it's called query, one time it's called key and one time it's called values. You can also think of it like a duplication of the input three times, or you can just say that it's the same input applied three times. The multi-head attention basically works like this: we have our input sequence, which is (seq_len, d_model); we transform it into three matrices Q, K and V, which are exactly the same as the input in this case, because we are talking about the encoder (you will see that in the decoder it's slightly different), and then we multiply them by matrices called Wq, Wk and Wv, and this results in new matrices of dimension (seq_len, d_model). We then split these matrices into h smaller matrices. Why h? Because it's the number of heads we want for this multi-head attention, and we split these matrices along the embedding dimension, not along the sequence dimension, which means that each head will have access to the full sentence, but to a different part of the embedding of each word. We apply the attention to each of these smaller matrices using this formula, which gives us smaller matrices as a result; then we combine them back, so we concatenate them, just like the paper says, so the concatenation of head one up to head h, and finally we multiply it by Wo to get the multi-head attention output, which again is a matrix that has the same dimension as the input matrix: as you can see, the output of the multi-head attention is also (seq_len, d_model). In this slide I actually didn't show the batch dimension, because we are talking about one sentence, but when we code the transformer we don't work with only one sentence but with multiple sentences, so we need to think that we have another dimension here, which is the batch. Okay, let's go code this multi-head attention. I will do it a little more slowly so we can see in detail how everything is done, but I really wanted you to have an overview again of how it works and why we are doing what we are doing. So let's go code it: class, and also in this case we define the constructor. What we need to give to this multi-head attention as parameters: for sure the d_model of the model, which in our case is 512, the number of heads, which we call h just like in the paper, so h indicates the number of heads we want, and then the dropout value. We save these values. As you can see, we need to divide this embedding vector into h heads, which means that d_model should be divisible by h, otherwise we cannot divide the vector representing the embedding equally into equal matrices for each head. So we make sure that d_model is divisible by h, basically, and this will make the check. If we watch my slide again, we can see that the value d_model divided by h is called d_k: if we divide the d_model by h heads, we get a new value which is called d_k, and to be aligned with the nomenclature used in the paper we will also call it d_k. So d_k is d_model divided by h. Okay, let's also define the matrices by which we will multiply the query, the key and the values, and also the output matrix Wo. Each of these is again a linear layer, from d_model to d_model. Why from d_model to d_model? Because, as you can see from my slides, this is (d_model, d_model), so that the output will be (seq_len, d_model). So this is Wq, this is Wk and this is Wv. Finally we also have the output matrix, which is called Wo here; this Wo is (h times d_v, d_model). So h by d_v: what is d_v?

d_v is actually equal to d_k, because it's d_model divided by h. But why is it called d_v here and d_k there? Because this head is actually the result of a chain of multiplications whose last multiplication is by V, and in the paper they call this value d_v, but on a practical level it's equal to d_k. So our Wo is also a matrix that is (d_model, d_model), because h times d_v is equal to d_model, and this is Wo. Finally we create the dropout. Let's implement the forward method and see how the multi-head attention works in detail during the coding process. We define the query, the key and the values, and there is this mask. So what is this mask?

The mask is basically this: if we want some words to not interact with other words, we mask them. We saw this in my previous video, but now let's go back to those slides to see what the mask is doing. As you remember, when we calculate the attention using this formula, so softmax of (Q multiplied by K transposed, divided by the square root of d_k) and then multiplied by V, we get this head matrix. But before we multiply by V, so with only this multiplication here, Q by K, we get this matrix, which relates each word with each other word: it's a (seq_len, seq_len) matrix. And if we don't want some words to interact with other words, we basically replace their value, so their attention score, with something that is very small before we apply the softmax, and when we apply the softmax these values will become zero, because as you remember the softmax has e to the power of x in the numerator, so if x goes to minus infinity, so a very small number, e to the power of minus infinity will become very small, very close to zero. So basically we hide the attention for those two words. This is the job of the mask. Just following my slide, we do the multiplications one by one. As we remember, first the query is multiplied by Wq, so self.w_q applied to the query gives us a new matrix, which is called Q' in my slides; here I just call it query. We do the same with the keys and the same with the values. Let me also write the dimensions: we are going from (batch, seq_len, d_model), and with this multiplication we are going to another matrix which is (batch, seq_len, d_model), and you can see that from the slides: when we do (seq_len, d_model) multiplied by (d_model, d_model), we get a new matrix which has the same dimensions as the initial matrix, so (seq_len, d_model), and it's the same for all three of them. Now what we want to do is divide this query, key and value into smaller matrices, so that we can give each small matrix to a different head. So let's do it. We will divide them using the view method of PyTorch, which means that we keep the batch dimension, because we don't want to split the sentence, we want to split the embedding into h parts; we also want to keep the second dimension, which is the sequence, because we don't want to split it; and the third dimension, so the d_model, we want to split into two smaller dimensions, which is h by d_k, so self.h, self.d_k. As you remember, d_k is basically d_model divided by h, so this multiplied by this gives you d_model. And then we transpose dimensions 1 and 2. Why do we transpose?

Because we prefer the h dimension, instead of being the third dimension, to be the second dimension, and this way each view, each head, will see all the sentence, so it will see these dimensions, the (seq_len, d_k). Let me also write the comment here: we are going from (batch, seq_len, d_model) to (batch, seq_len, h, d_k), and then by using the transposition we are going to (batch, h, seq_len, d_k). This is really important, because we want each head to watch this stuff, the (seq_len, d_k), which means that each head will see the full sentence, so each word in the sentence, but only a smaller part of the embedding. We do the same thing for the key and the value. Okay, now that we have these smaller matrices, let me go back to the slide so I can show you where we are: we did this multiplication, we obtained query, key and values, we split them into smaller matrices, and now we need to calculate the attention using this formula here. Before we can calculate the attention, let's create a function to calculate it, a new function that can also be used later: attention. Let's define it as a static method. Static method basically means that you can call this function without having an instance of this class: you can just say MultiHeadAttentionBlock.attention instead of needing an instance of this class. We also give it the dropout layer. Okay, what we do is we get the d_k. What is d_k here?

It's the last dimension of the query, the key and the value. And we will be using this function here; let me first call it, so that you can understand how we will use it, and then we define it. From this function we want two things: the output, and the attention scores, so the output of the softmax. And we will call it like this: we give it the query, the key, the value, the mask and the dropout layer. Now let's go back here. We have the d_k; now what we do is first apply the first part of the formula, that is, the query multiplied by the transpose of the key, divided by the square root of d_k. These are our attention scores: query, matrix multiplication (this @ sign means matrix multiplication in PyTorch), then the key with its last two dimensions transposed (-2, -1 means transpose the last two dimensions, so (seq_len, d_k) becomes (d_k, seq_len)), and then we divide this by math.sqrt(d_k). Before applying the softmax, as we saw before, we need to apply the mask: we want to hide some interactions between words, so we apply the mask and then we apply the softmax, and the softmax will take care of the values that we replaced. How do we apply the mask?

We just take all the values that we want to mask and replace them with very, very small values, so that the softmax will turn them into 0. So, if a mask is defined, apply it: this basically means replace all the values for which this statement is true with this value. The mask we will define in such a way that where this expression is true, we want the value to be replaced. Later we will also see how we build the mask; for now just take it for granted that these are all the values that we don't want to have in the attention: we don't want, for example, some word to watch future words (for example when we build the decoder), or we don't want the padding values to interact with other values, because they are just filler words used to reach the sequence length. We replace them with -1e9, so minus 10 to the power of 9, which is a very big number in the negative range and basically represents minus infinity, and then when we apply the softmax it will be replaced by 0. We apply the softmax to the last dimension. Okay, let me write some comments: in this case we have (batch, h, seq_len, seq_len), so each head will see this (seq_len, seq_len) matrix. Alright, if we also have a dropout, so if dropout is not None, we also apply the dropout, and finally, as we saw in the original slide, we multiply the output of the softmax by the V matrix, a matrix multiplication. So we return the attention scores multiplied by the value, and also the attention scores themselves. Why are we returning a tuple?

Because we want this output, of course, for the model, because we need to give it to the next layer, but the second element will be used for visualization. So the output of the self-attention, so the multi-head attention in this case, is actually going to be here, and we will use it for visualizing, so for visualizing what is the score given by the model for that particular interaction. Let me also write some comments here; so here we are doing like this: batch... and let's go back here. Now we have the output of our multi-head attention. What we do finally is, ok, let's go back to the slide first to see where we are. We calculated these smaller matrices here, so we applied the softmax of Q by K transposed divided by the square root of d_k, and then we also multiplied it by V, we can see it here, which gives us these small matrices here: head 1, head 2, head 3 and head 4. Now we need to combine them together, concat, just like the formula from the paper says, and finally multiply by Wo. So let's do it. We transpose, because before we had transformed the matrix so that the sequence length was the third dimension, and we want it back in its original place to combine them, because in the resulting tensor we want the sequence length to be in the second position. So let me write first what we want to do: we started from (batch, seq_len, d_model); first we do a transposition, and then what we want is this, so this transposition takes us here, and then we do a view. But we cannot do it directly, we need to use contiguous: this basically means that PyTorch, in order to transform the shape of a tensor, needs the memory to be contiguous, so we can just do it in place. Then -1 and self.h multiplied by self.d_k, which, as you remember, is d_model, because we defined d_k to be d_model divided by h. Ok, and finally we multiply this x by Wo, which is our output matrix; this will give us... we go from batch... and this is our multi-head attention block. We have, I think, all the ingredients now to combine them all together; we just miss one small layer, so let's go have a look at it. There is one last layer we need to build, which is the connection we can see here: for example, here we have some output of this layer, the add & norm, that is taken here with this connection, and this one part is sent here; then the output of this is sent to the add & norm and then combined together by this layer. So we need to create this layer that manages this skip connection: we take the input, we skip it past one layer, we take the output of the previous layer, in this case the multi-head attention, we give it to this layer, and we also combine it with this part. So let's build this layer; I will call it residual connection, because it's basically a skip connection. Ok, let's build this residual connection. As usual we define the constructor, and in this case we just need a dropout. As you remember, the skip connection is between the add & norm and the previous layer, so we also need the norm, which is our layer normalization that we defined before. And then we define the forward method, with the sublayer, which is the previous layer: what we do is we take the x and combine it with the output of the next layer, which in this case is called sublayer, and we apply the dropout. So this is the definition of add & norm. Actually, there is a slight difference: we first apply the normalization and then we apply the sublayer, while in the paper they apply first the sublayer and then the normalization. I saw many implementations, and most of them actually did it like this, so we will also stick with this particular choice.
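Here is a sketch of the multi-head attention block and the residual connection as described above; it reuses the LayerNormalization sketch from before, and the names are again my own assumption:

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float):
        super().__init__()
        assert d_model % h == 0, "d_model is not divisible by h"
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)  # Wq
        self.w_k = nn.Linear(d_model, d_model)  # Wk
        self.w_v = nn.Linear(d_model, d_model)  # Wv
        self.w_o = nn.Linear(d_model, d_model)  # Wo
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        # (batch, h, seq_len, d_k) @ (batch, h, d_k, seq_len) --> (batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Hide the interactions we do not want (future words, padding)
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        # The second value is returned only for visualization
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        query = self.w_q(q)  # (batch, seq_len, d_model)
        key = self.w_k(k)
        value = self.w_v(v)

        # Split into h heads: (batch, seq_len, d_model) --> (batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)

        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)

        # (batch, h, seq_len, d_k) --> (batch, seq_len, h, d_k) --> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
        return self.w_o(x)


class ResidualConnection(nn.Module):
    def __init__(self, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()  # defined in the earlier sketch

    def forward(self, x, sublayer):
        # Pre-norm variant: normalize first, then apply the sublayer, then add
        return x + self.dropout(sublayer(self.norm(x)))
```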
As you remember, these blocks are combined together by this bigger block here, and we have N of them. So this big block we will call the encoder block, and each of these encoder blocks is repeated N times, where the output of the previous one is sent to the next one, and the output of the last one is sent to the decoder. So we need to create this block, which will contain one multi-head attention, two add & norm and one feed forward. So let's do it. We will call this block the encoder block, because the decoder has three blocks inside, while the encoder has only two. As we saw before, we have the self-attention block inside, which is the multi-head attention; we call it self-attention because in the case of the encoder it is applied to the same input with three different roles: the role of the query, of the key and of the value. Then there is our feed forward, and then we have a dropout, which is a floating point, and then we define the two residual connections. We use a ModuleList, which is a way to organize a list of modules; in this case we need two of them. Okay, let's define the forward method. We define the source mask. What is the source mask?

it's the mask that we want to apply to the input of the encoder, and why do we need a mask for the input of the encoder? because we want to hide the interaction of the padding word with other words, we don't want the padding word to interact with other words so we apply the mask and let's do the first residual connection let's go back to check the video actually to check the slide so we can understand what we are doing now so the first skip connection is this X here is going to here, but before it's added and we add a norm, we first need to apply the multi-head attention, so we take this X we send it to the multi-head attention and at the same time we also send it here and then we combine the two so the first skip connection is between X and then the other X is coming from the self-attention so this is the function so I will define the sub-layer using a lambda, so this basically means first apply the self-attention self-attention in which we give the query key and the value is our X so our input, so this is why it's called self-attention, because the role of the query key and the value is X itself, so the input itself, so it's the sentence that is watching itself, so each word of one sentence is interacting with other words of the same sentence, we will see that in the decoder it's different because we have the cross-attention, so the keys coming from the decoder are watching the sorry, the query coming from the decoder are watching the key and the values coming from the encoder we give it the source mask, so what is this, basically we are calling this function, the forward function of the multi-head attention block, so we give query key value and the mask this will be combined with this by using the residual connection then again we do the second one, the second one is the feed forward we don't need lambda here actually and then we return X so this means combine the feed forward and then the X itself, so the output of the previous layer, which is this one and then apply the residual connection this defines our encoder block now we can define the encoder object, so because the encoder is made up of many encoder blocks, we can have up to N of them according to the paper, so let's define the encoder how many layers we will have, we will have N, so we have many layers and they are applied one after another, so this is a module list and at the end we will apply a layer normalization so we apply one layer after another the output of the previous layer becomes the input for the next layer, here I forgot something and finally we apply the normalization and this concludes our journey around the encoder let's go have a brief overview of what we have done we have taken the inputs, send it to the we didn't, ok, we didn't combine all the blocks together for now we just built this big block here called encoder which contains two smaller blocks that are the skip connection, the skip connection first one is between the multihead attention and this X that is sent here, the second one is between this feedforward and this X that is sent here, we have N of these blocks one after another the output of the last will be sent to the decoder before but before we apply the normalization now we built the decoder part, now in the decoder the output embeddings are the same as the input embeddings I mean the class that we need to define is the same, so we will just initialize it twice and the same goes for the positional encodings, we can use the same values that we use for the encoder, also for the decoder what we 
need to define is this big block here which is made of masked multihead attention, add a norm, so one skip connection here, another multihead attention with another skip connection and the feedforward with the skip connection here, the way we define the multihead attention class actually already takes into consideration the masks, so we don't need to reinvent the wheel, also for the decoder we can just define the decoder block which is this big block here made of three sublayers and then we build the decoder using this n number of this decoder blocks, so let's do it let's define first the decoder block in the decoder we have the self attention which is, let's go back this is a self attention because we have this input that is used three times in the masked multihead attention, so this is called self attention because the same input plays the role of the query, the key and the values which means that the same sentence is each word in the sentence is matched with each other word in the same sentence, but in this part here we will have an attention calculated using the query coming from the decoder while the key and the values will come from the encoder so this is not a self attention, this is called cross attention because we are crossing two kind of different objects together and matching them somehow to calculate the relationship between them, ok let's define this is the cross attention block which is basically the multihead attention but we will give it the different parameters this is our feedforward and then we have a dropout dropout ok, we defined also the residual connection, in this case we have three of them wonderful, ok let's build the forward method which is very similar to the encoder with a slight difference that I will highlight we need x, what is x?

it's the input of the decoder but we also need the output of the encoder we need the source mask which is the mask applied to the encoder and the target mask which is the mask applied to the decoder why they are called source mask and target mask? because in this particular case we are dealing with a translation task so we have a source language, in this case it's english and we have a target language which in our case is italian, so you can call it encoder mask or decoder mask but basically we have two masks, one is the one coming from the encoder one is the one coming from the decoder so in our case we will call it source, so the source mask is the one coming from the encoder, so the source language and the target mask is the one coming from the decoder, so the target language and just like before we calculate the self-attention first, which is the first part of the decoder block in which the query, the key and the values are the same input but with the mask of the decoder because this is the self-attention block of the decoder and then we need to combine, we need to calculate the cross-attention which is our second residual connection we give him ok, in this case we are giving the query coming from the decoder so the x, the key and the values coming from the encoder and the mask of the encoder and finally the feedforward block just like before and that's it, we have all the ingredients actually to build the decoder now which is just n times this block one after another just like we did for the encoder also in this case we will provide with many layers so layers this is just a model list and we will also have a normalization at the end just like we did before, we apply the input to one layer and then we use the output of the previous layer and give it as an input of the next layer each layer is a decoder block so we need to give it x we need to give it the encoder output, then the source mask and the target mask so each of them is this, we are calling the forward method here, so nothing different and finally we apply the normalization and this is our decoder there is one last ingredient we need to have what is a full transformer, so let's have a look at it the last ingredient we need is this layer here the linear layer as you remember from my slides the output of the multi-head attention is something that is sequenced by D-model so here, we expect to have the output to be sequenced by D-model if we don't consider the batch dimension however, we want to map these words back into the vocabulary so that's why we need this linear layer which will convert the embedding into a position of the vocabulary I will call this layer, call the projection layer, because it's projecting the embedding into the vocabulary, let's go build it what we need for this layer is the D-model, so the D-model which is an integer and the vocabulary size this is basically a linear layer that is converting from D-model to vocabulary size so .projectionlayer is let's define the forward method ok, what we want to do let me write this little comment we want to batch sequence length to D-model converted into batch sequence length vocabulary size and in this case we will also already apply the softmax and actually we will apply the log softmax for numerical stability like I showed before to the last dimension and that's it, this is our projection layer, now we have all the ingredients we need for the transformer, so let's define our transformer block in a transformer we have an encoder which is our encoder, we have a 
decoder which is our decoder we have a source embedding why we need a source embedding and a target embedding, because we are dealing with multiple languages, so we have one input embedding for the source language and one input embedding for the target language and we have the target embedding then we have the source position and the target position which will be the same actually and then we have the projection layer we just save this now we define three methods, one to encode one to decode and one to project we will apply them in succession why we don't just build one forward method because as we will see during inferencing we can reuse the output of the encoder, we don't need to calculate it every time and also we prefer to keep these outputs separate also for visualizing the attention so for the encoder we have the source of the because we have the source language and the source mask so what we do is we apply first the embedding then we apply the positional encoding and finally we apply the encoder then we define the decode method which takes the encoder output which is the tensor the source mask which is the tensor the target and the target mask oops and what we do is target we first apply the target embedding to the target sentence then we apply the positional encoding to the target sentence and finally we decode this is basically the forward method of this decoder so we have the same order of parameters yes finally we define the project method in which we just apply the projection so we take from the embedding to the vocabulary size ok, this is also this is the last block we had to build but we didn't make a method to combine all these blocks together, so we built many blocks we need one that given the hyperparameters of the transformer builds for us one single transformer initializing all the encoder, decoder, the embeddings etc.
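Building on the sketches above, here is roughly what the remaining blocks just described could look like: the encoder block, the encoder, the decoder block, the decoder, the projection layer and the Transformer wrapper with its encode, decode and project methods. It is a sketch that follows the description, not necessarily the exact code written on screen:

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    def __init__(self, self_attention_block, feed_forward_block, dropout: float):
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        # Self-attention: query, key and value are all the same x
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x


class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)


class DecoderBlock(nn.Module):
    def __init__(self, self_attention_block, cross_attention_block, feed_forward_block, dropout: float):
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        # Cross-attention: query from the decoder, key and value from the encoder
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x


class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)


class ProjectionLayer(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, vocab_size), with log softmax
        return torch.log_softmax(self.proj(x), dim=-1)


class Transformer(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer

    def encode(self, src, src_mask):
        return self.encoder(self.src_pos(self.src_embed(src)), src_mask)

    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_pos(self.tgt_embed(tgt)), encoder_output, src_mask, tgt_mask)

    def project(self, x):
        return self.projection_layer(x)
```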

So let's build this function, let's call it buildTransformer, that given all the hyperparameters will build the transformer for us and also initialize the parameters with some initial values. What do we need to define a transformer? For sure, in this case we are talking about translation: this model that we are building we will use for translation, but you can use it for any task, so the naming I'm using is basically the one used in a translation task. Later you can change the naming, but the structure is the same, so you can use it for any other task to which the transformer is applicable. So the first thing we need is the vocabulary size of the source and of the target, because we need to build the embeddings, and the embedding needs to convert the tokens of the vocabulary into vectors of size 512, so it needs to know how big the vocabulary is, so how many vectors it needs to create. Then the target vocabulary size, which is also an integer. Then we need to tell it what the source sequence length and the target sequence length are. This is very important: they could also be the same, in our case they will be the same, but they can also be different, for example in case you are using the transformer with two very different languages, for a translation in which the tokens needed for the source language are many more or many fewer than for the other one; you don't need to keep the same length, you can use different lengths. The next hyperparameter is the d_model, which we initialize to 512 because we want to keep the same values as the paper. Then we define the hyperparameter N, which is the number of layers, so the number of encoder blocks that we will be using, which according to the paper is 6. Then we define the hyperparameter h, which is the number of heads we want, and according to the paper it is 8. The dropout is 0.1, and finally we have the hidden layer d_ff of the feed-forward layer, which is 2048, as we saw before in the paper. And this builds a transformer. Okay, so first we create the embedding layers, the source embedding and then the target embedding. Then we create the positional encoding layers. We don't really need to create two positional encoding layers, because they do the same job and they don't add any parameters, but they have the dropout, and also I want to make the code verbose, so you can understand each part without any optimization. I think it's actually fine, because this is for educational purposes, so I don't want to optimize the code, I want to make it as comprehensible as possible, so I write every part I need and I don't take shortcuts. And then we create the encoder blocks; we have N of them, so let's create an empty array. Each encoder block has a self-attention, so the encoder self-attention, which is a multi-head attention block; the multi-head attention requires the d_model, the h and the dropout value. Then we have a feed-forward block. As you can see, the names I'm using are also quite long, mostly because I want to make it as comprehensible as possible for everyone. So each encoder block is made of a self-attention and a feed-forward, and finally we tell it how much the dropout is. Finally we add this encoder block.

And then we can create the decoder blocks. We also have the cross attention for the decoder block. We also have the feedforward, just like the encoder. Then we define the decoder block itself, which is decoder block, cross attention and finally the feedforward and the dropout. And finally we save it in its array.

We now can create the encoder and the decoder. We give him all his blocks, which are n and then also the decoder. And we create the projection layer, which will convert the model into vocabulary size. Which vocabulary? Of course the target, because we want to take from the source language to the target language.

So we want to project our output into the target vocabulary. And then we build the transformer. What does it need? An encoder, a decoder, source embedding, target embedding, then source positional encoding, target positional encoding, and finally the projection layer. And that's it. Now we can just initialize the parameters using the Xavier uniform.

This is a way to initialize the parameters to make the training faster so they don't just start with random values. And there are many algorithms to do it. I saw many implementations using Xavier, so I think it's a quite good start for the model to learn from. Finally return our beloved transformer.
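Putting it all together, a buildTransformer function along the lines described above might look like this; it is a sketch that uses the classes sketched earlier, with the signature and defaults following the hyperparameters mentioned in the video:

```python
import torch.nn as nn


def build_transformer(src_vocab_size: int, tgt_vocab_size: int, src_seq_len: int, tgt_seq_len: int,
                      d_model: int = 512, N: int = 6, h: int = 8, dropout: float = 0.1, d_ff: int = 2048):
    # Embedding and positional encoding layers
    src_embed = InputEmbeddings(d_model, src_vocab_size)
    tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)
    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)

    # N encoder blocks
    encoder_blocks = []
    for _ in range(N):
        self_attention = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward = FeedForwardBlock(d_model, d_ff, dropout)
        encoder_blocks.append(EncoderBlock(self_attention, feed_forward, dropout))

    # N decoder blocks
    decoder_blocks = []
    for _ in range(N):
        self_attention = MultiHeadAttentionBlock(d_model, h, dropout)
        cross_attention = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward = FeedForwardBlock(d_model, d_ff, dropout)
        decoder_blocks.append(DecoderBlock(self_attention, cross_attention, feed_forward, dropout))

    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))
    # Project into the target vocabulary
    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)

    transformer = Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer)

    # Xavier initialization of the parameters
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    return transformer
```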

And this is it. This is how you build the model. And now that we have built the model, we will go further to use it. So we will first have a look at the dataset, then we will build the training loop. After the training loop, we will also build the inferencing part and the code for visualizing the attention.

So hold on and take some coffee, take some tea, because it's going to be a little long, but it's going to be worth it. Now that we have built the code for the model, our next step is to build the training code. But before we do that, let's recheck the code, because we may have some typos.

I actually already made this check, and there are a few mistakes in the code. I compared the old version with the new one; they are very minor problems. There is a typo in the name "feedForward" here, and the same problem is also present in every reference to "feedForward", and also here, when we are building the decoder block.

And the other problem is that here, when we build the decoder block, we just wrote "nn.Module"; instead, it should be "nn.ModuleList". And then the "feedForward" name should also be fixed here and here in the buildTransformer method. Now I can delete the old one, since we don't need it anymore. Let me check the model.

It's the correct one, with "feedForward". Yes, okay. Our next step is to build the training code. But before we build the training code, we have to look at the data. What kind of data are we going to work with? So as I said before, we are dealing with a translation task.

And I have chosen this dataset called "opus_books", which we can find on HuggingFace. And we will also use the library from HuggingFace to download this dataset for us. And this is the only library we will be using beside PyTorch. Because of course we cannot reinvent the dataset by ourselves, so we will use this dataset.

And we will also use the HuggingFace tokenizer library to transform this text into vocabulary. Because our goal is to build the transformer, so not to reinvent the wheel about everything. So we will be only focusing on building and training the transformer. And in my particular case, I will be using the subset "English to Italian", but we will build the code in such a way that you can choose the language and the code will act accordingly.

If we look at the data, we can see that each data item is a pair of sentences in English and in Italian. For example, there was no possibility of taking a walk that day, which in Italian means "In quel giorno era impossibile passeggiare". So we will train our transformer to translate from the source language, which is English, into the target language, which is Italian.

So let's do it. We will do it step by step. So first we will make the code to download this dataset and to create the tokenizer. So what is a tokenizer? Let's go back to the slides to just have a brief overview of what we are going to do with this data.

The tokenizer is what comes before the input embeddings. So we have an English sentence, for example "Your cat is a lovely cat", but this sentence will come from our dataset. The goal of the tokenizer is to create these tokens, so to split this sentence into single words, and there are many strategies for that.

As you can see here, we have a sentence, which is "Your cat is a lovely cat". And the goal of the tokenizer is to split this sentence into single words, which can be done in many ways. There is the BPE tokenizer, there is the word-level tokenizer, there is the sub-word-level, word-part tokenizer.

There are many tokenizers. The one we will be using is the simplest one called the word-level tokenizer. So the word-level tokenizer basically will split this sentence, let's say by space. So each space defines the boundary of a word, and so into the single words, and each word will be mapped to one number.

So this is the job of the tokenizer, to build the vocabulary of these numbers and to map each word into a number. When we build the tokenizer, we can also create special tokens, which we will use for the transformer. For example, the tokens called padding, the token called the start-of-sentence, end-of-sentence, which are necessary for training the transformer.
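Just to make the idea concrete, here is roughly how such a trained word-level tokenizer is used with the HuggingFace tokenizers library. The file name, the special-token strings and the token ids shown are made-up placeholders, not values from the video:

```python
from tokenizers import Tokenizer

# Assuming a word-level tokenizer has already been trained and saved
tokenizer = Tokenizer.from_file("tokenizer_en.json")

ids = tokenizer.encode("Your cat is a lovely cat").ids
# e.g. [1452, 87, 14, 9, 2763, 87]  (made-up numbers, one id per word)

# Special tokens get their own ids in the vocabulary
sos_id = tokenizer.token_to_id("[SOS]")
pad_id = tokenizer.token_to_id("[PAD]")
```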

But we will do it step-by-step. So let's first build the code for building the tokenizer and for downloading the dataset. Let's create a new file; let's call it train.py. Okay, let's import our usual libraries: so torch, and we will also import torch.nn. And also, because we are using libraries from HuggingFace, we need to import these two libraries.

We will be using the datasets library, which you can install using pip. So from datasets we will be using load_dataset. And we will also be using the tokenizers library, also from HuggingFace, which you can install with pip. We also need to say which tokenizer we want, so we will use the word-level tokenizer.

And there are also the trainers, so the class that will train the tokenizer, the one that will create the vocabulary given the list of sentences. And we will split the words according to the whitespace. I will build the methods one by one: first I will build the method to create the tokenizer, and I will describe each parameter.

For now, you will not have the bigger picture, but later when we combine all these methods together, you will have the bigger picture. So let's first make the method that builds the tokenizer. So we will call it getOrBuildTokenizer. And this method takes the configuration, which is the configuration of our model.

We will define it later. Then the dataset, and the language for which we are going to build the tokenizer. We define the tokenizer path, so the file where we will save this tokenizer, and we build it with Path of the config value. Okay, let me define some things. First of all, this Path is coming from pathlib, so from pathlib import Path.

This is a library that allows you to create absolute path given relative paths. And we pretend that we have a configuration called the tokenizer file, which is the path to the tokenizer file. And this path is formattable using the language. So for example, we can have something like this, for example, something like this.

And this, given the language, will create a tokenizer for English or a tokenizer for Italian, for example. So if the tokenizer doesn't exist, we create it. I actually took all this code from HuggingFace; it's nothing complicated, I just took it from the quick tour of their tokenizers library. It's really easy to use, and it saves you a lot of time.

Because building a tokenizer from scratch would really be reinventing the wheel. And we will also introduce the unknown word, unknown. So what does it mean? If our tokenizer sees a word that it doesn't recognize from its vocabulary, it will replace it with this word, unknown: it will map it to the number corresponding to this word, unknown.

The pre-tokenizer means basically that we split by whitespace. And then we train, we build the trainer to train our tokenizer. Okay, this is the trainer. What does it mean? It means it will be a word-level trainer. So it will split words using the whitespace and using the single words.

And it will also have four special tokens. One is unknown, which means that if you cannot find that particular word in the vocabulary, just replace it with unknown. It will also have the padding, which we will use to train the transformer, the start of sentence and the end of sentence special tokens.

Min frequency means that, for a word to appear in our vocabulary, it has to have a frequency of at least two. Now we can train the tokenizer. We use this method, which means we first build a method that gives us all the sentences from our dataset, and we will build it later.

Okay, so let's build also this method called getAllSentence so that we can iterate through the data set to get all the sentences corresponding to the particular language for which we are creating the tokenizer. As you remember, each item in the data set, it's a pair of sentences, one in English, one in Italian.

We just want to extract one particular language. This is the item representing the pair. And from this pair, we extract only the one language that we want. And this is the code to build the tokenizer. Now let's write the code to load the data set and then to build the tokenizer.
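Here is a sketch of the two tokenizer methods just described. The snake_case names, the config key 'tokenizer_file' and the exact special-token strings '[UNK]', '[PAD]', '[SOS]', '[EOS]' are my assumptions; the logic follows the description above:

```python
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace


def get_all_sentences(ds, lang):
    # Each item is a pair of sentences; yield only the requested language
    for item in ds:
        yield item['translation'][lang]


def get_or_build_tokenizer(config, ds, lang):
    # e.g. config['tokenizer_file'] = 'tokenizer_{0}.json', formattable with the language
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not tokenizer_path.exists():
        tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = WordLevelTrainer(special_tokens=['[UNK]', '[PAD]', '[SOS]', '[EOS]'], min_frequency=2)
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer
```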

We will call this method getDataset, and it also takes the configuration of the model, which we will define later. So let's load the dataset; we will call it ds_raw. Okay, HuggingFace allows us to download its datasets very easily: we just need to tell it what the name of the dataset is.

And then tell him what is the subset we want. We want the subset that is English to Italian, but we want to also make it configurable for you guys to change the language very fast. So let's build this subset dynamically. We will have two parameters in the configuration. One is called languageSource and one is called languageTarget.

Later we can also define what split we want of this data set. In our case, there is only the training split in the original data set from HuggingFace, but we will split by ourself into the validation and the training data. So let's build the tokenizer. This is the raw data set and we also have the target.

Okay, now, because we only have the training split from HuggingFace, we can split it by ourselves into a training and a validation set. We keep 90% of the data for training and 10% for validation. Okay, let's build the tokenizer. randomSplit is a method from PyTorch that allows us to split a dataset using the sizes that we give as input.

So in this case, it means: split this dataset into these two smaller datasets, one of this size and one of this size. But let's import the method from torch. Let's also import the ones that we will need later: DataLoader and randomSplit. Now we need to create the dataset.
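We will create that dataset class in a moment; first, here is a sketch of the getDataset code written so far, assuming config keys named 'lang_src' and 'lang_tgt' for the two languages (the key names and the function name are assumptions):

```python
from datasets import load_dataset
from torch.utils.data import random_split


def get_ds(config):
    # e.g. lang_src = 'en', lang_tgt = 'it' selects the 'en-it' subset of opus_books
    ds_raw = load_dataset('opus_books', f"{config['lang_src']}-{config['lang_tgt']}", split='train')

    # Build (or load) one tokenizer per language
    tokenizer_src = get_or_build_tokenizer(config, ds_raw, config['lang_src'])
    tokenizer_tgt = get_or_build_tokenizer(config, ds_raw, config['lang_tgt'])

    # Keep 90% of the data for training and 10% for validation
    train_ds_size = int(0.9 * len(ds_raw))
    val_ds_size = len(ds_raw) - train_ds_size
    train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])

    # Wrapping these splits in the bilingual dataset and the DataLoaders comes next
    return train_ds_raw, val_ds_raw, tokenizer_src, tokenizer_tgt
```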

The data set that our model will use to access the tensors directly, because now we just created the tokenizer and we just loaded the data, but we need to create the tensors that our model will use. So let's create the data set. Let's call it bilingual data set and for that we create a new file.

So here we import Torch and that's it. We will call the data set, we will call it bilingual data set. Okay as usual we define the constructor and in this constructor we need to give him the data set downloaded from HuggingFace, the tokenizer of the source language, the tokenizer of the target language, the source language, the name of the source language, the name of the target language and the sequence length that we will use.

Okay we save all these values. We can also save the tokens, the particular tokens that we will use to create the tensors for the model. So we need the start of sentence, end of sentence and the padding token. So how do we convert the token start of sentence into a number, into the input ID?

There is a special method of the tokenizer to do that, so let's do it. So this is the start of sentence token, we want to build it into a tensor. This tensor will contain only one number which is given by, we can use this tokenizer from the source or the target, it doesn't matter because they both contain these particular tokens.

This is the method to convert the token into a number, so start of sentence, and the type of this tensor: we want it to be long, because the vocabulary size can be large, so we usually use the 64-bit long type. And we do the same for the end of sentence and the padding token.
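A sketch of this constructor might look like the following, assuming the special tokens were registered as '[SOS]', '[EOS]' and '[PAD]' when we trained the tokenizer:

    import torch
    from torch.utils.data import Dataset

    class BilingualDataset(Dataset):

        def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len):
            super().__init__()
            self.ds = ds
            self.tokenizer_src = tokenizer_src
            self.tokenizer_tgt = tokenizer_tgt
            self.src_lang = src_lang
            self.tgt_lang = tgt_lang
            self.seq_len = seq_len

            # Convert the special tokens to their input IDs once; either tokenizer works,
            # since both were trained with the same special tokens. Stored as 64-bit longs.
            self.sos_token = torch.tensor([tokenizer_tgt.token_to_id('[SOS]')], dtype=torch.int64)
            self.eos_token = torch.tensor([tokenizer_tgt.token_to_id('[EOS]')], dtype=torch.int64)
            self.pad_token = torch.tensor([tokenizer_tgt.token_to_id('[PAD]')], dtype=torch.int64)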

We also need to define the length method of this dataset, which tells the length of the dataset itself, so basically just the length of the dataset from hugging face, and then we need to define the get item method. First of all we will extract the original pair from the hugging face dataset, then we extract the source text and the target text.

And finally we convert each text into tokens, and then into input IDs. What does it mean? The tokenizer will first split the sentence into single words, and then map each word into its corresponding number in the vocabulary, and it will do it in one pass only; this is done by the encode method, and its .ids attribute gives us the input IDs, so the numbers corresponding to each word in the original sentence, given as an array.

We do the same for the decoder side. Now, as you remember, we also need to pad the sentence to reach the sequence length. This is really important because the model always works with a fixed sequence length, but we don't have enough words in every sentence, so we use the padding token, this PAD here, to fill the sentence until it reaches the sequence length.

So we calculate how many padding tokens we need to add for the encoder side and for the decoder side, which is basically how many we need to reach the sequence length. Minus two, why minus two here? So we already have this amount of tokens, we need to reach this one, but we will add also the start of sentence token and the end of sentence token to the encoder side, so we also have minus two here.

And here only minus one. If you remember my previous video, when we do the training, we add only the start of sentence token to the decoder side, and then in the label we only add the end of sentence token. So in this case we only need to add one token, special token to the sentence.

We also make sure that this sequence length that we have chosen is enough to represent all the sentences in our dataset, and if we chose too small one, we want to raise an exception. So basically this number of padding tokens should never become negative. Okay, now let's build the two tensors for the encoder input and for the decoder input, but also for the label.

So one sentence will be sent to the input of the encoder, one sentence will be sent to the input of the decoder, and one sentence is the one that we expect as the output of the decoder. And that output we will call label. Usually it's called target or label.

I call it label. We will concatenate three tensors: first the start of sentence token, then the tokens of the source text, then the end of sentence token, and then enough padding tokens to reach the sequence length. We already calculated how many padding tokens we need to add to this sentence, so let's just do it.

And this is the encoder input, so let me write some comment here: this is "add SOS and EOS to the source text". Then we build the decoder input, which is also a concatenation of tokens. In this case we don't have the end of sentence token, we just have the start of sentence token.

And finally we add enough padding tokens to reach the sequence length. We already calculated how many we need, so we just use that value now. And then we build the label. In the label we only add the end of sentence token, and we need the same number of padding tokens as for the decoder input.

Just for debugging, let's double check that we actually reach the sequence length. Ok, now that we have made this check, let me also write some comments here. Here we are only adding SOS to the decoder input. And here is add EOS to the label, what we expect as output from the decoder.
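Putting these steps together, the middle of __getitem__ might look roughly like this, assuming enc_input_tokens and dec_input_tokens are the lists of input IDs produced by the encode step described above:

    # How many [PAD] tokens we need on each side: the encoder input gets both
    # [SOS] and [EOS] (hence -2), the decoder input only gets [SOS] (hence -1)
    enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2
    dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1
    if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
        raise ValueError('Sentence is too long for the chosen sequence length')

    # Encoder input: [SOS] + source tokens + [EOS] + padding
    encoder_input = torch.cat([
        self.sos_token,
        torch.tensor(enc_input_tokens, dtype=torch.int64),
        self.eos_token,
        torch.full((enc_num_padding_tokens,), self.pad_token.item(), dtype=torch.int64),
    ])

    # Decoder input: [SOS] + target tokens + padding (no [EOS])
    decoder_input = torch.cat([
        self.sos_token,
        torch.tensor(dec_input_tokens, dtype=torch.int64),
        torch.full((dec_num_padding_tokens,), self.pad_token.item(), dtype=torch.int64),
    ])

    # Label: target tokens + [EOS] + padding (no [SOS]); what we expect the decoder to predict
    label = torch.cat([
        torch.tensor(dec_input_tokens, dtype=torch.int64),
        self.eos_token,
        torch.full((dec_num_padding_tokens,), self.pad_token.item(), dtype=torch.int64),
    ])

    # Double check that every tensor is exactly seq_len long
    assert encoder_input.size(0) == self.seq_len
    assert decoder_input.size(0) == self.seq_len
    assert label.size(0) == self.seq_len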

Now we can return all these tensors so that our training can use them. We return a dictionary comprised of encoder input. What is the encoder input? It's basically of size sequence length. Then we have the decoder input, which is also just a sequence length number of tokens. I forgot a comma here.

And then we have the encoder mask. So what is the encoder mask? As you remember, we are increasing the size of the encoder input sentence by adding padding tokens. But we don't want these padding tokens to participate in the self-attention. So what we need is to build a mask that says that we don't want these tokens to be seen by the self-attention mechanism.

And so we build the mask for the encoder. How do we build this mask? We just say that all the tokens that are not padding are okay, and all the tokens that are padding are not okay. We also unsqueeze to add the sequence dimension and also to add the batch dimension later.

And we convert into integers. So this is 1, 1 sequence length, because this will be used in the self-attention mechanism. However, for the decoder, we need a special mask that is a causal mask, which means that each word can only look at the previous word and each word can only look at non-padding words.

So again, we don't want the padding tokens to participate in the self-attention; we only want real words to participate. And we also don't want each word to look at words that come after it, but only at words that come before it. So I will use a method here called causal mask that will build it.

Later we will build it also. So now I just call it to show you how it's used, and then we will proceed to build it. So in this case, we don't want the padding tokens, and we add the necessary dimensions. And also we do a Boolean end with causal mask, which is a method that we will build right now.

And this causal mask needs to build a matrix of size sequence length by sequence length. What is sequence length? It's basically the size of our decoder input. And let me write a comment for you: so this is (1, seq_len), combined with a boolean AND with (1, seq_len, seq_len), and this can be broadcasted.

Let's go define this method, causal mask. So what is the causal mask? Let's go back to the slides actually. As you remember from the slides, we want each word in the decoder to only watch words that come before it. This matrix represents the multiplication of the queries by the keys in the self-attention mechanism, and what we want is to hide all the values above its diagonal.

What we want is to hide all these values. So "your" cannot watch the words "cat is a lovely cat"; it can only watch itself. But this word here, for example, the word "lovely", can watch everything that comes before it, so from "your" up to "lovely" itself, but not the word "cat" that comes after it.

So what we do is we want all these values here to be masked out. So which also means that we want all the values above this diagonal to be masked out. And there is a very practical method in PyTorch to do it. So let's do it. Let's go build this method.

So the mask is basically torch.triu, which means: give me every value that is above the diagonal that I am telling you. Which matrix? A matrix made of all ones. This method will return every value above the diagonal, and everything else will become zero. We pass diagonal equal to one, and we want the type to be integer.

And what we do is return mask == 0. torch.triu keeps all the values above the diagonal, and everything on or below the diagonal becomes zero; but we actually want the opposite. So we say: everything that is zero will become true with this expression, and everything that is not zero will become false.
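A minimal sketch of this helper, using torch.triu as just described:

    def causal_mask(size):
        # Upper triangular matrix of ones strictly above the diagonal: (1, size, size)
        mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
        # Invert it: positions on or below the diagonal become True (visible),
        # positions above the diagonal become False (masked out)
        return mask == 0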

So we apply it here to build this mask. This mask will be one by sequence length by sequence length, which is exactly what we want. Okay, let's add also the label. The label is also of size sequence length (oh, I forgot the comma). And then we have the source text, just for visualization, and then the target text.
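The dictionary returned by __getitem__ might then look roughly like this, with the two masks built as just described (the key names are illustrative):

    return {
        'encoder_input': encoder_input,  # (seq_len)
        'decoder_input': decoder_input,  # (seq_len)
        # Hide only the padding tokens in the encoder self-attention: (1, 1, seq_len)
        'encoder_mask': (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(),
        # Hide padding AND future tokens in the decoder self-attention:
        # (1, seq_len) & (1, seq_len, seq_len) broadcasts to (1, seq_len, seq_len)
        'decoder_mask': (decoder_input != self.pad_token).unsqueeze(0).int()
                        & causal_mask(decoder_input.size(0)),
        'label': label,                  # (seq_len)
        'src_text': src_text,
        'tgt_text': tgt_text,
    }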

And this is our data set. Now let's go back to our training method to continue writing the training loop. So now that we have the data set, we can create it. We can create two data sets, one for training, one for validation, and then we send it to a data loader and finally to our training loop.

We forgot to import the data set, so let's import it here. We also import the causal mask, which we will need later. What is our source language? It's in the configuration. What is our target language? And what is our sequence length? It's also in the configuration. We do the same for the validation.

But the only difference is that we use the validation split now, and the rest is the same. Also, just for choosing the max sequence length, we want to check what is the maximum length of each sentence in the source and the target for each of the two splits that we created here.

So that if we choose a very small sequence length, we will know. Basically what we do, I load each sentence from each language, from the source and the target language. I convert into IDs using the tokenizer and I check the length. If the length is, let's say 180, we can choose 200 as sequence length, because it will cover all the possible sentences that we have in this data set.

If it's, let's say 500, we can use 510 or something like this, because we also need to add the start of sentence and the end of sentence tokens to these sentences. This is the source IDs, then let's create also the target IDs, and this is the language of target.

And then we just say: the source maximum length is the maximum of itself and the length of the current source sentence, and the target maximum is the same with the target IDs. Then we print these two values. And that's it. Now we can proceed to create the data loaders.

We define the batch size according to our configuration, which we still didn't define, but you can already guess what are its values. We want it to be shuffled. For the validation, I will use a batch size of one, because I want to process each sentence one by one. And this method returns the data loader of the training, the data loader of the validation, the tokenizer of the source language and the tokenizer of the target language.
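Continuing the get_dataset sketch from before, the max-length check, the wrapping into BilingualDataset and the data loaders might look roughly like this (still inside that function):

    # Sanity check: find the longest tokenized sentence in each language
    max_len_src = 0
    max_len_tgt = 0
    for item in ds_raw:
        src_ids = tokenizer_src.encode(item['translation'][config['lang_src']]).ids
        tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids
        max_len_src = max(max_len_src, len(src_ids))
        max_len_tgt = max(max_len_tgt, len(tgt_ids))
    print(f'Max length of source sentence: {max_len_src}')
    print(f'Max length of target sentence: {max_len_tgt}')

    # Wrap the raw splits with the BilingualDataset defined above
    train_ds = BilingualDataset(train_ds_raw, tokenizer_src, tokenizer_tgt,
                                config['lang_src'], config['lang_tgt'], config['seq_len'])
    val_ds = BilingualDataset(val_ds_raw, tokenizer_src, tokenizer_tgt,
                              config['lang_src'], config['lang_tgt'], config['seq_len'])

    # Batch size from the config for training; batch size 1 for validation
    train_dataloader = DataLoader(train_ds, batch_size=config['batch_size'], shuffle=True)
    val_dataloader = DataLoader(val_ds, batch_size=1, shuffle=True)

    return train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt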

Now we can start building the model. So let's define a new method called getModel, which will, according to our configuration and our vocabulary sizes, build the transformer model. We didn't import the model yet, so let's import it from our model file. What do we pass first? The source vocabulary size and the target vocabulary size.

And then we have the sequence length of the source language and the sequence length of the target language; we will use the same for both. And then we have the D model, which is the size of the embedding. We can keep all the rest at the defaults, as in the paper.

If the model is too big for your GPU to train, you can try to reduce the number of heads or the number of layers. Of course, it will impact the performance of the model. But given the dataset, which is not so big and not so complicated, it should not be a big problem, because we are not dealing with a huge dataset anyway.
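A sketch of this getModel method, assuming the builder function we wrote earlier in the model file is called build_transformer (adjust the name to whatever you used):

    from model import build_transformer  # assumption: the builder from our model file

    def get_model(config, vocab_src_len, vocab_tgt_len):
        # Same sequence length for source and target, embedding size from the config;
        # number of layers, heads, dropout and d_ff keep the defaults from the paper
        model = build_transformer(vocab_src_len, vocab_tgt_len,
                                  config['seq_len'], config['seq_len'],
                                  d_model=config['d_model'])
        return model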

OK, now that we have the model, we can start building the training loop. But before we build the training loop, let me define this configuration, because it keeps coming up and I think it's better to define its structure now. So let's create a new file called config.py in which we define two methods.

One is called getConfig and one is to map to get the path where we will save the weights of the model. OK, let's define the batch size. I choose 8. You can choose something bigger if your computer allows it. The number of epochs for which we will be training, I would say 20 is enough.

The learning rate, I am using 10 to the power of -4. You can use other values. I thought this learning rate is reasonable. It's possible to change the learning rate during training. It's quite common to give a very high learning rate and then reduce it gradually with every epoch.

We will not be using it because it would just complicate the code a little more, and this is not actually the goal of this video; the goal of this video is to teach how the transformer works. I have already checked the sequence length that we need for this particular dataset from English to Italian, and 350 is more than enough.

And the D model that we will be using is the default of 512. The language source is English, so we are going from English. The language target is Italian, we are going to translate into Italian. We will save the model into a folder called weights. And the base file name of the model will be T model, for transformer model.

I also built the code to preload the model in case we want to restart the training after, maybe, a crash. And this is the tokenizer file, so it will be saved like this: tokenizer_en and tokenizer_it, according to the language. And this is the experiment name for TensorBoard, on which we will save the losses while training.

I think there is a comma here. Okay. Now let's define another method that allows us to build the path where we need to save the weights. Why am I creating such a complicated structure? Because I will also provide notebooks to run this training on Google Colab, so we just need to change these parameters to make it work on Google Colab and save the weights directly on your Google Drive.

I have already created actually this code and it will be provided on GitHub and I will also provide the link in the video. Okay, the file is built according to model base name, then the epoch.pt. Let's import also here the path library. Okay, now let's go back to our training loop.
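Before we do, here is a minimal sketch of what config.py might contain; the key names are illustrative, but the values are the ones just described:

    from pathlib import Path

    def get_config():
        return {
            'batch_size': 8,
            'num_epochs': 20,
            'lr': 1e-4,
            'seq_len': 350,
            'd_model': 512,
            'lang_src': 'en',
            'lang_tgt': 'it',
            'model_folder': 'weights',
            'model_basename': 'tmodel_',
            'preload': None,                       # set to an epoch string to resume training
            'tokenizer_file': 'tokenizer_{0}.json',
            'experiment_name': 'runs/tmodel',      # TensorBoard log folder
        }

    def get_weights_file_path(config, epoch: str):
        # e.g. weights/tmodel_03.pt
        model_filename = f"{config['model_basename']}{epoch}.pt"
        return str(Path('.') / config['model_folder'] / model_filename)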

Okay, we can build the training loop now, finally. So, train model, given the configuration. First we need to define the device on which we will put all the tensors, so we define the device and we also print it. We make sure that the weights folder is created. And then we load our data set.

To get the vocabulary size, there is a method called get_vocab_size, and I think we don't need any other parameter. Finally, we transfer the model to our device. We also start TensorBoard; TensorBoard allows us to visualize the loss charts. Let's also import TensorBoard. Let's go back. Let's also create the optimizer.

I will be using the Adam optimizer. Okay, since we also have the configuration that allows us to resume the training in case the model or something else crashes, let's implement that too. It will allow us to restore the state of the model and the state of the optimizer.
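The setup part of this training function might then look roughly like this, a sketch under the naming assumptions used in the snippets above:

    import torch
    from pathlib import Path
    from torch.utils.tensorboard import SummaryWriter

    def train_model(config):
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(f'Using device {device}')
        Path(config['model_folder']).mkdir(parents=True, exist_ok=True)

        train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_dataset(config)
        model = get_model(config, tokenizer_src.get_vocab_size(),
                          tokenizer_tgt.get_vocab_size()).to(device)

        writer = SummaryWriter(config['experiment_name'])   # TensorBoard
        optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])

        # Optionally resume from a previous snapshot (model + optimizer + step counter)
        initial_epoch = 0
        global_step = 0
        if config['preload']:
            model_filename = get_weights_file_path(config, config['preload'])
            state = torch.load(model_filename)
            model.load_state_dict(state['model_state_dict'])
            initial_epoch = state['epoch'] + 1
            optimizer.load_state_dict(state['optimizer_state_dict'])
            global_step = state['global_step']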

Let's import this method we defined in the data set. We load the file. And we run it. Here we have a typo. Okay, the loss function we will be using is the cross entropy loss. We need to tell him what is the ignore index. So we want him to ignore the padding token basically.

We don't want the padding token to contribute to the loss. And we also will be using label smoothing. Label smoothing basically allows our model to be less confident about its decision. So how to say, imagine our model is telling us to choose the word number three and with a very high probability.

So what we will do with label smoothing is take a little percentage of that probability and distribute it to the other tokens, so that our model becomes less sure of its choices, so it overfits less, and this actually improves the accuracy of the model. We will use a label smoothing of 0.1, which means: from the highest-probability token, take 10% of its score and give it to the others.
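As a sketch, the loss function set up as just described might look like this, assuming the '[PAD]' special token name used earlier:

    import torch.nn as nn

    # Ignore padding positions in the loss and smooth 10% of the probability mass
    # over the other tokens of the target vocabulary
    loss_fn = nn.CrossEntropyLoss(
        ignore_index=tokenizer_tgt.token_to_id('[PAD]'),
        label_smoothing=0.1,
    ).to(device)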

Okay, let's finally build the training loop. We tell the model to train, and I build a batch iterator for the data loader using tqdm, which will show a very nice progress bar; we need to import tqdm. Okay, finally we get the tensors, starting with the encoder input. What is the size of this tensor?

It's (batch, seq_len). The decoder input is the decoder input of the batch, and we also move it to our device; it's also (batch, seq_len). Then we get the two masks: this is the size of the encoder mask, and then the decoder mask. Why are these two masks different? Because in one case we are only telling him to hide the padding tokens, while in the other case, for each word, we are also telling him to mask out all the subsequent words.

Okay now we run the tensors through the transformer. So first we calculate the output of the encoder and we encode using what the encoder input and the mask of the encoder. Then we calculate the decoder output using the encoder output, the source, the mask of the encoder, then the decoder input and the decoder mask.

Okay, as we know, the output of model.encode will be (batch, seq_len, d_model). Also the output of the decoder will be (batch, seq_len, d_model). But we want to map it back to the vocabulary, so we need the projection.

So let's get the projection output. And this will produce a tensor of shape (batch, seq_len, target_vocab_size). Okay, now that we have the output of the model, we want to compare it with our label, so first let's extract the label from the batch, and we also put it on our device.

So what is the label? It's (batch, seq_len): for each batch element and each sequence position, it tells us what the position in the vocabulary of that particular word is. We want these two tensors to be comparable, so before computing the loss we first reshape the projection output with view minus one, like I show you now.

Okay, what does this do? It basically transforms the shape (batch, seq_len, target_vocab_size) into (batch * seq_len, target_vocab_size), because this is how the cross entropy wants the tensors to be. And we do the same flattening for the label.
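Inside the batch loop, the forward pass and the loss computation might look roughly like this, a sketch assuming the encode, decode and project methods of the Transformer we built earlier and the batch keys from the dataset sketch above:

    encoder_input = batch['encoder_input'].to(device)   # (batch, seq_len)
    decoder_input = batch['decoder_input'].to(device)   # (batch, seq_len)
    encoder_mask = batch['encoder_mask'].to(device)     # (batch, 1, 1, seq_len)
    decoder_mask = batch['decoder_mask'].to(device)     # (batch, 1, seq_len, seq_len)

    encoder_output = model.encode(encoder_input, encoder_mask)        # (batch, seq_len, d_model)
    decoder_output = model.decode(encoder_output, encoder_mask,
                                  decoder_input, decoder_mask)        # (batch, seq_len, d_model)
    proj_output = model.project(decoder_output)                       # (batch, seq_len, tgt_vocab_size)

    label = batch['label'].to(device)                                 # (batch, seq_len)
    # Flatten so cross entropy sees (batch * seq_len, tgt_vocab_size) vs (batch * seq_len)
    loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))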

Okay, now that we have calculated the loss, we can update our progress bar with it; this will show the loss on the progress bar. We can also log it on TensorBoard, and let's also flush it. Now we can backpropagate the loss, so loss.backward, and then we update the weights of the model, which is the job of the optimizer. Finally we zero out the gradients and we move the global step forward by one; the global step is used mostly for TensorBoard, to keep track of the loss. We can save the model every epoch: we build the model file name, which we get from our special method, the one we defined before, giving it the configuration we have and the name of the file, which is the epoch number with zeros in front, and we save our model.
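The end of each training step and the end-of-epoch snapshot (including the optimizer state, for the reason explained right below) might look roughly like this:

    # End of every training step: show and log the loss, then backpropagate and update
    batch_iterator.set_postfix({'loss': f'{loss.item():6.3f}'})
    writer.add_scalar('train loss', loss.item(), global_step)
    writer.flush()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    global_step += 1

    # End of every epoch: save a resumable snapshot
    model_filename = get_weights_file_path(config, f'{epoch:02d}')
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'global_step': global_step,
    }, model_filename)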

It is a very good idea, when we want to be able to resume the training, to save not only the state of the model but also the state of the optimizer, because the optimizer also keeps track of some statistics, one for each weight, to understand how to move each weight independently. I have seen that the optimizer dictionary is quite big, but even so, if you want your training to be resumable you need to save it; otherwise the optimizer will always restart from zero and will have to figure out from scratch how to move each weight, even if you start from a previous epoch. So every time I save a snapshot, I always include it.

The state of the model, so all the weights of the model; we also want to save the optimizer state and the global step, and we want to save all of this into the file name, so model file name, and that's it.

Now let's build the code to run this, so the if name equals main block. I really find the warnings frustrating, so I want to filter them out, because a lot of libraries, especially CUDA, produce warnings whose content I already know, and I don't want to see them every time. For you guys, though, I suggest watching them at least once to understand if there is any big problem; otherwise they are just complaints from CUDA.

Okay, let's try to run this code and see if everything is working fine. What we expect is that the code should download the data set the first time, then it should create the tokenizer and save it into its file, and it should also start training the model for all the epochs we configured. Of course it will never finish here, but let's do it. Let me check again the configuration and the tokenizer... okay, let's run it. Okay, it's building the tokenizer, and we have some problem here with the sequence length... okay, finally the model is training. Let me recap what I had mistaken: first of all, the sequence length was written incorrectly, there was a capital L here, and also in the data set I forgot to save it here, and here I had it written with a capital L as well. Now the training is going on, and as you can see the training is quite fast, or at least on my computer actually not so fast, but that's because I chose a batch size of 8; I could try to increase it. It's happening on CUDA, the loss is decreasing, and the weights will be saved here, so if we reach the end of the epoch it will create the first weights file here. Let's wait until the end of the epoch and see if the weights are actually created.

Before actually finishing the training of the model, let's do another thing: we also would like to visualize the output of the model while we are training, and this is called validation. We want to check how our model is evolving while it is getting trained. So what we want to build is a validation loop, which will allow us to evaluate the model, which also means that we want to inference from this model, check some sample sentences and see how they get translated. So let's start building the validation loop. The first thing we do is build a new method called run validation, and this method will accept some parameters; for now I just write all of them, and later I explain how they will be used.

Okay, the first thing we do to run the validation is put our model into evaluation mode, so we do model.eval. This tells PyTorch that we are going to evaluate our model, and then what we will do is inference two sentences and see what the output of the model is.

With torch.no_grad we are disabling the gradient calculation for every tensor that we run inside this with block, and this is exactly what we want: we just want to inference from the model, we don't want to train it during this loop. So let's get a batch from the validation data set; because we want to inference only two sentences, we keep a count of how many we have already processed, and we get the input from the current batch. I want to remind you that for the validation ds we only have a batch size of 1. This is the encoder input, and we can also get the encoder mask. Let's just verify that the size of the batch is actually 1.

Now let's go to the interesting part. As you remember, when we want to inference the model, we need to calculate the encoder output only once and reuse it for every token that the model will output from the decoder. So let's create another function that will run the greedy decoding on our model, and we will see that it runs the encoder only once. Let's call this function greedy decode.

Okay, let's create some tokens that we will need: the SOS token, which is the start of sentence; we can get it from either tokenizer, it doesn't matter if it's the source or the target, and the same for the EOS token. Then what we do is pre-compute the encoder output and reuse it for every token we get from the decoder, so we just give the source and the source mask, which are the encoder input and the encoder mask; we can also call them encoder input and encoder mask.

How do we do the inferencing? The first thing we do is give the decoder the start of sentence token, so that the decoder will output the first token of the translated sentence. Then at every iteration, just like we saw in my slides, we add the previous token to the decoder input, so that the decoder can output the next token. Then we take the next token, we append it again to the input of the decoder, and we get the successive token. So let's build a decoder input for the first iteration, which is only the start of sentence token; we fill this one with the start of sentence token and it has the same type as the encoder input.

Now we will keep asking the decoder to output the next token until we reach either the end of sentence token or the max length we have defined here. So we can do a while True, and our first stopping condition is when the decoder output, which becomes the input of the next step, reaches max length. Why do we have two dimensions? One is for the batch and one is for the tokens of the decoder input. Now we also need to create a mask for this; we can use our function causal mask to say that we don't want the input to watch future words, and we don't need the other mask because here we don't have any padding tokens, as you can see.

Now we calculate the output. We reuse the output of the encoder for every iteration of the loop, we reuse the source mask, so the mask of the encoder, then we give the decoder input along with its mask, the decoder mask, and then we get the next token. We get the probabilities of the next token using the projection layer, but we only want the projection of the last position, so the next token after the last one we have given to the decoder. Now we can use the max, so we get the token with the maximum probability: this is the greedy search. Then we take this word and we append it back to the decoder input, because it will become the input of the next iteration: we concat, so we take the decoder input and we append the next token, so we create another tensor for that. Yeah, should be correct. If the next word or token is equal to the end of sentence token, then we also stop the loop, and this is our greedy search. Now we can just return the output, and the output is basically the decoder input, because every time we are appending the next token to it; we remove the batch dimension, so we squeeze it, and that's our greedy decoding.

Now we can use it here in the validation function, so we can finally get the model output, which is equal to greedy decode, to which we give all the parameters. Then we want to compare this model output with what we expected, so with the label. Let's collect all of these: what we gave as input to the model, what the model output (the prediction), and what we expected as output. We save all of this in lists, and at the end of the loop we will print them. To get the text of the output of the model we need to use the tokenizer again to convert the tokens back into text, and we use of course the target tokenizer, because this is the target language. And now we save all of this into their respective lists, and we can also print it on the console.

Why are we using this function called print message and not just the print of Python? Because in the main training loop we are using tqdm, which is our really nice looking progress bar, and it is not suggested to print directly on the console when this progress bar is running. To print on the console there is a method called print provided by tqdm, and we will give this method to this function so that the output does not interfere with the progress bar printing. So we print some bars and then we print all the messages: one message here, one here and one here.
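As a sketch, the greedy decoding function just described might look like this, with a simplified signature and assuming the causal_mask helper and the model's encode, decode and project methods from before:

    def greedy_decode(model, encoder_input, encoder_mask, tokenizer_tgt, max_len, device):
        sos_idx = tokenizer_tgt.token_to_id('[SOS]')
        eos_idx = tokenizer_tgt.token_to_id('[EOS]')

        # Run the encoder only once and reuse its output for every decoding step
        encoder_output = model.encode(encoder_input, encoder_mask)
        # Start the decoder input with just the [SOS] token: shape (1, 1)
        decoder_input = torch.empty(1, 1).fill_(sos_idx).type_as(encoder_input).to(device)

        while True:
            if decoder_input.size(1) == max_len:
                break

            # Causal mask only: there are no padding tokens in what we feed the decoder here
            decoder_mask = causal_mask(decoder_input.size(1)).type_as(encoder_mask).to(device)

            out = model.decode(encoder_output, encoder_mask, decoder_input, decoder_mask)

            # Project only the last position and take the most probable token (greedy search)
            prob = model.project(out[:, -1])
            _, next_word = torch.max(prob, dim=1)

            # Append the predicted token: it becomes part of the next decoder input
            decoder_input = torch.cat([
                decoder_input,
                torch.empty(1, 1).type_as(encoder_input).fill_(next_word.item()).to(device),
            ], dim=1)

            if next_word == eos_idx:
                break

        # Remove the batch dimension before returning
        return decoder_input.squeeze(0)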

And if we have already processed the number of examples we want, then we just break. Why have we created these lists? Because we could also send all of this to TensorBoard, so for example if we have TensorBoard enabled we can send all of this to it, and to do that we need another library that allows us to calculate some metrics. I think we can skip this part, but if you are really interested, in the code I published on GitHub you will find that I use a library called torchmetrics, which allows us to calculate the character error rate, the BLEU metric, which is really useful for translation tasks, and the word error rate. So if you are really interested you can find the code on GitHub, but for our demonstration I think it's not necessary, and actually we can also remove this part given that we are not doing it.

Okay, now that we have our run validation method we can just call it. What I usually do is run the validation every few steps, but because we want to see it as soon as possible, we will first run it at every iteration, and we also put this model.train inside the loop, so that every time after we run the validation the model goes back into its training mode. So now we can just call run validation and give it all the parameters that it needs: the model... okay, for printing messages, are we printing any message? We are, so let's create a lambda that writes the message with tqdm. Then we need to give the global step and the writer, which we will not use, but okay.

Now I think we can run the training again and see if the validation works. All right, looks like it is working. The model is running the validation at every step, which is not desirable at all, but at least we know that the greedy search is working, or at least it looks like it is working. The model is not predicting anything useful, actually it's just predicting a bunch of commas, because it's not trained at all, but if we train the model, after a few epochs we should see it become better and better. So let's stop this training and put this call back where it belongs, so at the end of every epoch here, and this one we can keep here, no problem.

Okay, I will now fast forward to a model that has been pre-trained. I pre-trained it for a few hours so that we can inference it and visualize the attention. I have copied the pre-trained weights that I pre-calculated, and I also created this notebook reusing the functions that we defined before in the train file. The code is very simple actually, I just copied and pasted the code from the train file: I load the model and run the validation, the same method that we just wrote, and then I ran the validation on the pre-trained model. Let's run it again, for example, and as you can see the model is inferencing 10 example sentences and the result is not bad. We can see that "Levin smiled", "Levin sorrise", it's matching, and most of them are matching actually. We could even say that it's nearly overfit for this particular data, but this is the power of the transformer: I didn't train it for many days, I just trained it for a few hours if I remember correctly, and the results are really, really good.

And now let's make the notebook that we will use to visualize the attention of this pre-trained model, given the file that we built before, so train.py. You can also train your own model choosing the language of your choice, which I highly recommend: change the language and try to see how the model is performing, try to diagnose why the model is performing badly if it is, or, if it's performing well, try to understand how you can improve it further. So let's try to visualize the attention. Let's create a new notebook, let's call it, say, attention visualization. The first thing we do is import all the libraries we will need.

I will also be using a library called Altair. It's a visualization library for charts, nothing related to deep learning; it's just for visualization, and the particular visualization function I actually found online, it's not written by me, just like most of the visualization functions you can easily find on the internet if you want to build a chart or a histogram, etc. So I am using this library mostly because I copied the visualization code from the internet, but all the rest is my own code. Let's import all of this, and of course you will have to install this particular library when you run the code on your computer. Let's also define the device (you can just copy the code from here), and then we load the model, which we can copy from here like this. Let's paste it here, and this one becomes vocabulary source and vocabulary target. Okay, now let's make a function called load_batch, which is a function that loads the batch, and oops.

I will convert the batch into tokens using the tokenizer. (keyboard clicking) And of course, for the decoder, we use the target vocabulary, so the target tokenizer. (keyboard clicking) So let's just infer using our greedy decode algorithm.

So we provide the model. (keyboard clicking) We return all this information. (keyboard clicking) Okay, now I will build the necessary functions to visualize the attention. I will copy some functions from another file because actually what we are going to build is nothing interesting from a learning point of view with regards to the deep learning.

It's mostly functions to visualize the data. So I will copy it because it's quite long to write and the salient part I will explain, of course. And this is the function. Okay, what does this function do? Basically, we have the attention that we will get from the encoder. How to get the attention from the encoder?

For example, we have attention in three positions. The first is in the encoder; the second one is at the beginning of the decoder, so the self-attention of the decoder; and then we have the cross-attention between the encoder and the decoder. So we can visualize three types of attention.

How do we get the information about the attention? Well, from the loaded model we take the encoder, we choose which layer we want to get the attention from, and then from each layer we can get the self-attention block and its attention scores. Where does this variable come from?

If you remember, when we defined the attention calculation, we not only return the output to the next layer, we also keep the attention scores, which are the output of the softmax, and we save them in this variable, self.attention_scores. Now we can just retrieve them and visualize them.
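For reference, one way this extraction might look, assuming the attribute names we used when building the model (the encoder's and decoder's lists of layers, self_attention_block / cross_attention_block, and the attention_scores saved inside them); adjust to your own names if they differ:

    def get_attention_scores(model, attn_type: str, layer: int, head: int):
        # attention_scores has shape (batch, heads, seq_len, seq_len); pick one head
        if attn_type == 'encoder':
            attn = model.encoder.layers[layer].self_attention_block.attention_scores
        elif attn_type == 'decoder':
            attn = model.decoder.layers[layer].self_attention_block.attention_scores
        else:  # cross-attention between encoder and decoder
            attn = model.decoder.layers[layer].cross_attention_block.attention_scores
        return attn[0, head].data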

So this function, based on which attention we want to get, from which layer and from which head, will select the correct matrix. This other function builds a data frame to visualize the information, so the tokens and the scores extracted from that matrix: from the matrix we extract the row and the column.

And then we also build the chart. The chart is built with Altair. And what we will build, actually, is a method to get the attention for all the heads and all the layers that we pass to this function as input.

So let me run this cell now. Okay, let's create a new cell and then let's just run it. First we want to visualize the sentence that we are dealing with, so the batch and the encoder input tokens. So we load a batch, and then we visualize what the source and the target are.

(keys clacking) And then also the target. And finally we calculate also the length. What is the length? It's basically all the tokens that come before the padding token, so the first occurrence of the padding token. Because this batch is taken from the dataset, which is already the tensor built for training, it already includes the padding.

In our case, we just want to retrieve the number of actual words in our sentence, so we count the words that come before the first padding token. So let's run this one. And there is some problem.

(keys clacking) Ah, here I forgot something, this function was wrong, so now it should work. Okay, this sentence is too short, let's get a longer one. Okay, let me check the quality: "You cannot remain as you are, especially you." (speaking in Italian) Okay, looks not bad. Okay, let's print the attention for the layers, let's say, zero, one and two.

Because we have six of them, if you remember: the parameter N is equal to six. So we will just visualize three layers, and we will visualize all the heads; we have eight of them for each layer, so the heads number zero, one, two, three, four, five, six and seven.

Okay, let's first visualize the encoder self-attention. And we do get all attention maps. Which one we want? So the encoder one. And we want these layers and these heads. And what are the row tokens? The encoder input tokens. And what do we want in the column? Because we are gonna build a grid.

So as you know, the attention is a grid that correlates rows with columns. In our case, we are talking about the self-attention of the encoder. So it's the same sentence that is attending itself. So we need to provide the input sentence of the encoder on both the rows and the columns.

And what is the maximum number of tokens that we want to visualize? Okay, let's say we want to visualize no more than 20, so the minimum of 20 and the sentence length. Okay, this is our visualization. As we expected, when we visualize the attention we expect the values along the diagonal to be high, because it's the dot product of each token with itself.

And we can also see that there are other interesting relationships. For example, we see that the start of sentence token and the end of sentence token, at least for head zero of layer zero, are not related to other words, as I would actually expect. But other heads do learn some very small mapping for them.

If we hover over each of the grid cells, we can see the actual value of the self-attention score. For example, we can see the attention is very strong here between "especially" and "especially", so it's the same word with itself, but also between "especially" and "now".

And we can visualize this kind of attention for all the layers. Because we are distributing the word embedding among the heads equally, each head will see a different part of the embedding of each word, so each head will watch a different aspect of each word. We also hope that they learn different kinds of mappings between the words.

And this is actually the case. And between one layer and the next we also have different WQ, WK and WV matrices, so they should also learn different relationships. Now we may also want to visualize the attention of the decoder, so let's do it. Let me just copy the code and just change the parameters.

Okay, here we want the decoder one, we want the same layers, et cetera, but the tokens that will be on the rows and the columns are the decoder tokens. So decoder input tokens and decoder input tokens. Let's visualize. And also we should see Italian language now because we are using the decoder self-attention.

And it is. So here we see a different kind of attention on the decoder side. And also here we have multiple heads that should learn different mapping and also different layers should learn different mappings between words. The one I find most interesting is the cross attention. So let's have a look at that.

Okay, let me just copy the code and run it again. Okay, so if you remember the method, it's encoder, decoder, same layer. So here on the rows we will show the encoder input and on the columns we will show the decoder input tokens because it's a cross attention between the encoder and the decoder.

Okay, this is more or less how the interaction between the encoder and the decoder works and how it happens. So this is where we find the cross attention calculated using the keys and the values coming from the encoder while the query is coming from the decoder. So this is actually where the translation task happens.

And this is how the model learns to relate these two sentences to each other to actually calculate the translation. So I invite you guys to run the code by yourself. So the first suggestion I give you is to write the code along with me with the video. You can pause the video, you can write the code by yourself.

Okay, let me give you some practical examples. For example, when I'm writing the model code, I suggest you watch me write the code for one particular layer and then stop the video, write it by yourself, take some time. Don't watch the solution right away. Try to figure out what is going wrong.

And if, after one or two minutes, you really cannot figure out what the problem is, you can have a glimpse at the video. But try to do it by yourself. Some things, of course, you cannot come up with by yourself; for example, the positional encoding and all this calculation is basically just an application of formulas.

But the point is that you should at least be able to come up with the structure by yourself, so how all the layers interact with each other. This is my first recommendation. As for the training loop, the training part is actually quite standard, so it's very similar to other training loops that you may have seen.

The interesting part is how we calculate the loss and how we use the transformer model. And the last thing that is really important is how we inference the model, which is in this greedy decode. So thank you everyone for watching the video and for staying with me for so long.

I can assure you that it was worth it. And I hope in the next videos to make more examples of transformers and other models that I am familiar with and I also want to explore with you guys. So let me know if there is something that you don't understand or you want me to explain better.

I will also for sure follow the comment section, so please write to me. Thank you and have a nice day.