
Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.


Chapters

0:00 Introduction
1:20 Input Embeddings
4:56 Positional Encodings
13:30 Layer Normalization
18:12 Feed Forward
21:43 Multi-Head Attention
42:41 Residual Connection
44:50 Encoder
51:52 Decoder
59:20 Linear Layer
61:25 Transformer
77:00 Task overview
78:42 Tokenizer
91:35 Dataset
115:25 Training loop
140:05 Validation loop
161:30 Attention visualization

Whisper Transcript

00:00:00.000 | Hello guys, welcome to another episode about the transformer. In this episode we will be building the transformer from scratch using PyTorch
00:00:07.880 | so coding it from zero. We will be building the model and we will also build the code for training it for inferencing and for visualizing the attention scores
00:00:16.780 | stick with me because it's gonna be a long video but I assure you that by the end of the video you will have a deep knowledge of the transformer model
00:00:24.800 | not only from a conceptual point of view but also from a practical point of view
00:00:29.120 | we will be building a translation model which means that our model will be able to translate from one language to another
00:00:36.300 | I chose a dataset that is called Opus Books, and it is made up of sentences taken from famous books
00:00:43.580 | I chose English to Italian because I'm Italian, so I can understand it and I can tell if the translation is good or not
00:00:51.520 | but I will show you at which point you can change the language, so you can test the same model with the language of your choice
00:00:58.800 | let's get started! Let's open the IDE of our choice, in my case I really love Visual Studio Code
00:01:05.100 | and let's create our first file which is the model of the transformer
00:01:10.100 | okay, let's go have a look at the transformer model first so we know which part we are going to build first
00:01:18.600 | and then we will build each part one by one
00:01:21.100 | the first part that we will be building is the input embeddings
00:01:24.600 | as you can see the input embeddings take the input and convert it into an embedding
00:01:29.100 | what is the input embedding?
00:01:30.600 | as you remember from my previous video, the input embeddings allow us to convert the original sentence into vectors of 512 dimensions
00:01:40.100 | for example in this sentence "your cat is a lovely cat" first we convert the sentence into a list of input IDs
00:01:48.300 | that is numbers that correspond to the position of each word inside the vocabulary
00:01:53.900 | and then each of these numbers corresponds to an embedding, which is a vector of size 512
00:02:00.200 | so let's build this layer first
00:02:02.200 | the first thing we need to do is to import Torch
00:02:05.700 | and then we need to create our class
00:02:15.500 | this is the constructor; we will need to tell it the dimension of the model
00:02:28.500 | so the dimension of the vector, which in the paper is called d_model
00:02:33.500 | and we also need to tell it the vocabulary size
00:02:38.500 | so how many words there are in the vocabulary
00:02:43.300 | (typing)
00:03:00.300 | save these two values and now we can create the actual embedding
00:03:04.300 | actually PyTorch already provides a layer that does exactly what we want to do
00:03:10.300 | that is, given a number, it will give you the same vector every time
00:03:15.300 | and this is exactly what embedding does
00:03:17.300 | it's just a mapping between numbers and a vector of size 512
00:03:21.300 | 512 here, in our case, is the d_model
00:03:25.300 | so this is done by the embedding layer, nn.Embedding
00:03:30.300 | with vocab size and d_model
00:03:35.300 | let me check why my autocomplete is not working
00:03:40.300 | (typing)
00:03:46.300 | okay so now let's implement the forward method
00:03:49.300 | (typing)
00:03:54.300 | what we do in the embedding is that we just use the embedding layer provided by PyTorch to do this mapping
00:04:01.300 | so return self.embedding(x)
00:04:06.300 | now actually there is a little detail that is written on the paper
00:04:09.300 | that is let's have a look at the paper actually
00:04:11.300 | let's go here and if we check the embedding and softmax
00:04:14.300 | we will see that in this sentence in the embedding layer
00:04:17.300 | we multiply the weights of the embedding by square root of D model
00:04:21.300 | so what the authors do they take the embedding given by this embedding layer
00:04:28.300 | which I remind you is just a dictionary kind of layer
00:04:31.300 | that just maps numbers to the same vector every time
00:04:35.300 | and this vector is learned by the model
00:04:38.300 | so we just multiply this by math.sqrt of D model
00:04:44.300 | (typing)
00:04:47.300 | you also need to import math
00:04:50.300 | okay now the input embeddings are ready
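For reference, here is a minimal sketch of the input embeddings module described above. It is an approximation of what is being typed in the video; the class name InputEmbeddings follows the spoken description and may not match the file exactly.

```python
import math
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # Maps each token id to a learned vector of size d_model (512 in the paper)
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # (batch, seq_len) -> (batch, seq_len, d_model), scaled by sqrt(d_model) as the paper describes
        return self.embedding(x) * math.sqrt(self.d_model)
```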
00:04:54.300 | let's go to the next module
00:04:56.300 | the next module we are going to build is the positional encoding
00:04:59.300 | let's have also a look at what are the positional encoding very fast
00:05:03.300 | so we saw before that our original sentence gets mapped to a list of vectors
00:05:08.300 | by the embeddings layer
00:05:12.300 | and this is our embeddings
00:05:14.300 | now we want to convey to the model the information about the position of each word inside the sentence
00:05:21.300 | and this is done by adding another vector of the same size as the embedding
00:05:25.300 | so of size 512
00:05:27.300 | that includes some special values given by a formula that I will show later
00:05:31.300 | that tells the model that this particular word occupies this position in the sentence
00:05:36.300 | so we will create these vectors called the position embedding
00:05:40.300 | and we will add them to the embedding
00:05:42.300 | okay let's go do it
00:05:43.300 | okay let's define the class positional encoding
00:05:46.300 | (typing)
00:05:52.300 | and we define the constructor
00:05:55.300 | okay what we need to give to the constructor is for sure the D model
00:06:00.300 | because this is the size of the vector that the positional encoding should be
00:06:04.300 | and the sequence length, which is the maximum length of the sentence
00:06:09.300 | because we need to create one vector for each position
00:06:13.300 | and we also need to give the dropout
00:06:16.300 | dropout is used to make the model overfit less
00:06:20.300 | (typing)
00:06:37.300 | okay let's actually build a positional encoding
00:06:41.300 | okay first of all the positional encoding is a
00:06:44.300 | we will build a matrix of shape sequence length by d_model
00:06:47.300 | why sequence length by d_model?
00:06:49.300 | because we need vectors of D model size so 512
00:06:54.300 | but we need sequence length number of them
00:06:57.300 | because the maximum length of the sentence is sequence length
00:07:00.300 | so let's do it
00:07:03.300 | (typing)
00:07:15.300 | okay before we create the matrix and we know how to create the matrix
00:07:20.300 | let's have a look at the formula used to create the positional encoding
00:07:23.300 | so let's go have a look at the formula used to create the positional encoding
00:07:27.300 | this is the slide from my previous video
00:07:30.300 | and let's have a look at how to build the vectors
00:07:33.300 | so as you remember we have a sentence let's say in this case we have three words
00:07:36.300 | we use these two formulas taken from the paper
00:07:40.300 | we create a vector of size 512
00:07:43.300 | and one for each possible position so up to sequence length
00:07:48.300 | and in the even positions we apply the first formula
00:07:52.300 | in the odd positions of the vector we apply the second formula
00:07:56.300 | in this case I will actually simplify the calculation
00:08:00.300 | because I saw online it has been simplified also
00:08:03.300 | so we will do a slightly modified calculation using log space
00:08:08.300 | this is for numerical stability
00:08:10.300 | so when you apply the exponential and then the log of something inside the exponential
00:08:15.300 | the result is the same number but it's more numerically stable
00:08:19.300 | so first we create a vector called position
00:08:22.300 | that will represent the position of the word inside the sentence
00:08:26.300 | and this vector can go from 0 to sequence length -1
00:08:36.300 | [typing]
00:08:53.300 | so actually we are creating a tensor of shape sequence length by 1
00:09:00.300 | [typing]
00:09:10.300 | okay now we create the denominator of the formula
00:09:35.300 | and these are the two terms we see inside the formula
00:09:39.300 | let's go back to the slide
00:09:40.300 | so the first tensor that we build that's called position
00:09:44.300 | it's this pause here and the second tensor that we build is the denominator here
00:09:48.300 | but we calculated it in log space for numerical stability
00:09:52.300 | the value actually will be slightly different but the result will be the same
00:09:56.300 | the model will learn this positional encoding
00:09:58.300 | don't worry if you don't fully understand this part
00:10:01.300 | it's just very special let's say functions that convey this positional information to the model
00:10:07.300 | and if you watched my previous video you will also understand why
00:10:11.300 | now we apply the position and the denominator to the sine and the cosine
00:10:15.300 | as you remember the sine is only used for the even positions
00:10:19.300 | and the cosine only for the odd positions
00:10:21.300 | so we will apply it twice
00:10:23.300 | let's do it
00:10:24.300 | so apply
00:10:26.300 | [typing]
00:10:32.300 | so every position will have the sine but only
00:10:37.300 | so every word will have the sine, but only in the even dimensions
00:10:41.300 | so starting from 0 up to the end and going forward by 2 means
00:10:46.300 | every from 0 then the number 2 then the number 4 etc etc
00:10:51.300 | [typing]
00:10:54.300 | position multiplied by div_term
00:10:57.300 | [typing]
00:10:59.300 | then we do the same for the cosine
00:11:02.300 | [typing]
00:11:06.300 | in this case we start from 1 and go forward by 2
00:11:09.300 | it means 1, 3, 5 etc
00:11:12.300 | [typing]
00:11:18.300 | and then we need to add the batch dimension to this tensor
00:11:22.300 | so that we can apply it to the whole sentences
00:11:25.300 | so to all the batch of sentence
00:11:27.300 | because now the shape is sequence length by d_model
00:11:29.300 | but we will have a batch of sentences
00:11:31.300 | so what we do is we add a new dimension to this PE
00:11:35.300 | [typing]
00:11:37.300 | and this is done using unsqueeze
00:11:39.300 | and in the first position
00:11:41.300 | so it will become a tensor of shape 1 by sequence length by d_model
00:11:48.300 | and finally we can register this tensor in the buffer of this module
00:11:53.300 | so what is the buffer of the module
00:11:55.300 | let's first do it
00:11:57.300 | register buffer
00:11:59.300 | [typing]
00:12:02.300 | so basically when you have a tensor that you want to keep inside the module
00:12:07.300 | not as a parameter, learned parameter
00:12:09.300 | but you want it to be saved when you save the file of the module
00:12:13.300 | you should register it as a buffer
00:12:15.300 | this way the tensor will be saved in the file along with the state of the module
00:12:19.300 | then we do the forward method
00:12:22.300 | [typing]
00:12:26.300 | so as you remember from before
00:12:28.300 | we need to add this positional encoding to every word inside the sentence
00:12:33.300 | so let's do it
00:12:35.300 | so we just do x is equal to x
00:12:37.300 | plus the positional encoding for this particular sentence
00:12:41.300 | [typing]
00:12:51.300 | and we also tell the module that we don't want to learn this positional encoding
00:12:56.300 | because they are fixed
00:12:58.300 | they will always be the same
00:13:00.300 | they are not learned along the training process
00:13:02.300 | so we just do it
00:13:04.300 | requires_grad set to false
00:13:06.300 | this will make this particular tensor not learned
00:13:10.300 | and then we apply the dropout
00:13:12.300 | [typing]
00:13:16.300 | and that's it, this is the positional encoding
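A sketch of the positional encoding module as described above, with the denominator computed in log space; the exact code typed in the video may differ slightly, but the shapes and the register_buffer/requires_grad details follow the explanation.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(seq_len, d_model)                                    # (seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
        # Denominator of the formula, computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        pe = pe.unsqueeze(0)                           # (1, seq_len, d_model) for the batch dimension
        # Saved with the module's state but not treated as a learned parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the fixed positional encodings for the first x.shape[1] positions
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)
```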
00:13:18.300 | let's have a look at the next module
00:13:20.300 | first we will build the encoder part of the transformer
00:13:24.300 | which is this left side here
00:13:26.300 | and we still have to build the multi-head attention, the add & norm and the feed forward
00:13:31.300 | and actually there is another component that manages the skip connections around these sublayers
00:13:36.300 | so let's start with the easiest one
00:13:38.300 | let's start with layer normalization
00:13:39.300 | which is this add and norm
00:13:41.300 | as you remember from my previous video
00:13:43.300 | let's have a look at the layer normalization
00:13:45.300 | a little briefing
00:13:47.300 | layer normalization basically means that if you have a batch of n items
00:13:50.300 | in this case only 3
00:13:52.300 | each item will have some features
00:13:54.300 | let's say that these are actually sentences
00:14:00.300 | and each sentence is made up of many words, each represented by numbers
00:14:00.300 | so this is our 3 items
00:14:03.300 | and layer normalization means that for each item in this batch
00:14:07.300 | we calculate a mean and a variance
00:14:09.300 | independently from the other items of the batch
00:14:12.300 | and then we calculate the new values for each of them using their own mean and their own variance
00:14:18.300 | in the layer normalization usually we also introduce some parameters
00:14:23.300 | called gamma and beta
00:14:25.300 | some people call it alpha and beta
00:14:27.300 | some people call it alpha and bias
00:14:29.300 | ok, it doesn't matter
00:14:30.300 | one is multiplicative, so it's multiplied by each of these x
00:14:33.300 | and one is additive, so it's added to each one of these x
00:14:38.300 | because we want the model to have the possibility to amplify these values
00:14:43.300 | when it needs these values to be amplified
00:14:46.300 | so the model will learn to multiply this gamma by these values
00:14:51.300 | in such a way to amplify the values that it wants to be amplified
00:14:54.300 | ok, let's go to build the code for this layer
00:14:57.300 | let's define the layer normalization class
00:15:06.300 | and constructor as usual
00:15:09.300 | in this case we don't need any parameter except for one
00:15:14.300 | that I will show you now
00:15:16.300 | which is epsilon
00:15:18.300 | and usually EPS stands for epsilon
00:15:21.300 | which is a very small number that you need to give to the model
00:15:24.300 | and I will also show you why we need this number
00:15:27.300 | in this case we use 10 to the power of -6
00:15:31.300 | let's save it
00:15:33.300 | ok, this epsilon is needed because if we look at the slide
00:15:37.300 | we have this epsilon here in the denominator of this formula here
00:15:41.300 | so x hat is equal to x_j minus mu
00:15:45.300 | divided by the square root of sigma square plus epsilon
00:15:49.300 | why we need this epsilon?
00:15:51.300 | because imagine this denominator
00:15:53.300 | if sigma happens to be 0 or very close to 0
00:15:57.300 | this x new will become very big
00:16:00.300 | which is undesirable
00:16:01.300 | as we know, the CPU or the GPU can only represent numbers
00:16:05.300 | up to a certain precision and scale
00:16:08.300 | so we don't want very big numbers or very small numbers
00:16:11.300 | so usually for numerical stability we use this epsilon
00:16:14.300 | also to avoid division by 0
00:16:16.300 | let's go forward
00:16:17.300 | so now let's introduce the two parameters
00:16:19.300 | that we will use for the layer normalization
00:16:21.300 | one is called alpha which will be multiplied
00:16:23.300 | and one is bias which will be added
00:16:25.300 | usually the additive is called bias
00:16:28.300 | it's always added
00:16:29.300 | and the alpha is the one that is multiplied
00:16:31.300 | in this case we will use nn.parameter
00:16:36.300 | this makes the parameter learnable
00:16:39.300 | and we define also the bias
00:16:46.300 | this I want to remind you is multiplied
00:16:53.300 | and this is added
00:16:58.300 | let's define the forward
00:17:03.300 | as you remember
00:17:04.300 | we need to calculate the mean and the standard deviation
00:17:06.300 | or the variance for both of these
00:17:08.300 | we will calculate the standard deviation
00:17:10.300 | of the last dimension
00:17:17.300 | so everything after the batch
00:17:20.300 | and we keep the dimension
00:17:24.300 | so this parameter keep dimension means that
00:17:27.300 | usually the mean cancels the dimension to which it is applied
00:17:31.300 | but we want to keep it
00:17:33.300 | and then we just apply the formula that we saw on the slide
00:17:50.300 | alpha multiplied by what?
00:17:52.300 | x minus its mean
00:17:55.300 | divided by the standard deviation
00:17:58.300 | plus self.eps
00:18:01.300 | everything added to bias
00:18:04.300 | and this is our layer normalization
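A sketch of the layer normalization described above, using the standard deviation, a learnable multiplicative alpha, an additive bias and eps for numerical stability:

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                              # avoids division by zero when the std is tiny
        self.alpha = nn.Parameter(torch.ones(1))    # multiplicative, learnable
        self.bias = nn.Parameter(torch.zeros(1))    # additive, learnable

    def forward(self, x):
        # Mean and std over the last dimension only (everything after the batch)
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias
```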
00:18:09.300 | okay let's go have a look at the next layer we are going to build
00:18:12.300 | the next layer we are going to build is the feed forward
00:18:15.300 | you can see here
00:18:16.300 | and the feed forward is basically a fully connected layer
00:18:21.300 | that the model uses both in the encoder and in the decoder
00:18:26.300 | let's first have a look at the paper to see
00:18:28.300 | what are the details of this feed forward layer
00:18:31.300 | in the paper
00:18:32.300 | the feed forward layer is basically two matrices
00:18:35.300 | one w1 one w2 that are multiplied by this x
00:18:39.300 | one after another with a relu in between and with a bias
00:18:43.300 | we can do this in PyTorch using a linear layer
00:18:47.300 | in which we define the first one
00:18:50.300 | to be the matrix with the w1 and b1
00:18:53.300 | and the second one to be the w2 and the b2
00:18:56.300 | and in between we apply a relu
00:18:58.300 | in the paper we can also see the dimensions of these matrices
00:19:02.300 | so the first one is basically d model to dff
00:19:06.300 | and the second one is from dff to d model
00:19:09.300 | so dff is 2048 and d model is 512
00:19:13.300 | let's go build it
00:19:14.300 | class feed forward block
00:19:18.300 | we also build in this case the constructor
00:19:24.300 | and in the constructor we need to define these two values
00:19:30.300 | that we saw on the paper
00:19:31.300 | so d model dff and also in this case dropout
00:19:36.300 | we define the first matrix so w1 and b1
00:19:50.300 | to be the linear one
00:19:52.300 | and it's from d model to dff
00:19:58.300 | and then we apply the dropout
00:20:00.300 | actually we define the dropout
00:20:04.300 | and then we define the second matrix w2 and b2
00:20:11.300 | so let me write the comments here
00:20:13.300 | so the first one is w1 and b1
00:20:16.300 | and the second one, from dff to d model,
00:20:25.300 | is w2 and b2
00:20:28.300 | why do we have b2?
00:20:29.300 | because, as you can see here, bias is true by default
00:20:34.300 | so it's already defining a bias for us
00:20:37.300 | okay let's define the forward method
00:20:43.300 | in this case what we are going to do is
00:20:47.300 | we have an input sentence which is batch
00:20:51.300 | it's a tensor with dimension batch sequence length and d model
00:20:57.300 | first we will convert it using linear 1
00:21:00.300 | into another tensor of batch to sequence length to dff
00:21:07.300 | because if we apply this linear it will convert the d model into dff
00:21:11.300 | and then we apply the linear 2 which will convert it back to d model
00:21:16.300 | we apply the dropout in between
00:21:30.300 | and this is our feed forward block
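A sketch of the feed forward block as described above: two linear layers (d_model -> d_ff -> d_model) with a ReLU and dropout in between.

```python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)   # W1 and b1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)   # W2 and b2

    def forward(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
```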
00:21:40.300 | let's go have a look at the next block
00:21:42.300 | our next block is the most important and most interesting one
00:21:46.300 | and it's the multi-head attention
00:21:48.300 | we saw briefly in the last video how the multi-head attention works
00:21:54.300 | so I will open now the slide again to show to rehearse how it actually works
00:22:00.300 | and then we will do it practically by coding
00:22:03.300 | as you remember in the encoder we have the multi-head attention
00:22:06.300 | that takes the input of the encoder and uses it three times
00:22:11.300 | one time it's called query, one time it's called key and one time it's called values
00:22:16.300 | you can also think it like a duplication of the input three times
00:22:20.300 | or you can just say that it's the same input applied three times
00:22:23.300 | and the multi-head attention basically works like this
00:22:26.300 | we have our input sequence which is sequence length by d model
00:22:30.300 | we transform into three matrices q, k and v
00:22:35.300 | which are exactly the same as the input in this case
00:22:38.300 | because we are talking about the encoder
00:22:40.300 | you see that in the decoder it's a slightly different
00:22:42.300 | and then we multiply this by matrices called Wq, Wk and Wv
00:22:49.300 | and this results in a new matrix of dimension sequence by d model
00:22:54.300 | we then split these matrices into h matrices, smaller matrices
00:22:59.300 | why h? because it's the number of heads we want for this multi-head attention
00:23:03.300 | and we split these matrices along the embedding dimension
00:23:07.300 | not along the sequence dimension
00:23:08.300 | which means that each head will have access to the full sentence
00:23:12.300 | but a different part of the embedding of each word
00:23:16.300 | we apply the attention to each of these smaller matrices using this formula
00:23:21.300 | which will give us smaller matrices as a result
00:23:24.300 | then we combine them back
00:23:26.300 | so we concatenate them back just like the paper says
00:23:30.300 | so concatenation of head one up to head h
00:23:33.300 | and finally we multiply it by Wo to get the multi-head attention output
00:23:38.300 | which again is a matrix that has the same dimension as the input matrix
00:23:43.300 | as you can see the output of the multi-head attention is also sequence by d_model
00:23:48.300 | in this slide actually I didn't show the batch dimension
00:23:51.300 | because we are talking about one sentence
00:23:53.300 | but when we code the transformer
00:23:55.300 | we don't work only with one sentence but with multiple sentences
00:23:58.300 | so we need to think that we have another dimension here which is the batch
00:24:03.300 | okay let's go to code this multi-head attention
00:24:07.300 | I will do it a little more slowly
00:24:09.300 | so we can see in detail everything how it's done
00:24:13.300 | but I really wanted you to have an overview again of how it works
00:24:17.300 | and why we are doing what we are doing
00:24:19.300 | so let's go code it
00:24:21.300 | class
00:24:29.300 | also in this case we define the constructor
00:24:33.300 | and what we need to give to this multi-head attention as parameter
00:24:37.300 | for sure the d model of the model which is in our case 512
00:24:42.300 | the number of heads which we call h just like in the paper
00:24:46.300 | so h indicates the number of heads we want and then the dropout value
00:24:52.300 | we save these values
00:25:00.300 | as you can see we need to divide this embedding vector into h heads
00:25:05.300 | which means that this d model should be divisible by h
00:25:08.300 | otherwise we cannot divide equally the same vector
00:25:13.300 | representing the embedding into equal matrices for each head
00:25:17.300 | so we make sure that d model is divisible by h basically
00:25:33.300 | and this will make the check
00:25:35.300 | if we watch again my slide we can see that the value d model divided by h is called dk
00:25:41.300 | as we can see here if we divide the d model by h heads
00:25:48.300 | we get a new value which is called dk
00:25:50.300 | and to be aligned with what the paper with the nomenclature used in the paper
00:25:54.300 | we will also call it dk
00:25:56.300 | so dk is d model divided by h
00:26:08.300 | okay let's also define the matrices by which we will multiply the query the key and the values
00:26:14.300 | and also the output matrix w o
00:26:21.300 | this again is a linear so from d model to d model
00:26:25.300 | why from d model to d model because as you can see from my slides
00:26:30.300 | this is d model by d model
00:26:33.300 | so that the output will be sequence by d_model
00:26:39.300 | so this is wq
00:26:48.300 | this is wk
00:27:00.300 | and this is wv
00:27:03.300 | finally we also have the output matrix which is called w o here
00:27:06.300 | this w_o is (h times d_v) by d_model
00:27:11.300 | so h times d_v; what is d_v?
00:27:13.300 | dv is actually equal to dk
00:27:15.300 | because it's the d model divided by h
00:27:18.300 | but why it's called dv here and dk here?
00:27:21.300 | because this head is actually the result of this multiplication here
00:27:27.300 | and the last multiplication is by v
00:27:29.300 | and in the paper they call this value dv
00:27:32.300 | but on a practical level it's equal to dk
00:27:36.300 | so our w o is also a matrix that is d model by d model
00:27:41.300 | because h times d_v is equal to d_model
00:27:52.300 | and this is w o
00:27:54.300 | finally we create the dropout
00:28:02.300 | let's implement the forward method
00:28:04.300 | and let's see how the multi head attention works in detail during the coding process
00:28:11.300 | we define the query, the key and the values
00:28:16.300 | and there is this mask
00:28:18.300 | so what is this mask?
00:28:20.300 | the mask is basically if we want some words to not interact with other words
00:28:26.300 | we mask them
00:28:28.300 | and we saw in my previous video but now let's go back to those slides
00:28:31.300 | to see what is the mask doing
00:28:33.300 | as you remember when we calculate the attention
00:28:36.300 | using this formula
00:28:37.300 | so softmax of q multiplied by kt
00:28:40.300 | divided by square root of dk and then by v
00:28:43.300 | we get this head matrix
00:28:46.300 | but before we multiply by v
00:28:48.300 | so only this multiplication here with q by k
00:28:53.300 | we get this matrix
00:28:54.300 | which is each word with each other word
00:28:57.300 | it's a sequence by sequence matrix
00:29:00.300 | and if we don't want some words to interact with other words
00:29:03.300 | we basically replace their value
00:29:05.300 | so their attention score
00:29:07.300 | with something that is very small
00:29:09.300 | before we apply the softmax
00:29:11.300 | and when we apply the softmax
00:29:13.300 | these values will become zero
00:29:15.300 | because as you remember the softmax on the numerator has e to the power of x
00:29:19.300 | so if x goes to minus infinity
00:29:22.300 | so a very small number
00:29:23.300 | e to the power of minus infinity
00:29:25.300 | will become very small
00:29:27.300 | so very close to zero
00:29:28.300 | so basically we hide the attention for those two words
00:29:33.300 | so this is the job of the mask
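To make the effect of the mask concrete, here is a tiny standalone example (the scores and the mask are made up for illustration): replacing a score with a very large negative number before the softmax drives its probability to practically zero.

```python
import torch

scores = torch.tensor([[2.0, 1.0, 0.5]])   # fake attention scores for one query
mask = torch.tensor([[1, 1, 0]])           # 0 marks an interaction we want to hide
masked = scores.masked_fill(mask == 0, -1e9)
print(masked.softmax(dim=-1))              # the last probability is ~0
```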
00:29:37.300 | just following my slide
00:29:39.300 | we do the multiplication one by one
00:29:41.300 | so as we remember
00:29:42.300 | we calculate first
00:29:43.300 | the query multiplied by Wq
00:29:46.300 | so self.wq multiplied with the query
00:29:50.300 | gives us a new matrix
00:29:52.300 | which is called the q prime in my slides
00:29:54.300 | I just call it query here
00:29:59.300 | we do the same with the keys
00:30:01.300 | and the same with the values
00:30:07.300 | let me also write the dimensions
00:30:09.300 | so we are going from
00:30:11.300 | batch, sequence length, d_model
00:30:16.300 | with this multiplication
00:30:17.300 | we are going to another matrix
00:30:19.300 | which is batch, sequence length, d_model
00:30:23.300 | and you can see that from the slides
00:30:25.300 | so when we do sequence by d_model
00:30:28.300 | multiplied by d_model by d_model
00:30:30.300 | we get a new matrix
00:30:31.300 | which has the same dimension as the initial matrix
00:30:33.300 | so sequence by d_model
00:30:36.300 | and it's the same for all three of them
00:30:43.300 | now what we want to do is
00:30:45.300 | we want to divide this query key and value
00:30:48.300 | into smaller matrices
00:30:49.300 | so that we can give each small matrix
00:30:52.300 | to a different head
00:30:54.300 | so let's do it
00:30:56.300 | we will divide into
00:30:58.300 | using the view method of PyTorch
00:31:00.300 | which means that we keep the batch dimension
00:31:03.300 | because we don't want to split the sentence
00:31:06.300 | we want to split the embedding
00:31:08.300 | into h parts
00:31:12.300 | we also want to keep the second dimension
00:31:14.300 | which is the sequence
00:31:15.300 | because we don't want to split it
00:31:17.300 | and the third dimension
00:31:19.300 | so the d_model
00:31:20.300 | we want to split it into two smaller dimensions
00:31:23.300 | which is h by d_k
00:31:25.300 | so self.h, self.d_k
00:31:30.300 | as you remember
00:31:31.300 | d_k is basically d_model
00:31:33.300 | divided by h
00:31:34.300 | so this multiplied by this
00:31:36.300 | gives you d_model
00:31:40.300 | and then we transpose
00:31:45.300 | one, two
00:31:46.300 | why do we transpose?
00:31:47.300 | because we prefer to have the h dimension
00:31:52.300 | instead of being the third dimension
00:31:54.300 | we want it to be the second dimension
00:31:56.300 | and this way
00:31:59.300 | each view, each head
00:32:00.300 | will see all the sentence
00:32:02.300 | so we'll see this dimension
00:32:03.300 | so the sequence length by d_k
00:32:07.300 | let me also write the comment here
00:32:11.300 | so we are going from
00:32:13.300 | batch, sequence length, d_model
00:32:16.300 | to batch, sequence length, h, d_k
00:32:24.300 | and then by using the transposition
00:32:26.300 | we are going to
00:32:28.300 | batch, h, sequence length, and d_k
00:32:33.300 | this is really important
00:32:34.300 | because we want each batch
00:32:39.300 | we want each head to watch this stuff
00:32:42.300 | so the sequence length by d_k
00:32:44.300 | which means that
00:32:45.300 | each head will see the full sentence
00:32:47.300 | so each word in the sentence
00:32:49.300 | but only a smaller part of the embedding
00:32:51.300 | we do the same thing for the key and the value
00:32:55.300 | [typing sounds]
00:33:21.300 | [typing sounds]
00:33:47.300 | ok, now that we have these smaller matrices
00:33:50.300 | so let me go back to the slide
00:33:52.300 | so I can show you where we are
00:33:54.300 | so we did this multiplication
00:33:56.300 | we obtained query, key, and values
00:33:58.300 | we split into smaller matrices
00:34:00.300 | now we need to calculate the attention
00:34:02.300 | using this formula here
00:34:04.300 | before we can calculate the attention
00:34:06.300 | let's create a function to calculate the attention
00:34:08.300 | so if we create a new function
00:34:10.300 | that can be used also later
00:34:13.300 | so self, attention
00:34:16.300 | let's define it as a static method
00:34:19.300 | [typing sounds]
00:34:24.300 | so static method means basically
00:34:26.300 | that you can call this function
00:34:28.300 | without having an instance of this class
00:34:30.300 | you can just say
00:34:31.300 | multi head attention block dot attention
00:34:33.300 | instead of having an instance of this class
00:34:36.300 | [typing sounds]
00:34:41.300 | we also give him the dropout layer
00:34:44.300 | ok, what we do is we get the d_k
00:34:49.300 | what is the d_k?
00:34:50.300 | it's the last dimension of the query, key, and the value
00:34:53.300 | [typing sounds]
00:34:58.300 | and we will be using this function here
00:35:00.300 | let me first call it
00:35:02.300 | so that you can understand how we will use it
00:35:04.300 | and then we define it
00:35:06.300 | so we want from this function
00:35:08.300 | we want two things, the output
00:35:10.300 | and we want the attention scores
00:35:12.300 | so the output of the softmax
00:35:15.300 | attention scores
00:35:17.300 | and we will call it like this
00:35:21.300 | so we give it the query, the key, the value, the mask
00:35:26.300 | and the dropout layer
00:35:28.300 | now let's go back here
00:35:31.300 | so we have the d_k
00:35:33.300 | now what we do is
00:35:34.300 | first we apply the first part of the formula
00:35:36.300 | that is the query multiplied by the transpose of the key
00:35:40.300 | divided by the square root of d_k
00:35:43.300 | so these are our attention scores
00:35:46.300 | [typing sounds]
00:35:49.300 | query matrix multiplication
00:35:52.300 | so this @ sign means matrix multiplication in PyTorch
00:35:55.300 | [typing sounds]
00:35:58.300 | we transpose the last two dimensions
00:36:01.300 | (-2, -1) means transpose the last two dimensions
00:36:04.300 | so this will become
00:36:06.300 | the last dimensions, sequence length by d_k,
00:36:09.300 | will become d_k by sequence length
00:36:12.300 | and then we divide this by math.sqrt of d_k
00:36:18.300 | before, as we saw before
00:36:21.300 | before applying the softmax
00:36:23.300 | we need to apply the mask
00:36:24.300 | so we want to hide some interaction between words
00:36:27.300 | we apply the mask
00:36:28.300 | and then we apply the softmax
00:36:30.300 | so the softmax will take care of the values that we replaced
00:36:33.300 | how do we apply the mask?
00:36:35.300 | we just
00:36:36.300 | all the values that we want to mask
00:36:38.300 | we replace them with very very small values
00:36:40.300 | so that the softmax will replace them with 0
00:36:43.300 | so if a mask is defined
00:36:46.300 | [typing sounds]
00:36:48.300 | apply it
00:36:49.300 | [typing sounds]
00:37:02.300 | this means basically
00:37:04.300 | replace all the values for which this statement is true
00:37:08.300 | with this value
00:37:10.300 | the mask we will define in such a way that
00:37:13.300 | where this value, this expression is true
00:37:16.300 | we want it to be replaced by this
00:37:18.300 | later we will see also how we will build the mask
00:37:21.300 | for now just take it for granted
00:37:23.300 | that these are all the values that we don't want
00:37:26.300 | to have in the attention
00:37:28.300 | so we don't want for example some word
00:37:31.300 | to watch future words
00:37:33.300 | for example when we will build a decoder
00:37:35.300 | or we don't want the padding values
00:37:37.300 | to interact with other values
00:37:39.300 | because they are just filler words to reach the sequence length
00:37:42.300 | we will replace them with minus 10
00:37:45.300 | to the power of 9, so -1e9
00:37:49.300 | which is a very big number in the negative range
00:37:53.300 | which basically represents -infinity
00:37:57.300 | and then when we apply now the softmax
00:38:00.300 | it will be replaced by 0
00:38:02.300 | [typing sounds]
00:38:09.300 | we apply it to this dimension
00:38:12.300 | ok, let me write some comments
00:38:14.300 | so in this case we have
00:38:17.300 | batch by h
00:38:20.300 | so each head will have
00:38:22.300 | a sequence length by sequence length matrix
00:38:26.300 | alright, if we also have a dropout
00:38:29.300 | so if dropout is not None
00:38:32.300 | we also apply the dropout
00:38:34.300 | [typing sounds]
00:38:41.300 | and finally as we saw in the original slide
00:38:44.300 | we multiply the output of the softmax
00:38:47.300 | by the V matrix
00:38:49.300 | matrix multiplication
00:38:51.300 | so we return
00:38:53.300 | attention scores multiplied by value
00:38:56.300 | and also the attention score itself
00:38:58.300 | so why are we returning a tuple?
00:39:00.300 | because we want this
00:39:02.300 | of course we need it for the model
00:39:04.300 | because we need to give it to the next layer
00:39:06.300 | but this will be used for visualization
00:39:09.300 | so the output of the self-attention
00:39:13.300 | so the multi-head attention in this case
00:39:15.300 | is actually going to be here
00:39:18.300 | and we will use it for visualizing
00:39:20.300 | so for visualizing what is the score
00:39:22.300 | given by the model for that particular interaction
00:39:25.300 | let me also write some comments here
00:39:29.300 | so here we are doing like this
00:39:32.300 | batch
00:39:34.300 | [typing sounds]
00:39:46.300 | and let's go back here
00:39:48.300 | now we have our multi-head attention
00:39:50.300 | so the output of the multi-head attention
00:39:52.300 | what we do is finally
00:39:54.300 | we, ok let's go back to the slide first
00:39:57.300 | where we are
00:39:59.300 | we calculated these smaller matrices here
00:40:02.300 | so we applied the softmax
00:40:04.300 | Q by KT
00:40:06.300 | divided by the square root of d_k
00:40:08.300 | and then we multiplied it also by V
00:40:10.300 | we can see it here
00:40:12.300 | which gives us this small matrix here
00:40:15.300 | head 1, head 2, head 3 and head 4
00:40:17.300 | now we need to combine them together
00:40:19.300 | concat, just like the formula says from the paper
00:40:22.300 | and finally multiply it by WO
00:40:24.300 | so let's do it
00:40:26.300 | [typing sounds]
00:40:33.300 | we transpose because
00:40:35.300 | before, when we transformed the matrix,
00:40:39.300 | we had the sequence length as the third dimension
00:40:41.300 | and we want it back in its original place
00:40:44.300 | to combine the heads
00:40:46.300 | because in the resulting tensor
00:40:48.300 | we want the sequence length to be in the second position
00:40:50.300 | so let me write it first
00:40:52.300 | what we want to do
00:40:54.300 | batch
00:40:56.300 | we started from this one
00:40:59.300 | sequence length
00:41:01.300 | first we do a transposition
00:41:04.300 | [typing sounds]
00:41:09.300 | and then what we want is this
00:41:12.300 | [typing sounds]
00:41:17.300 | so this transposition takes us here
00:41:20.300 | and then we do a view
00:41:24.300 | but we cannot do it
00:41:26.300 | we need to use contiguous
00:41:28.300 | this means basically that PyTorch
00:41:30.300 | to transform the shape of a tensor
00:41:33.300 | needs the memory to be contiguous
00:41:36.300 | so we can just do it in place
00:41:38.300 | [typing sounds]
00:41:48.300 | and self.h
00:41:50.300 | multiplied by self.dk
00:41:52.300 | which as you remember
00:41:56.300 | this is the d_model
00:41:59.300 | because we defined d_k to be
00:42:01.300 | the d_model
00:42:05.300 | divided by h
00:42:05.300 | and finally we multiply this x by wo
00:42:08.300 | which is our output matrix
00:42:10.300 | [typing sounds]
00:42:15.300 | this will give us
00:42:17.300 | we go from batch
00:42:20.300 | [typing sounds]
00:42:28.300 | and this is
00:42:30.300 | and this is our multi-head attention block
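Putting the whole walkthrough together, here is a sketch of the multi-head attention block, including the static attention function that also returns the scores for visualization. It is an approximation of the code typed in the video.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.h = h
        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model)  # Wq
        self.w_k = nn.Linear(d_model, d_model)  # Wk
        self.w_v = nn.Linear(d_model, d_model)  # Wv
        self.w_o = nn.Linear(d_model, d_model)  # Wo (h * d_v == d_model)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout):
        d_k = query.shape[-1]
        # (batch, h, seq_len, d_k) @ (batch, h, d_k, seq_len) -> (batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores.masked_fill_(mask == 0, -1e9)   # hide unwanted interactions
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        # Return the output and the scores themselves (kept for visualization)
        return attention_scores @ value, attention_scores

    def forward(self, q, k, v, mask):
        query, key, value = self.w_q(q), self.w_k(k), self.w_v(v)   # (batch, seq_len, d_model)
        # Split d_model into h heads of size d_k and move the head dimension before seq_len
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        # (batch, h, seq_len, d_k) -> (batch, seq_len, h, d_k) -> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
        return self.w_o(x)
```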
00:42:33.300 | we have I think all the ingredients now
00:42:35.300 | to combine them all together
00:42:37.300 | we just miss one small layer
00:42:39.300 | let's go have a look at it first
00:42:41.300 | there is one last layer we need to build
00:42:43.300 | which is the connection we can see here
00:42:45.300 | for example here we have some
00:42:47.300 | output of this layer, so addNorm
00:42:49.300 | that is taken here
00:42:51.300 | with this connection
00:42:53.300 | and this one part is sent here
00:42:55.300 | then the output of this is sent to the addNorm
00:42:57.300 | and then combined together by this layer
00:42:59.300 | so we need to create this
00:43:01.300 | layer that manages this skip connection
00:43:03.300 | so we take the input
00:43:05.300 | we give it to
00:43:07.300 | we skip it by one layer
00:43:09.300 | we take the output of the previous layer
00:43:11.300 | so in this case the multi-head attention
00:43:13.300 | we give it to this layer
00:43:15.300 | but also combining with this part
00:43:17.300 | so let's build this layer
00:43:19.300 | I will call it residual connection
00:43:22.300 | because it's basically a skip connection
00:43:24.300 | ok let's build this residual connection
00:43:27.300 | [typing]
00:43:35.300 | as usual we define the constructor
00:43:37.300 | and in this case we just need a dropout
00:43:40.300 | [typing]
00:43:51.300 | as you remember the skip connection
00:43:54.300 | is between the add and the norm
00:43:56.300 | and the previous layer
00:43:58.300 | so we also need the norm
00:44:00.300 | which is our layer normalization
00:44:02.300 | which we defined before
00:44:04.300 | and then we define the forward method
00:44:06.300 | [typing]
00:44:09.300 | and the sublayer which is the previous layer
00:44:12.300 | [typing]
00:44:14.300 | what we do is we take the X
00:44:16.300 | and we combine it
00:44:18.300 | with the output of the next layer
00:44:20.300 | which in this case is called sublayer
00:44:22.300 | [typing]
00:44:25.300 | and we apply the dropout
00:44:27.300 | [typing]
00:44:31.300 | so this is the definition of add and norm
00:44:33.300 | actually there is a slight difference
00:44:35.300 | that we first apply the normalization
00:44:37.300 | and then we apply the sublayer
00:44:39.300 | in the case of the paper
00:44:41.300 | they apply first the sublayer
00:44:43.300 | and then the normalization
00:44:45.300 | I saw many implementations
00:44:47.300 | and most of them actually did it like this
00:44:49.300 | so we will also stick with this particular choice
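A sketch of the residual (skip) connection as described, in its pre-norm form (normalization applied before the sublayer); it assumes the LayerNormalization class sketched earlier.

```python
import torch.nn as nn

class ResidualConnection(nn.Module):
    def __init__(self, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()   # sketched earlier

    def forward(self, x, sublayer):
        # Pre-norm variant: normalize first, apply the sublayer, then add the skip connection
        return x + self.dropout(sublayer(self.norm(x)))
```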
00:44:51.300 | as you remember
00:44:53.300 | we have these blocks
00:44:55.300 | are combined together
00:44:57.300 | by this bigger block here
00:44:59.300 | and we have N of them
00:45:01.300 | so this big block
00:45:03.300 | we will call it encoder block
00:45:05.300 | and each of this encoder block is repeated
00:45:07.300 | N times where the output
00:45:09.300 | of the previous is sent to the next one
00:45:11.300 | and the output of the last one is sent to the decoder
00:45:13.300 | so we need to create
00:45:15.300 | this block which will contain
00:45:17.300 | one multi head attention
00:45:19.300 | two add and norm
00:45:21.300 | and one feed forward
00:45:23.300 | so let's do it
00:45:25.300 | [typing]
00:45:27.300 | we will call this block
00:45:29.300 | the encoder block
00:45:31.300 | because the decoder has
00:45:33.300 | three blocks inside
00:45:35.300 | the encoder has only two
00:45:37.300 | [typing]
00:45:39.300 | [typing]
00:45:41.300 | [typing]
00:45:43.300 | [typing]
00:45:45.300 | and as we
00:45:47.300 | saw before
00:45:49.300 | we have the self attention block
00:45:51.300 | inside which is the multi head attention
00:45:53.300 | we call it self attention because
00:45:55.300 | in the case of the encoder
00:45:57.300 | it is applied to the
00:45:59.300 | same input with
00:46:01.300 | three different roles
00:46:03.300 | the role of query, of the key and the value
00:46:05.300 | [typing]
00:46:07.300 | [typing]
00:46:09.300 | [typing]
00:46:11.300 | which is our feed forward
00:46:13.300 | and then we have a dropout
00:46:15.300 | which is a floating point
00:46:17.300 | and then we define
00:46:19.300 | [typing]
00:46:21.300 | [typing]
00:46:23.300 | [typing]
00:46:25.300 | [typing]
00:46:27.300 | [typing]
00:46:29.300 | [typing]
00:46:31.300 | [typing]
00:46:33.300 | and then we define the two residual
00:46:35.300 | connections
00:46:37.300 | [typing]
00:46:39.300 | [typing]
00:46:41.300 | we use the module list
00:46:43.300 | which is a way to organize
00:46:45.300 | a list of modules
00:46:47.300 | in this case we need two of them
00:46:49.300 | [typing]
00:46:51.300 | [typing]
00:46:53.300 | [typing]
00:46:55.300 | [typing]
00:46:57.300 | [typing]
00:46:59.300 | [typing]
00:47:01.300 | [typing]
00:47:03.300 | [typing]
00:47:05.300 | [typing]
00:47:07.300 | okay let's define
00:47:09.300 | the forward method
00:47:11.300 | [typing]
00:47:13.300 | [typing]
00:47:15.300 | I define the source
00:47:17.300 | mask, what is the source mask?
00:47:19.300 | it's the mask that we want to apply to the
00:47:21.300 | input of the encoder, and why do we
00:47:23.300 | need a mask for the input of the encoder?
00:47:25.300 | because we want to hide
00:47:27.300 | the interaction of the padding word
00:47:29.300 | with other words, we don't want the padding
00:47:31.300 | word to interact with other words
00:47:33.300 | so we apply the mask
00:47:35.300 | [typing]
00:47:37.300 | [typing]
00:47:39.300 | and let's do the
00:47:41.300 | first residual connection
00:47:43.300 | let's go back to check the video actually
00:47:45.300 | to check the slide so we can understand what we
00:47:47.300 | are doing now
00:47:49.300 | so the first skip connection is
00:47:51.300 | this X here
00:47:53.300 | is going to
00:47:55.300 | here, but before it's
00:47:57.300 | added in the add & norm,
00:47:59.300 | we first need to apply
00:48:01.300 | the multi-head attention, so we take this X
00:48:03.300 | we send it to the multi-head attention
00:48:05.300 | and at the same time we also send it here
00:48:07.300 | and then we combine the two
00:48:09.300 | [typing]
00:48:11.300 | [typing]
00:48:13.300 | so the first skip connection is between
00:48:15.300 | X and the other
00:48:17.300 | X coming from the self-attention
00:48:21.300 | this is the function
00:48:23.300 | so I will define the sub-layer
00:48:25.300 | using a lambda, so this basically
00:48:27.300 | means first apply the self-attention
00:48:29.300 | self-attention
00:48:31.300 | in which we give the query key
00:48:33.300 | and the value is our X
00:48:35.300 | so our input, so this is why it's called
00:48:37.300 | self-attention, because the role of the query
00:48:39.300 | key and the value is X
00:48:41.300 | itself, so the input itself, so
00:48:43.300 | it's the sentence that is
00:48:45.300 | watching itself, so each
00:48:47.300 | word of one sentence is
00:48:49.300 | interacting with other words of the
00:48:51.300 | same sentence, we will see that
00:48:53.300 | in the decoder it's different because we have
00:48:55.300 | the cross-attention, so
00:48:57.300 | the queries coming from
00:48:59.300 | the decoder are watching
00:49:03.300 | the keys and the values
00:49:05.300 | coming from the encoder
00:49:07.300 | we give it the
00:49:09.300 | source mask, so
00:49:11.300 | what is this, basically we are calling
00:49:13.300 | this function, the forward
00:49:15.300 | function of the multi-head
00:49:17.300 | attention block, so we give query
00:49:19.300 | key value and the mask
00:49:21.300 | this will be combined
00:49:23.300 | with this by using
00:49:25.300 | the residual connection
00:49:29.300 | again we do the second one, the second
00:49:31.300 | one is the feed forward
00:49:33.300 | [typing]
00:49:35.300 | [typing]
00:49:37.300 | [typing]
00:49:39.300 | we don't need lambda here actually
00:49:41.300 | [typing]
00:49:43.300 | [typing]
00:49:45.300 | and then we return X
00:49:47.300 | so this means
00:49:49.300 | combine the feed forward
00:49:51.300 | and then the
00:49:53.300 | X itself, so the output
00:49:55.300 | of the previous layer, which is this one
00:49:57.300 | and then
00:49:59.300 | apply the residual connection
00:50:01.300 | this defines our encoder block
00:50:03.300 | now we can define the
00:50:05.300 | encoder object, so because the
00:50:07.300 | encoder is made up of many encoder
00:50:09.300 | blocks, we can have up to N of them
00:50:11.300 | according to the paper, so
00:50:13.300 | let's define the encoder
00:50:15.300 | [typing]
00:50:17.300 | [typing]
00:50:19.300 | [typing]
00:50:21.300 | [typing]
00:50:23.300 | how many layers we will have, we will
00:50:25.300 | have N, so we have many layers
00:50:27.300 | and they are applied
00:50:29.300 | one after another, so this is a
00:50:31.300 | module list
00:50:33.300 | [typing]
00:50:35.300 | and at the end we will apply a layer
00:50:37.300 | normalization
00:50:39.300 | [typing]
00:50:41.300 | [typing]
00:50:43.300 | [typing]
00:50:45.300 | [typing]
00:50:47.300 | [typing]
00:50:49.300 | [typing]
00:50:51.300 | so we apply one layer after another
00:50:53.300 | [typing]
00:50:55.300 | [typing]
00:50:57.300 | [typing]
00:50:59.300 | the output of the
00:51:01.300 | previous layer becomes the input for the
00:51:03.300 | next layer, here I forgot something
00:51:05.300 | [typing]
00:51:07.300 | and finally we apply the
00:51:09.300 | normalization
00:51:11.300 | and this concludes our
00:51:13.300 | journey around the encoder
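A sketch of the encoder block and the encoder, assuming the MultiHeadAttentionBlock, FeedForwardBlock, ResidualConnection and LayerNormalization classes sketched earlier:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, self_attention_block, feed_forward_block, dropout: float):
        super().__init__()
        self.self_attention_block = self_attention_block   # a MultiHeadAttentionBlock
        self.feed_forward_block = feed_forward_block       # a FeedForwardBlock
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        # Self-attention: the same x plays the role of query, key and value
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x

class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers                # N encoder blocks
        self.norm = LayerNormalization()    # final normalization

    def forward(self, x, mask):
        # The output of each block becomes the input of the next one
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
```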
00:51:15.300 | let's go have a brief overview
00:51:17.300 | of what we have done
00:51:19.300 | we have taken the inputs, sent them to the...
00:51:21.300 | ok, actually we didn't
00:51:23.300 | combine all the blocks together for now
00:51:25.300 | we just built this big block here
00:51:27.300 | called encoder
00:51:29.300 | which contains two smaller
00:51:31.300 | blocks
00:51:33.300 | that are the skip connection, the skip connection
00:51:35.300 | first one is between the multihead
00:51:37.300 | attention and this X that is sent
00:51:39.300 | here, the second one is between this
00:51:41.300 | feedforward and this X that is sent
00:51:43.300 | here, we have N of
00:51:45.300 | these blocks one after another
00:51:47.300 | the output of the last will be sent
00:51:49.300 | to the decoder before
00:51:51.300 | but before we apply the normalization
00:51:53.300 | now we
00:51:55.300 | built the
00:51:57.300 | decoder part, now in the
00:51:59.300 | decoder the output embeddings
00:52:01.300 | are the same as the input embeddings
00:52:03.300 | I mean
00:52:05.300 | the class that we need to define
00:52:07.300 | is the same, so we will just initialize it
00:52:09.300 | twice and the same goes for
00:52:11.300 | the positional encodings, we can use the same
00:52:13.300 | values that we use for the
00:52:15.300 | encoder, also for the decoder
00:52:17.300 | what we need to define
00:52:19.300 | is this big block here
00:52:21.300 | which is made of masked multihead
00:52:23.300 | attention, add a norm, so one
00:52:25.300 | skip connection here, another
00:52:27.300 | multihead attention with another skip connection
00:52:29.300 | and the feedforward with the
00:52:31.300 | skip connection here, the way we define
00:52:33.300 | the multihead attention class actually
00:52:35.300 | already takes into consideration
00:52:37.300 | the masks, so we don't need to reinvent
00:52:39.300 | the wheel, also for the decoder
00:52:41.300 | we can just define the
00:52:43.300 | decoder block which is this big block
00:52:45.300 | here made of three sublayers
00:52:47.300 | and then we build
00:52:49.300 | the decoder using this
00:52:51.300 | n number of these
00:52:53.300 | decoder blocks, so
00:52:55.300 | let's do it
00:52:57.300 | let's define
00:52:59.300 | first the decoder block
00:53:01.300 | in the decoder we have
00:53:13.300 | the self attention
00:53:15.300 | which is, let's go back
00:53:17.300 | this is a self attention because
00:53:19.300 | we have this input that is used
00:53:21.300 | three times in the masked multihead
00:53:23.300 | attention, so this is called self
00:53:25.300 | attention because the same input plays the
00:53:27.300 | role of the query, the key and the values
00:53:29.300 | which means that, in the same sentence,
00:53:31.300 | each word in the sentence is
00:53:33.300 | matched with each other word in the
00:53:35.300 | same sentence, but in
00:53:37.300 | this part here we will have
00:53:39.300 | an attention calculated
00:53:41.300 | using the query
00:53:43.300 | coming from the decoder
00:53:45.300 | while the key and the values will come
00:53:47.300 | from the encoder
00:53:49.300 | so this is not a self
00:53:51.300 | attention, this is called cross attention
00:53:53.300 | because we are crossing
00:53:55.300 | two kind of different
00:53:57.300 | objects together and matching them
00:53:59.300 | somehow to calculate the relationship
00:54:01.300 | between them, ok let's define
00:54:13.300 | this is the cross
00:54:15.300 | attention block which is basically
00:54:17.300 | the multihead attention but we will give it
00:54:19.300 | the different
00:54:21.300 | parameters
00:54:23.300 | this is our feedforward
00:54:27.300 | and then we have a dropout
00:54:29.300 | dropout
00:54:31.300 | ok, we defined
00:54:55.300 | also the residual connection, in this case we have
00:54:57.300 | three of them
00:54:59.300 | wonderful, ok let's build
00:55:19.300 | the forward method which is very similar to the
00:55:21.300 | encoder with a slight difference that I will
00:55:23.300 | highlight
00:55:25.300 | we need
00:55:27.300 | x, what is x?
00:55:29.300 | it's the input of the decoder
00:55:31.300 | but we also need the
00:55:33.300 | output of the encoder
00:55:35.300 | we need the source mask which is
00:55:37.300 | the mask applied to the encoder
00:55:39.300 | and the target mask which is the
00:55:41.300 | mask applied to the decoder
00:55:43.300 | why they are called source mask and target
00:55:45.300 | mask? because in this particular
00:55:47.300 | case we are dealing with a translation task
00:55:49.300 | so we have a source language, in this case
00:55:51.300 | it's english and we have a target
00:55:53.300 | language which in our case is
00:55:55.300 | italian, so
00:55:57.300 | you can call it encoder mask
00:55:59.300 | or decoder mask but basically we have
00:56:01.300 | two masks, one is the one coming from the encoder
00:56:03.300 | one is the one coming from the decoder
00:56:05.300 | so in our case we will call
00:56:07.300 | it source, so the source mask
00:56:09.300 | is the one coming from the encoder, so the
00:56:11.300 | source language and the target mask is the
00:56:13.300 | one coming from the decoder, so the
00:56:15.300 | target language
00:56:17.300 | [typing]
00:56:23.300 | and just like before we calculate the self-attention
00:56:25.300 | first, which is the first
00:56:27.300 | part of the decoder block
00:56:29.300 | [typing]
00:56:33.300 | in which the query, the key and the values
00:56:35.300 | are the same input
00:56:37.300 | but with the mask of the decoder
00:56:39.300 | because this is the self-attention block of the
00:56:41.300 | decoder
00:56:43.300 | [typing]
00:56:45.300 | and then we need to
00:56:47.300 | combine, we need to calculate the cross-attention
00:56:49.300 | which is our second residual connection
00:56:51.300 | [typing]
00:57:01.300 | we give it
00:57:03.300 | ok, in this case we are giving the
00:57:05.300 | query coming from the decoder
00:57:07.300 | so the x, the
00:57:09.300 | key and the values coming from the
00:57:11.300 | encoder
00:57:13.300 | [typing]
00:57:15.300 | and the mask of the encoder
00:57:17.300 | [typing]
00:57:27.300 | and finally the feedforward block
00:57:29.300 | just like before
00:57:31.300 | [typing]
00:57:33.300 | and that's it, we have all the ingredients
00:57:35.300 | actually to build the decoder now
00:57:37.300 | which is just n times
00:57:39.300 | this block one after another
00:57:41.300 | just like we did for the encoder
00:57:43.300 | [typing]
00:57:53.300 | also in this case we will provide with many layers
00:57:55.300 | so layers
00:57:57.300 | this is just a model list
00:57:59.300 | and we will also have a
00:58:01.300 | normalization at the end
00:58:03.300 | [typing]
00:58:27.300 | just like we did before, we apply
00:58:29.300 | the input to one layer
00:58:31.300 | and then we use the output
00:58:33.300 | of the previous layer and give it as
00:58:35.300 | an input of the next layer
00:58:37.300 | [typing]
00:58:45.300 | each layer is a decoder block
00:58:47.300 | so we need to give it x
00:58:49.300 | we need to give it the encoder
00:58:51.300 | output, then the source
00:58:53.300 | mask and the target mask
00:58:55.300 | so each of them is
00:58:57.300 | this, we are calling the forward method
00:58:59.300 | here, so nothing different
00:59:01.300 | [typing]
00:59:05.300 | and finally we apply the normalization
00:59:07.300 | and this is
00:59:09.300 | our decoder
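A sketch of the decoder block (self-attention, cross-attention and feed forward, each wrapped in a residual connection) and of the decoder, again assuming the classes sketched earlier:

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, self_attention_block, cross_attention_block, feed_forward_block, dropout: float):
        super().__init__()
        self.self_attention_block = self_attention_block     # masked self-attention
        self.cross_attention_block = cross_attention_block   # queries from the decoder, keys/values from the encoder
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x

class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers                # N decoder blocks
        self.norm = LayerNormalization()    # final normalization

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)
```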
00:59:11.300 | there is one last ingredient
00:59:13.300 | we need to have
00:59:15.300 | what is a full
00:59:17.300 | transformer, so let's have a look at it
00:59:19.300 | the last
00:59:21.300 | ingredient we need is this layer here
00:59:23.300 | the linear layer
00:59:25.300 | as you remember from my slides
00:59:27.300 | the output of the
00:59:29.300 | multi-head attention is something
00:59:31.300 | that is sequence length by D-model
00:59:33.300 | so here, we expect
00:59:35.300 | to have the output to be
00:59:37.300 | sequence length by D-model
00:59:39.300 | if we don't consider the batch dimension
00:59:41.300 | however, we want to map these words
00:59:43.300 | back into the vocabulary
00:59:45.300 | so that's why we need this linear layer
00:59:47.300 | which will convert the embedding
00:59:49.300 | into a position of the vocabulary
00:59:51.300 | I will
00:59:53.300 | call this layer, call the projection
00:59:55.300 | layer, because it's projecting the
00:59:57.300 | embedding into the vocabulary, let's go
00:59:59.300 | build it
01:00:01.300 | [typing]
01:00:09.300 | what we need for this layer
01:00:11.300 | is the D-model, so the D-model
01:00:13.300 | which is an integer
01:00:15.300 | and the vocabulary size
01:00:17.300 | [typing]
01:00:19.300 | this is basically
01:00:21.300 | a linear layer that is converting
01:00:23.300 | from D-model to vocabulary size
01:00:25.300 | so .projectionlayer is
01:00:27.300 | [typing]
01:00:37.300 | let's define the forward method
01:00:39.300 | [typing]
01:00:41.300 | ok, what we want to do
01:00:43.300 | let me write this little comment
01:00:45.300 | we want to batch
01:00:47.300 | sequence length to D-model
01:00:49.300 | converted into
01:00:51.300 | batch sequence length
01:00:53.300 | vocabulary size
01:00:55.300 | [typing]
01:00:57.300 | and in this case
01:00:59.300 | we will also already apply the softmax
01:01:01.300 | and actually we will apply the log
01:01:03.300 | softmax for numerical stability
01:01:05.300 | like I showed before
01:01:07.300 | [typing]
01:01:17.300 | to the last dimension
01:01:19.300 | [typing]
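As a recap, a small sketch of this projection layer (attribute names are illustrative):

import torch
import torch.nn as nn

class ProjectionLayer(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        # Maps (batch, seq_len, d_model) -> (batch, seq_len, vocab_size).
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Log-softmax on the last dimension for numerical stability.
        return torch.log_softmax(self.proj(x), dim=-1)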
01:01:21.300 | and that's it, this is our
01:01:23.300 | projection layer, now we have
01:01:25.300 | all the ingredients we need
01:01:27.300 | for the transformer, so let's define
01:01:29.300 | our transformer block
01:01:31.300 | [typing]
01:01:43.300 | in a transformer we have
01:01:45.300 | an encoder
01:01:47.300 | [typing]
01:01:49.300 | which is our encoder, we have a decoder
01:01:51.300 | which is our decoder
01:01:53.300 | we have a source embedding
01:01:55.300 | why we need a source embedding
01:01:57.300 | and a target embedding, because we are dealing with
01:01:59.300 | multiple languages, so we have one input
01:02:01.300 | embedding for the source language
01:02:03.300 | and one input embedding for the target
01:02:05.300 | language
01:02:07.300 | [typing]
01:02:13.300 | and we have the target embedding
01:02:15.300 | [typing]
01:02:19.300 | then we have the source position
01:02:21.300 | and the target position
01:02:23.300 | [typing]
01:02:30.300 | which will be the same actually
01:02:32.300 | and then we have the projection layer
01:02:34.300 | [typing]
01:02:43.300 | we just save this
01:02:45.300 | [typing]
01:03:07.300 | [typing]
01:03:17.300 | now we define
01:03:19.300 | three methods, one to encode
01:03:21.300 | one to decode and one to project
01:03:23.300 | we will apply them in succession
01:03:25.300 | why we don't
01:03:27.300 | just build one forward method
01:03:29.300 | because as we will see
01:03:31.300 | during inferencing we can reuse
01:03:33.300 | the output of the encoder, we don't need to
01:03:35.300 | calculate it every time
01:03:37.300 | and also we prefer
01:03:39.300 | to keep these
01:03:41.300 | outputs separate also for
01:03:43.300 | visualizing the attention
01:03:45.300 | [typing]
01:03:49.300 | so for the encoder we have
01:03:51.300 | the source, so the source sentence,
01:03:53.300 | because we have the source language
01:03:55.300 | and the source mask
01:03:57.300 | so what we do is
01:03:59.300 | we apply first the embedding
01:04:01.300 | [typing]
01:04:07.300 | then we apply the positional encoding
01:04:09.300 | [typing]
01:04:13.300 | and finally we apply the encoder
01:04:15.300 | [typing]
01:04:20.300 | then we define the decode method
01:04:22.300 | [typing]
01:04:25.300 | which takes the encoder output
01:04:27.300 | which is the tensor
01:04:29.300 | [typing]
01:04:31.300 | the source mask which is the tensor
01:04:33.300 | the target
01:04:35.300 | and the target mask
01:04:37.300 | [typing]
01:04:43.300 | [typing]
01:04:45.300 | and what we do is target
01:04:47.300 | we first apply the target embedding
01:04:49.300 | to the target sentence
01:04:51.300 | [typing]
01:04:55.300 | then we apply the positional encoding
01:04:57.300 | to the target sentence
01:04:59.300 | [typing]
01:05:03.300 | and finally we decode
01:05:05.300 | [typing]
01:05:15.300 | this is basically
01:05:17.300 | the forward method
01:05:19.300 | of this decoder
01:05:21.300 | so we have the same order
01:05:23.300 | of parameters
01:05:27.300 | finally we define the project method
01:05:29.300 | [typing]
01:05:31.300 | in which we just apply
01:05:33.300 | the projection so we take from the embedding
01:05:35.300 | to the vocabulary size
01:05:37.300 | [typing]
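Putting the three methods together, a sketch of what the Transformer class might look like (attribute names are illustrative):

import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer) -> None:
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer

    def encode(self, src, src_mask):
        src = self.src_embed(src)            # source embedding
        src = self.src_pos(src)              # positional encoding
        return self.encoder(src, src_mask)

    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        tgt = self.tgt_embed(tgt)            # target embedding
        tgt = self.tgt_pos(tgt)              # positional encoding
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)

    def project(self, x):
        return self.projection_layer(x)      # from d_model to the vocabulary

Keeping encode, decode and project separate lets us reuse the encoder output during inference and inspect the attention later, as explained above.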
01:05:45.300 | ok, this is also
01:05:47.300 | this is the last block
01:05:49.300 | we had to build
01:05:51.300 | but we didn't make a method
01:05:53.300 | to combine all these blocks
01:05:55.300 | together, so we built many blocks
01:05:57.300 | we need one that given the hyperparameters
01:05:59.300 | of the transformer
01:06:01.300 | builds for us one single transformer
01:06:03.300 | initializing
01:06:05.300 | all the encoder, decoder, the embeddings
01:06:07.300 | etc. so let's build this
01:06:09.300 | function, let's call it
01:06:11.300 | buildTransformer
01:06:13.300 | that given all the hyperparameters
01:06:15.300 | will build the transformer for us
01:06:17.300 | and also initialize the parameters
01:06:19.300 | with some initial values
01:06:21.300 | [typing]
01:06:23.300 | what we need
01:06:25.300 | to define a transformer, for sure
01:06:27.300 | in this case we are talking about translation
01:06:29.300 | ok, this model that we are building
01:06:31.300 | we will be using for translation
01:06:33.300 | but you can use it for any task
01:06:35.300 | so the names I'm using are basically
01:06:37.300 | the ones used in the translation task
01:06:39.300 | later you can change the naming
01:06:41.300 | but the structure is the same
01:06:43.300 | so you can use it for any other task
01:06:45.300 | for which the transformer is applicable
01:06:47.300 | so the first thing we need
01:06:49.300 | is the vocabulary size of the source
01:06:51.300 | and the target
01:06:53.300 | because we need to build the
01:06:55.300 | embedding
01:06:57.300 | because the embedding need to convert
01:06:59.300 | from the token
01:07:01.300 | of the vocabulary into a vector
01:07:03.300 | of size 512
01:07:05.300 | so it needs to know how big
01:07:07.300 | is the vocabulary, so how many vectors
01:07:09.300 | it needs to create
01:07:11.300 | [typing]
01:07:13.300 | then the target
01:07:15.300 | [typing]
01:07:17.300 | which is also an integer
01:07:19.300 | then we need to tell him
01:07:21.300 | what is the source sequence length and the target sequence length
01:07:23.300 | [typing]
01:07:25.300 | [typing]
01:07:27.300 | [typing]
01:07:29.300 | [typing]
01:07:31.300 | this is very important
01:07:33.300 | they could also be the same
01:07:35.300 | in our case it will be the same
01:07:37.300 | but they can also be different
01:07:39.300 | for example
01:07:41.300 | in case you are using the transformer
01:07:43.300 | that is dealing with two very different languages
01:07:45.300 | for example for translation
01:07:47.300 | in which the tokens needed
01:07:49.300 | for the source languages
01:07:51.300 | are much higher or much lower than the other ones
01:07:53.300 | so you don't need to keep the same length
01:07:55.300 | you can use different lengths
01:07:57.300 | the next hyperparameter is the
01:07:59.300 | dmodel
01:08:01.300 | [typing]
01:08:03.300 | which we initialize with 512
01:08:05.300 | because we want to keep the same values as the paper
01:08:07.300 | then we define the hyperparameter
01:08:09.300 | n which is the number of layers
01:08:11.300 | so the number of encoder blocks
01:08:13.300 | that we will be using
01:08:15.300 | is 6 according to the paper
01:08:19.300 | then we define the hyperparameter h
01:08:21.300 | which is the number of heads we want
01:08:23.300 | and according to the paper it is 8
01:08:27.300 | dropout is 0.1
01:08:31.300 | [typing]
01:08:33.300 | and finally we have the hidden layer
01:08:35.300 | dff of the
01:08:37.300 | feedforward layer which is
01:08:39.300 | 2048 as we saw before on the paper
01:08:41.300 | [typing]
01:08:43.300 | and this builds
01:08:45.300 | a transformer
01:08:49.300 | so first we do is we create
01:08:51.300 | the embedding layers
01:08:53.300 | so source
01:08:55.300 | embedding
01:08:57.300 | [typing]
01:08:59.300 | [typing]
01:09:01.300 | [typing]
01:09:03.300 | [typing]
01:09:05.300 | [typing]
01:09:07.300 | then the target embedding
01:09:09.300 | [typing]
01:09:11.300 | [typing]
01:09:13.300 | [typing]
01:09:15.300 | [typing]
01:09:17.300 | then we create the positional encoding
01:09:19.300 | layers
01:09:21.300 | [typing]
01:09:23.300 | [typing]
01:09:25.300 | [typing]
01:09:27.300 | [typing]
01:09:29.300 | [typing]
01:09:31.300 | we don't need to create two positional
01:09:33.300 | encoding layers because actually they do
01:09:35.300 | the same job and they also
01:09:37.300 | don't add any parameter but because
01:09:39.300 | they have the dropout and also because
01:09:41.300 | I want to make it
01:09:43.300 | verbose so you can understand each
01:09:45.300 | part without making
01:09:47.300 | any optimization I think actually
01:09:49.300 | it's fine because this is for
01:09:51.300 | educational purpose so I don't want to
01:09:53.300 | optimize the code I want to make it as much
01:09:55.300 | comprehensible as possible
01:09:57.300 | so I do every part I need
01:09:59.300 | I don't take shortcuts
01:10:01.300 | [typing]
01:10:03.300 | [typing]
01:10:05.300 | [typing]
01:10:07.300 | [typing]
01:10:09.300 | [typing]
01:10:11.300 | [typing]
01:10:13.300 | [typing]
01:10:15.300 | and then we create the encoder blocks
01:10:17.300 | we have n of them so let's define
01:10:19.300 | [typing]
01:10:21.300 | [typing]
01:10:23.300 | [typing]
01:10:25.300 | let's create an empty array
01:10:29.300 | we have n of them
01:10:31.300 | so each encoder block has
01:10:33.300 | a self-attention
01:10:35.300 | so encoder self-attention
01:10:37.300 | [typing]
01:10:39.300 | which is a multi-head
01:10:41.300 | attention block, the multi-head attention
01:10:43.300 | requires the d_model
01:10:45.300 | the h
01:10:47.300 | and the dropout value
01:10:49.300 | then we have a
01:10:51.300 | feed-forward block
01:10:53.300 | [typing]
01:10:55.300 | [typing]
01:10:57.300 | [typing]
01:10:59.300 | [typing]
01:11:01.300 | as you can see also
01:11:03.300 | the names I'm using are quite long
01:11:05.300 | mostly because I want to make it as
01:11:07.300 | comprehensible as possible for everyone
01:11:09.300 | [typing]
01:11:11.300 | [typing]
01:11:13.300 | so each encoder block is made of
01:11:15.300 | a self-attention
01:11:17.300 | [typing]
01:11:19.300 | and a feed-forward
01:11:21.300 | [typing]
01:11:23.300 | and finally we tell him how much is the dropout
01:11:25.300 | [typing]
01:11:27.300 | [typing]
01:11:29.300 | Finally we add this encoder block.
01:11:37.380 | And then we can create the decoder blocks.
01:12:01.860 | We also have the cross attention for the decoder block.
01:12:16.880 | We also have the feedforward, just like the encoder.
01:12:34.280 | Then we define the decoder block itself, which is decoder block, cross attention and finally
01:12:48.400 | the feedforward and the dropout.
01:12:55.960 | And finally we save it in its array.
01:13:06.600 | We now can create the encoder and the decoder.
01:13:24.760 | We give him all his blocks, which are n and then also the decoder.
01:13:37.560 | And we create the projection layer, which will convert d_model into the vocabulary size.
01:13:49.920 | Which vocabulary?
01:13:50.920 | Of course the target, because we want to take from the source language to the target language.
01:13:55.460 | So we want to project our output into the target vocabulary.
01:14:02.800 | And then we build the transformer.
01:14:12.560 | What does it need?
01:14:13.560 | An encoder, a decoder, source embedding, target embedding, then source positional encoding,
01:14:29.520 | target positional encoding, and finally the projection layer.
01:14:38.640 | And that's it.
01:14:39.640 | Now we can just initialize the parameters using the Xavier uniform.
01:14:45.640 | This is a way to initialize the parameters to make the training faster so they don't
01:14:50.040 | just start with random values.
01:14:54.400 | And there are many algorithms to do it.
01:14:56.920 | I saw many implementations using Xavier, so I think it's a quite good start for the model
01:15:01.400 | to learn from.
01:15:18.920 | Finally return our beloved transformer.
01:15:22.800 | And this is it.
01:15:23.800 | This is how you build the model.
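Here is a hedged sketch of the whole buildTransformer function just described, assuming the class names used earlier in the video (InputEmbeddings, PositionalEncoding, MultiHeadAttentionBlock, FeedForwardBlock, EncoderBlock, DecoderBlock, Encoder, Decoder, ProjectionLayer, Transformer):

import torch.nn as nn

def build_transformer(src_vocab_size: int, tgt_vocab_size: int,
                      src_seq_len: int, tgt_seq_len: int,
                      d_model: int = 512, N: int = 6, h: int = 8,
                      dropout: float = 0.1, d_ff: int = 2048) -> Transformer:
    # Embedding and positional encoding layers, one per language.
    src_embed = InputEmbeddings(d_model, src_vocab_size)
    tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)
    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)

    # N encoder blocks, each with self-attention and feed-forward.
    encoder_blocks = []
    for _ in range(N):
        self_attention = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward = FeedForwardBlock(d_model, d_ff, dropout)
        encoder_blocks.append(EncoderBlock(self_attention, feed_forward, dropout))

    # N decoder blocks, each with self-attention, cross-attention and feed-forward.
    decoder_blocks = []
    for _ in range(N):
        self_attention = MultiHeadAttentionBlock(d_model, h, dropout)
        cross_attention = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward = FeedForwardBlock(d_model, d_ff, dropout)
        decoder_blocks.append(DecoderBlock(self_attention, cross_attention, feed_forward, dropout))

    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))

    # Project into the target vocabulary, since we translate source -> target.
    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)

    transformer = Transformer(encoder, decoder, src_embed, tgt_embed,
                              src_pos, tgt_pos, projection_layer)

    # Xavier-uniform initialization so the weights do not start from purely random values.
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    return transformer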
01:15:26.440 | And now that we have built the model, we will go further to use it.
01:15:30.040 | So we will first have a look at the dataset, then we will build the training loop.
01:15:37.800 | After the training loop, we will also build the inferencing part and the code for visualizing
01:15:44.480 | the attention.
01:15:46.240 | So hold on and take some coffee, take some tea, because it's going to be a little long,
01:15:53.080 | but it's going to be worth it.
01:15:54.960 | Now that we have built the code for the model, our next step is to build the training code.
01:16:01.060 | But before we do that, let's recheck the code, because we may have some typos.
01:16:07.400 | I actually already made this check.
01:16:09.640 | And there are a few mistakes in the code.
01:16:13.120 | I compared the old with the new one.
01:16:16.000 | It is very minor problems.
01:16:18.280 | So we misspelled "feedForward" here.
01:16:23.920 | And so the same problem is also present in every reference to "feedForward".
01:16:29.520 | And also here, when we are building the decoder block.
01:16:34.360 | And the other problem is that here, when we build the decoder block, we just wrote "nn.Module".
01:16:39.040 | Instead, it should be "nn.ModuleList".
01:16:42.680 | And then the "feedForward" should be also fixed here and here in the buildTransformer
01:16:47.400 | method.
01:16:49.000 | Now, I can delete the old one, so we don't need it anymore.
01:16:54.200 | Let me check the model.
01:16:55.200 | It's the correct one, with "feedForward".
01:16:59.440 | Yes, okay.
01:17:02.480 | Our next step is to build the training code.
01:17:04.840 | But before we build the training code, we have to look at the data.
01:17:08.240 | What kind of data are we going to work with?
01:17:10.780 | So as I said before, we are dealing with a translation task.
01:17:14.120 | And I have chosen this dataset called "opus_books", which we can find on HuggingFace.
01:17:19.620 | And we will also use the library from HuggingFace to download this dataset for us.
01:17:24.280 | And this is the only library we will be using beside PyTorch.
01:17:28.200 | Because of course we cannot reinvent the dataset by ourselves, so we will use this dataset.
01:17:33.880 | And we will also use the HuggingFace tokenizer library to transform this text into vocabulary.
01:17:42.200 | Because our goal is to build the transformer, so not to reinvent the wheel about everything.
01:17:48.040 | So we will be only focusing on building and training the transformer.
01:17:52.800 | And in my particular case, I will be using the subset "English to Italian", but we will
01:17:57.440 | build the code in such a way that you can choose the language and the code will act
01:18:02.680 | accordingly.
01:18:03.680 | If we look at the data, we can see that each data item is a pair of sentences in English
01:18:11.140 | and in Italian.
01:18:12.480 | For example, there was no possibility of taking a walk that day, which in Italian means "In
01:18:17.040 | quel giorno era impossibile passeggiare".
01:18:20.000 | So we will train our transformer to translate from the source language, which is English,
01:18:26.640 | into the target language, which is Italian.
01:18:29.040 | So let's do it.
01:18:30.480 | We will do it step by step.
01:18:32.240 | So first we will make the code to download this dataset and to create the tokenizer.
01:18:37.940 | So what is a tokenizer?
01:18:40.080 | Let's go back to the slides to just have a brief overview of what we are going to do
01:18:44.440 | with this data.
01:18:45.440 | The tokenizer is what comes before the input embeddings.
01:18:49.300 | So we have an English sentence.
01:18:51.440 | So for example, "Your cat is a lovely cat", but this sentence will come from our dataset.
01:18:56.280 | The goal of the tokenizer is to create this token.
01:18:59.680 | So split this sentence into single words, which has many strategies.
01:19:04.480 | As you can see here, we have a sentence, which is "Your cat is a lovely cat".
01:19:09.180 | And the goal of the tokenizer is to split this sentence into single words, which can
01:19:14.500 | be done in many ways.
01:19:16.320 | There is the BPE tokenizer, there is the word-level tokenizer, there is the sub-word-level, word-part
01:19:22.220 | tokenizer.
01:19:23.220 | There are many tokenizers.
01:19:24.220 | The one we will be using is the simplest one called the word-level tokenizer.
01:19:27.980 | So the word-level tokenizer basically will split this sentence, let's say by space.
01:19:32.420 | So each space defines the boundary of a word, and so into the single words, and each word
01:19:39.380 | will be mapped to one number.
01:19:41.420 | So this is the job of the tokenizer, to build the vocabulary of these numbers and to map
01:19:47.540 | each word into a number.
01:19:51.100 | When we build the tokenizer, we can also create special tokens, which we will use for the
01:19:55.620 | transformer.
01:19:56.620 | For example, the tokens called padding, the token called the start-of-sentence, end-of-sentence,
01:20:02.540 | which are necessary for training the transformer.
01:20:05.580 | But we will do it step-by-step.
01:20:07.540 | So let's build first the code for building the tokenizer and to download the dataset.
01:20:13.980 | Let's create a new file.
01:20:15.100 | Let's call it train.py.
01:20:18.180 | Okay.
01:20:19.180 | Let's import our usual library.
01:20:22.220 | So torch, we will also import torch.nn.
01:20:27.380 | And we also, because we are using a library from HuggingFace, we also need to import these
01:20:33.780 | two libraries.
01:20:34.940 | We will be using the datasets library, which you can install using pip.
01:20:41.300 | So datasets, we will be using load dataset.
01:20:50.060 | And we will also be using the tokenizers library also from HuggingFace, which you can install
01:20:56.140 | with pip.
01:21:02.600 | We also need to choose which tokenizer we need, so we will use the word-level tokenizer.
01:21:19.800 | And there are also the trainers, so the class that will train the tokenizer.
01:21:27.520 | So that will create the vocabulary given the list of sentences.
01:21:40.800 | And we will split the word according to the white space.
01:21:44.800 | I will build one method by one.
01:21:47.460 | So I will build first the methods to create the tokenizer, and I will describe each parameter.
01:21:55.240 | For now, you will not have the bigger picture, but later when we combine all these methods
01:21:59.380 | together, you will have the bigger picture.
01:22:01.840 | So let's first make the method that builds the tokenizer.
01:22:05.500 | So we will call it getOrBuildTokenizer.
01:22:13.380 | And this method takes the configuration, which is the configuration of our model.
01:22:17.000 | We will define it later.
01:22:18.760 | The dataset and the language for which we are going to build the tokenizer.
01:22:25.920 | We define the tokenizer path, so the file where we will save this tokenizer.
01:22:32.200 | And we do it path of config.
01:22:41.400 | Okay, let me define some things.
01:22:46.800 | First of all, this path is coming from the pathlib, so from pathlib.
01:22:53.580 | This is a library that allows you to create absolute path given relative paths.
01:22:58.420 | And we pretend that we have a configuration called the tokenizer file, which is the path
01:23:03.420 | to the tokenizer file.
01:23:05.360 | And this path is formattable using the language.
01:23:08.280 | So for example, we can have something like this, for example, something like this.
01:23:23.360 | And this will be, given the language, it will create a tokenizer English or tokenizer Italian,
01:23:31.140 | for example.
01:23:32.140 | So if the tokenizer doesn't exist, we create it.
01:23:45.200 | I took all this code actually from HuggingFace.
01:23:48.220 | It's nothing complicated.
01:23:49.580 | I just taken their quick tour of their tokenizers library.
01:23:54.500 | And it's really easy to use it, and saves you a lot of time.
01:23:58.680 | Because building a tokenizer from scratch would really be reinventing the wheel.
01:24:11.060 | And we will also introduce the unknown word, unknown.
01:24:15.460 | So what does it mean?
01:24:16.800 | If our tokenizer sees a word that it doesn't recognize in its vocabulary, it will replace
01:24:21.560 | it with this word, unknown.
01:24:24.120 | It will map it to the number corresponding to this word, unknown.
01:24:31.580 | The pre-tokenizer means basically that we split by whitespace.
01:24:36.320 | And then we train, we build the trainer to train our tokenizer.
01:24:55.360 | Okay, this is the trainer.
01:25:20.860 | What does it mean?
01:25:21.860 | It means it will be a word-level trainer.
01:25:24.340 | So it will split words using the whitespace and using the single words.
01:25:28.980 | And it will also have four special tokens.
01:25:31.340 | One is unknown, which means that if you cannot find that particular word in the vocabulary,
01:25:37.860 | just replace it with unknown.
01:25:39.700 | It will also have the padding, which we will use to train the transformer, the start of
01:25:45.180 | sentence and the end of sentence special tokens.
01:25:47.820 | Min frequency means that for a word to appear in our vocabulary, it has to have
01:25:52.980 | a frequency of at least two.
01:25:56.380 | Now we can train the tokenizer.
01:26:04.020 | We use this method, which means we build first a method that gives all the sentences from
01:26:10.680 | our data set and we will build it later.
01:26:38.220 | Okay, so let's build also this method called getAllSentence so that we can iterate through
01:26:59.140 | the data set to get all the sentences corresponding to the particular language for which we are
01:27:04.640 | creating the tokenizer.
01:27:17.200 | As you remember, each item in the data set, it's a pair of sentences, one in English,
01:27:21.680 | one in Italian.
01:27:22.680 | We just want to extract one particular language.
01:27:32.000 | This is the item representing the pair.
01:27:35.400 | And from this pair, we extract only the one language that we want.
01:27:40.600 | And this is the code to build the tokenizer.
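A hedged sketch of the two methods just described, following the HuggingFace tokenizers quick tour (the config key names and the special token strings are assumptions):

from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

def get_all_sentences(ds, lang):
    # Each dataset item is a pair of sentences; yield only the requested language.
    for item in ds:
        yield item['translation'][lang]

def get_or_build_tokenizer(config, ds, lang):
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not Path.exists(tokenizer_path):
        # Word-level tokenizer that splits on whitespace, with the four special tokens.
        tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = WordLevelTrainer(special_tokens=['[UNK]', '[PAD]', '[SOS]', '[EOS]'],
                                   min_frequency=2)
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer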
01:27:43.520 | Now let's write the code to load the data set and then to build the tokenizer.
01:27:49.360 | We will call this method getDataset and which also takes the configuration of the model,
01:27:55.880 | which we will define later.
01:27:59.340 | So let's load the data set.
01:28:00.800 | We will call it dsRaw.
01:28:04.800 | Okay, HuggingFace allows us to download its data sets very easily.
01:28:12.680 | We just need to tell him what is the name of the data set.
01:28:17.920 | And then tell him what is the subset we want.
01:28:20.920 | We want the subset that is English to Italian, but we want to also make it configurable for
01:28:25.200 | you guys to change the language very fast.
01:28:27.160 | So let's build this subset dynamically.
01:28:41.040 | We will have two parameters in the configuration.
01:28:44.140 | One is called languageSource and one is called languageTarget.
01:28:57.040 | Later we can also define what split we want of this data set.
01:29:00.960 | In our case, there is only the training split in the original data set from HuggingFace,
01:29:07.640 | but we will split by ourself into the validation and the training data.
01:29:13.480 | So let's build the tokenizer.
01:29:27.080 | This is the raw data set and we also have the target.
01:29:46.240 | Okay, now, because we only have the training split from HuggingFace, we can split it by
01:29:52.760 | by ourself into a training and the validation.
01:29:55.520 | We keep 90% of the data for training and 10% for validation.
01:30:17.840 | Okay, let's build the tokenizer.
01:30:45.880 | The method randomSplit is a method from PyTorch that allows us to split a data set
01:30:59.280 | using the size that we give as input.
01:31:02.520 | So in this case, it means split this data set into this two smaller data set, one of
01:31:08.240 | this size and one of this size.
01:31:10.320 | But let's import the method from Torch.
01:31:23.280 | Let's also import the ones that we will need later, DataLoader and randomSplit.
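Based on the description so far, the beginning of this getDataset method might look like the sketch below (config key names are assumptions; the rest of the method, which wraps these splits into the bilingual dataset and the data loaders, is built in the following steps):

from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader, random_split

def get_ds(config):
    # Download the opus_books subset "<src>-<tgt>" (e.g. "en-it"); only a train split exists.
    ds_raw = load_dataset('opus_books',
                          f"{config['lang_src']}-{config['lang_tgt']}",
                          split='train')

    # Build (or load) one tokenizer per language.
    tokenizer_src = get_or_build_tokenizer(config, ds_raw, config['lang_src'])
    tokenizer_tgt = get_or_build_tokenizer(config, ds_raw, config['lang_tgt'])

    # Keep 90% of the data for training and 10% for validation.
    train_ds_size = int(0.9 * len(ds_raw))
    val_ds_size = len(ds_raw) - train_ds_size
    train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])
    return train_ds_raw, val_ds_raw, tokenizer_src, tokenizer_tgt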
01:31:36.660 | Now we need to create the data set.
01:31:39.100 | The data set that our model will use to access the tensors directly, because now we just
01:31:44.560 | created the tokenizer and we just loaded the data, but we need to create the tensors that
01:31:50.880 | our model will use.
01:31:52.400 | So let's create the data set.
01:31:54.080 | Let's call it bilingual data set and for that we create a new file.
01:32:07.360 | So here we import Torch and that's it.
01:32:31.440 | We will call the data set, we will call it bilingual data set.
01:32:41.620 | Okay as usual we define the constructor and in this constructor we need to give him the
01:32:49.920 | data set downloaded from HuggingFace, the tokenizer of the source language, the tokenizer
01:32:55.880 | of the target language, the source language, the name of the source language, the name
01:33:00.920 | of the target language and the sequence length that we will use.
01:33:14.360 | Okay we save all these values.
01:33:37.320 | We can also save the tokens, the particular tokens that we will use to create the tensors
01:33:45.520 | for the model.
01:33:46.840 | So we need the start of sentence, end of sentence and the padding token.
01:33:50.640 | So how do we convert the token start of sentence into a number, into the input ID?
01:33:57.840 | There is a special method of the tokenizer to do that, so let's do it.
01:34:02.920 | So this is the start of sentence token, we want to build it into a tensor.
01:34:10.680 | This tensor will contain only one number which is given by, we can use this tokenizer from
01:34:17.480 | the source or the target, it doesn't matter because they both contain these particular
01:34:20.960 | tokens.
01:34:24.640 | This is the method to convert the token into a number, so start of sentence and the type
01:34:33.540 | of this token, of this tensor is, we want it long because the vocabulary size can be more
01:34:46.600 | than what 32 bits can hold, so we usually use the 64-bit long type.
01:34:55.160 | And we do the same for the end of sentence and the padding token.
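A sketch of the constructor just described (the special token strings and attribute names are assumptions):

import torch
from torch.utils.data import Dataset

class BilingualDataset(Dataset):
    def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len):
        super().__init__()
        self.ds = ds
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        self.seq_len = seq_len

        # token_to_id maps a token string to its number in the vocabulary; either tokenizer
        # works here because both contain these special tokens. We store them as int64 ("long") ids.
        self.sos_token = torch.tensor([tokenizer_tgt.token_to_id('[SOS]')], dtype=torch.int64)
        self.eos_token = torch.tensor([tokenizer_tgt.token_to_id('[EOS]')], dtype=torch.int64)
        self.pad_token = torch.tensor([tokenizer_tgt.token_to_id('[PAD]')], dtype=torch.int64)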
01:35:23.100 | We also need to define the length method of this dataset, which tells the length of the
01:35:33.260 | dataset itself, so basically just the length of the dataset from hugging face, and then
01:35:40.180 | we need to define the get item method.
01:35:55.540 | First of all we will extract the original pair from the hugging face dataset, then we
01:36:09.580 | extract the source text and the target text.
01:36:38.500 | And finally we convert each text into tokens, and then into input IDs.
01:36:47.220 | What does it mean?
01:36:48.220 | We will first, the tokenizer will first split the sentence into single words, and then will
01:36:54.020 | map each word into its corresponding number in the vocabulary, and it will do it in one
01:36:58.700 | pass only, this is done by the encode method; .ids then gives us the input IDs, so the numbers
01:37:18.600 | corresponding to each word in the original sentence, and it will be given as an array.
01:37:34.220 | We did the same for the decoder.
01:37:36.260 | Now as you remember, we also need to pad the sentence to reach the sequence length.
01:37:43.420 | This is really important because we want our model to always work, I mean the model always
01:37:49.220 | works with a fixed length, sequence length, but we don't have enough words in every sentence,
01:37:54.960 | so we use the padding token, so this PAD here, as the padding token to fill the sentence
01:38:01.540 | until it reaches the sequence length.
01:38:04.020 | So we calculate how many padding tokens we need to add for the encoder side and for the
01:38:07.940 | decoder side, which is basically how many we need to reach the sequence length.
01:38:22.080 | Minus two, why minus two here?
01:38:24.260 | So we already have this amount of tokens, we need to reach this one, but we will add
01:38:28.780 | also the start of sentence token and the end of sentence token to the encoder side, so
01:38:35.200 | we also have minus two here.
01:38:49.380 | And here only minus one.
01:38:50.900 | If you remember my previous video, when we do the training, we add only the start of
01:38:57.420 | sentence token to the decoder side, and then in the label we only add the end of sentence
01:39:03.660 | token.
01:39:04.660 | So in this case we only need to add one token, special token to the sentence.
01:39:09.160 | We also make sure that this sequence length that we have chosen is enough to represent
01:39:15.500 | all the sentences in our dataset, and if we chose too small one, we want to raise an exception.
01:39:25.500 | So basically this number of padding tokens should never become negative.
01:39:42.340 | Okay, now let's build the two tensors for the encoder input and for the decoder input,
01:40:01.200 | but also for the label.
01:40:02.820 | So one sentence will be sent to the input of the encoder, one sentence will be sent
01:40:08.000 | to the input of the decoder, and one sentence is the one that we expect as the output of
01:40:14.900 | the decoder.
01:40:16.220 | And that output we will call label.
01:40:19.380 | Usually it's called target or label.
01:40:21.200 | I call it label.
01:40:28.360 | We can now concatenate the tensors for the encoder input: okay, we concatenate several tensors.
01:40:35.460 | First is the start of sentence token, then the tokens of the source text, then the end
01:40:56.660 | of sentence token, and then enough padding tokens to reach the sequence length.
01:41:05.260 | We already calculated how many padding tokens we need to add to this sentence, so let's
01:41:09.480 | just do it.
01:41:36.160 | And this is the encoder input, so let me write some comment here.
01:41:40.240 | This is: add SOS and EOS to the source text.
01:41:50.780 | Then we build the decoder input, which is also a concatenation of tokens.
01:42:01.680 | In this case we don't have the end of sentence, we just have the start of sentence.
01:42:26.640 | And finally we add enough padding tokens to reach the sequence length.
01:42:33.560 | We already calculated how many we need, just use this value now.
01:42:39.040 | And then we build the label.
01:43:00.120 | In the label we only add the end of sentence token.
01:43:16.400 | Because we need the same number of padding tokens as for the decoder input.
01:43:25.480 | Just for debugging, let's double check that we actually reach the sequence length.
01:43:37.160 | Ok, now that we have made this check, let me also write some comments here.
01:44:01.240 | Here we are only adding SOS to the decoder input.
01:44:09.680 | And here is add EOS to the label, what we expect as output from the decoder.
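Putting the steps above together, a hedged sketch of this getItem method (continuing the BilingualDataset sketch; field and variable names are assumptions):

    def __getitem__(self, idx):
        # Extract the original pair and the two texts.
        src_target_pair = self.ds[idx]
        src_text = src_target_pair['translation'][self.src_lang]
        tgt_text = src_target_pair['translation'][self.tgt_lang]

        # Split into words and map each word to its vocabulary id in one pass.
        enc_input_tokens = self.tokenizer_src.encode(src_text).ids
        dec_input_tokens = self.tokenizer_tgt.encode(tgt_text).ids

        # -2: we still add [SOS] and [EOS] to the encoder input.
        enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2
        # -1: the decoder input only gets [SOS]; the label gets [EOS].
        dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1
        if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
            raise ValueError('Sentence is too long for the chosen sequence length')

        # Add [SOS] and [EOS] to the source text, then pad up to seq_len.
        encoder_input = torch.cat([
            self.sos_token,
            torch.tensor(enc_input_tokens, dtype=torch.int64),
            self.eos_token,
            torch.tensor([self.pad_token] * enc_num_padding_tokens, dtype=torch.int64),
        ])

        # Add only [SOS] to the decoder input.
        decoder_input = torch.cat([
            self.sos_token,
            torch.tensor(dec_input_tokens, dtype=torch.int64),
            torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
        ])

        # Add only [EOS] to the label (what we expect as output from the decoder).
        label = torch.cat([
            torch.tensor(dec_input_tokens, dtype=torch.int64),
            self.eos_token,
            torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
        ])

        # Double check that all three really reach the sequence length.
        assert encoder_input.size(0) == self.seq_len
        assert decoder_input.size(0) == self.seq_len
        assert label.size(0) == self.seq_len
        # The returned dictionary (with the encoder and decoder masks) is assembled next.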
01:44:25.280 | Now we can return all these tensors so that our training can use them.
01:44:32.360 | We return a dictionary comprised of encoder input.
01:44:41.920 | What is the encoder input?
01:44:42.920 | It's basically of size sequence length.
01:44:46.880 | Then we have the decoder input, which is also just a sequence length number of tokens.
01:44:59.220 | I forgot a comma here.
01:45:06.480 | And then we have the encoder mask.
01:45:08.720 | So what is the encoder mask?
01:45:11.240 | As you remember, we are increasing the size of the encoder input sentence by adding padding
01:45:19.180 | tokens.
01:45:20.180 | But we don't want these padding tokens to participate in the self-attention.
01:45:24.820 | So what we need is to build a mask that says that we don't want these tokens to be seen
01:45:30.280 | by the self-attention mechanism.
01:45:32.700 | And so we build the mask for the encoder.
01:45:37.360 | How do we build this mask?
01:45:38.400 | We just say that all the tokens that are not padding are okay.
01:45:44.120 | All the tokens that are padding are not okay.
01:45:50.680 | We also unsqueeze to add the sequence dimension and also to add the batch dimension later.
01:46:01.160 | And we convert into integers.
01:46:03.160 | So this is 1, 1 sequence length, because this will be used in the self-attention mechanism.
01:46:12.200 | However, for the decoder, we need a special mask that is a causal mask, which means that
01:46:19.640 | each word can only look at the previous word and each word can only look at non-padding
01:46:27.240 | words.
01:46:28.240 | So we don't want, again, we don't want the padding tokens to participate in the self-attention.
01:46:32.560 | We only want real words to participate in this.
01:46:35.680 | And we also don't want each word to watch words that come after it, but only at
01:46:42.360 | words that come before it.
01:46:45.160 | So I will use a method here called causal mask that will build it.
01:46:49.480 | Later we will build it also.
01:46:50.600 | So now I just call it to show you how it's used, and then we will proceed to build it.
01:47:00.760 | So in this case, we don't want the padding tokens, and we add the necessary dimensions.
01:47:09.480 | And also we do a Boolean AND with causal mask, which is a method that we will build right after.
01:47:22.360 | And this causal mask needs to build a matrix of size sequence length to sequence length.
01:47:27.240 | What is sequence length is basically the size of our decoder input.
01:47:36.480 | And this, let me write a comment for you.
01:47:39.000 | So this is (1, 1, sequence length), combined with a Boolean AND with (1, sequence length,
01:47:50.640 | sequence length), and this can be broadcasted.
01:47:56.240 | Let's go define this method, causal mask.
01:47:59.280 | So what is causal mask?
01:48:02.480 | Causal mask basically means that we want, let's go back to the slides actually, as you
01:48:07.280 | remember from the slides, we want each word in the decoder to only watch words that come
01:48:12.800 | before it.
01:48:13.800 | So what we want is to make all these values above this diagonal that represents the multiplication,
01:48:20.360 | this matrix represents the multiplication of the queries by the keys in the self-attention
01:48:25.240 | mechanism.
01:48:26.480 | What we want is to hide all these values.
01:48:28.640 | So "your" cannot watch the words "cat is a lovely cat".
01:48:33.360 | It can only watch itself, but this word here, for example, this word lovely can watch everything
01:48:38.720 | that comes before it.
01:48:40.040 | So from your up to lovely itself, but not the word cat that comes after it.
01:48:45.540 | So what we do is we want all these values here to be masked out.
01:48:51.440 | So which also means that we want all the values above this diagonal to be masked out.
01:48:56.880 | And there is a very practical method in PyTorch to do it.
01:49:01.040 | So let's do it.
01:49:02.280 | Let's go build this method.
01:49:04.700 | So the mask is basically torch.triu, which means give me every value that is above the
01:49:13.420 | diagonal that I am telling you.
01:49:15.440 | So we want a matrix, which matrix, matrix made of all ones.
01:49:23.180 | And this method will return every value above the diagonal and everything else will become
01:49:29.520 | zero.
01:49:30.880 | So we want diagonal equal to one, and the type, we want it to be integer.
01:49:40.000 | And what we do is return mask is equal to zero.
01:49:43.460 | So this will return all the values above the diagonal and everything below the diagonal
01:49:49.200 | will become zero.
01:49:50.200 | But we want actually the opposite.
01:49:51.760 | So we say, okay, everything that is zero will become true with this expression and
01:49:56.040 | everything that is not zero will become false.
01:50:01.360 | So we apply it here to build this mask.
01:50:03.540 | So this mask will be one by sequence length by sequence length, which is exactly what
01:50:09.760 | we want.
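A sketch of this causalMask helper and of how the two masks just described might appear in the returned dictionary (the dictionary lines are shown as comments, since it is still being assembled):

import torch

def causal_mask(size):
    # Ones above the diagonal (diagonal=1), then "== 0" flips it: True on and below the
    # diagonal, so each position can only attend to itself and to earlier positions.
    mask = torch.triu(torch.ones((1, size, size)), diagonal=1).type(torch.int)
    return mask == 0

# Inside __getitem__, the two mask entries might look like this:
#   'encoder_mask': (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(),  # (1, 1, seq_len)
#   'decoder_mask': (decoder_input != self.pad_token).unsqueeze(0).int()
#                   & causal_mask(decoder_input.size(0)),                               # (1, seq_len, seq_len)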
01:50:11.320 | Okay, let's add also the label.
01:50:19.000 | The label is also, oh, I forgot the comma.
01:50:26.000 | Sequence length and then we have the source text just for visualization, we can send it
01:50:30.480 | source text and then the target text.
01:50:41.640 | And this is our data set.
01:50:44.380 | Now let's go back to our training method to continue writing the training loop.
01:50:50.620 | So now that we have the data set, we can create it.
01:50:55.120 | We can create two data sets, one for training, one for validation, and then we send it to
01:51:01.120 | a data loader and finally to our training loop.
01:51:11.020 | We forgot to import the data set.
01:51:14.480 | So let's import it here.
01:51:18.280 | We also import the causal mask, which we will need later.
01:51:43.480 | What is our source language? it's in the configuration.
01:51:50.040 | What is our target language?
01:51:56.520 | And what is our sequence length?
01:51:57.720 | It's also in the configuration.
01:52:01.800 | We do the same for the validation.
01:52:09.480 | But the only difference is that we use this one now and the rest is same.
01:52:15.960 | We also, just for choosing the max sequence length, we also want to watch what is the
01:52:21.200 | maximum length of each sentence in the source and the target for each of the two splits
01:52:27.320 | that we created here.
01:52:28.760 | So that if we choose a very small sequence length, we will know.
01:52:46.840 | Basically what we do, I load each sentence from each language, from the source and the
01:52:51.640 | target language.
01:52:52.640 | I convert into IDs using the tokenizer and I check the length.
01:52:56.340 | If the length is, let's say 180, we can choose 200 as sequence length, because it will cover
01:53:02.640 | all the possible sentences that we have in this data set.
01:53:06.360 | If it's, let's say 500, we can use 510 or something like this, because we also need
01:53:11.160 | to add the start of sentence and the end of sentence tokens to these sentences.
01:53:39.420 | This is the source IDs, then let's create also the target IDs, and this is the language
01:53:46.280 | of target.
01:53:50.240 | And then we just say the source maximum length is the maximum of the current maximum
01:54:02.640 | and the length of the current sentence; for the target it's the same but with the target IDs.
01:54:11.760 | Then we print these two values, we also do it for the target.
01:54:29.440 | And that's it.
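A small sketch of this check, continuing the getDataset method above (variable names are assumptions):

    # Find the longest tokenized sentence in each language to sanity-check seq_len.
    max_len_src = 0
    max_len_tgt = 0
    for item in ds_raw:
        src_ids = tokenizer_src.encode(item['translation'][config['lang_src']]).ids
        tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids
        max_len_src = max(max_len_src, len(src_ids))
        max_len_tgt = max(max_len_tgt, len(tgt_ids))
    print(f'Max length of source sentence: {max_len_src}')
    print(f'Max length of target sentence: {max_len_tgt}')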
01:54:30.440 | Now we can proceed to create the data loaders.
01:54:42.760 | We define the batch size according to our configuration, which we still didn't define,
01:54:47.160 | but you can already guess what are its values.
01:54:53.280 | We want it to be shuffled.
01:55:07.500 | For the validation, I will use a batch size of one, because I want to process each sentence
01:55:13.000 | one by one.
01:55:17.640 | And this method returns the data loader of the training, the data loader of the validation,
01:55:24.180 | the tokenizer of the source language and the tokenizer of the target language.
01:55:30.880 | Now we can start building the model.
01:55:34.960 | So let's define a new method called getModel, which will, according to our configuration,
01:55:40.880 | our vocabulary size, build the model, the transformer model.
01:55:52.260 | So the model is, we didn't import the model, so let's import it.
01:56:05.600 | Model transformer.
01:56:11.280 | What is the first?
01:56:12.280 | The source vocabulary size and the target vocabulary size.
01:56:20.280 | And then we have the sequence length.
01:56:25.200 | And we have the sequence length of the source language and the sequence length of the target
01:56:30.880 | language.
01:56:31.880 | We will use the same.
01:56:35.160 | For both.
01:56:37.080 | And then we have the dModel, which is the size of the embedding.
01:56:43.800 | We can keep all the rest, the default, as in the paper.
01:56:49.640 | If the model is too big for your GPU to be trained on, you can try to reduce the number
01:56:54.520 | of heads or the number of layers.
01:56:56.640 | Of course, it will impact the performance of the model.
01:57:00.640 | But I think given the dataset, which is not so big and not so complicated, it should not
01:57:06.260 | be a big problem because we are not building a huge dataset anyway.
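A hedged sketch of this getModel method, using the buildTransformer function defined in the model file (config key names are assumptions):

def get_model(config, vocab_src_len, vocab_tgt_len):
    # Keep the paper defaults for the other hyperparameters; reduce the number of
    # heads or layers in build_transformer if the model does not fit on your GPU.
    model = build_transformer(vocab_src_len, vocab_tgt_len,
                              config['seq_len'], config['seq_len'],
                              d_model=config['d_model'])
    return model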
01:57:10.640 | OK, now that we have the model, we can start building the training loop.
01:57:15.800 | But before we build the training loop, let me also define this configuration because it
01:57:19.960 | keeps coming up and I think it's better to define its structure now.
01:57:26.400 | So let's create a new file called config.py in which we define two methods.
01:57:32.660 | One is called getConfig and one is to get the path where we will save the weights
01:57:40.760 | of the model.
01:57:44.880 | OK, let's define the batch size.
01:57:54.240 | I choose 8.
01:57:55.240 | You can choose something bigger if your computer allows it.
01:57:58.440 | The number of epochs for which we will be training, I would say 20 is enough.
01:58:03.840 | The learning rate, I am using 10 to the power of -4.
01:58:08.740 | You can use other values.
01:58:11.640 | I thought this learning rate is reasonable.
01:58:16.880 | It's possible to change the learning rate during training.
01:58:21.320 | It's quite common to give a very high learning rate and then reduce it gradually with every
01:58:26.360 | epoch.
01:58:27.360 | We will not be using it because it will just complicate the code a little more and this
01:58:31.160 | is not actually the goal of this video.
01:58:33.840 | The goal of this video is to teach how the transformer works.
01:58:41.080 | I have already checked the sequence length that we need for this particular dataset from
01:58:46.760 | English to Italian, which is 350 is more than enough.
01:58:50.760 | And the D model that we will be using is the default of 512.
01:58:55.680 | The language source is English.
01:58:59.120 | So we are going from English.
01:59:00.880 | The language target is Italian.
01:59:03.520 | We are going to translate into Italian.
01:59:07.640 | We will save the model into the folder called weights.
01:59:16.760 | And the base name of the model files will be tmodel, so transformer model.
01:59:25.480 | I also built the code to preload the model in case we want to restart the training after
01:59:32.080 | maybe it crashed.
01:59:43.680 | And this is the tokenizer file.
01:59:46.320 | So it will be saved like this.
01:59:47.560 | So tokenizer_en and tokenizer_it according to the language.
01:59:52.720 | And this is the experiment name for TensorBoard on which we will save the losses while training.
02:00:04.720 | I think there is a comma here.
02:00:08.080 | Okay.
02:00:09.080 | Now let's define another method that allows us to find the part where we need to save
02:00:13.280 | the weights.
02:00:19.520 | Why I'm creating such a complicated structure is because I will provide also notebooks to
02:00:26.600 | run this training on Google Colab.
02:00:29.720 | So we just need to change these parameters to make it work on Google Colab and save the
02:00:34.320 | weights directly on your Google Drive.
02:00:36.560 | I have already created actually this code and it will be provided on GitHub and I will
02:00:42.880 | also provide the link in the video.
02:01:02.240 | Okay, the file is built according to model base name, then the epoch.pt.
02:01:27.920 | Let's import also here the path library.
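Putting the values above together, the config file might look like this (a sketch; the key names and the folder layout are assumptions):

from pathlib import Path

def get_config():
    return {
        'batch_size': 8,
        'num_epochs': 20,
        'lr': 1e-4,
        'seq_len': 350,
        'd_model': 512,
        'lang_src': 'en',
        'lang_tgt': 'it',
        'model_folder': 'weights',
        'model_basename': 'tmodel_',
        'preload': None,                       # set to an epoch string to resume training
        'tokenizer_file': 'tokenizer_{0}.json',
        'experiment_name': 'runs/tmodel',      # TensorBoard log directory
    }

def get_weights_file_path(config, epoch: str):
    # e.g. weights/tmodel_04.pt, resolved relative to the current directory.
    model_folder = config['model_folder']
    model_filename = f"{config['model_basename']}{epoch}.pt"
    return str(Path('.') / model_folder / model_filename)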
02:01:47.560 | Okay, now let's go back to our training loop.
02:01:52.880 | Okay, we can build the training loop now finally.
02:01:55.840 | So train model given the configuration.
02:01:59.960 | Okay, first we need to define which device on which we will put all the tensors.
02:02:05.680 | So define the device.
02:02:33.120 | Then we also print.
02:02:44.960 | We make sure that the weights folder is created.
02:03:04.760 | And then we load our data set.
02:03:28.040 | To get the vocabulary size, there is a method called get_vocab_size.
02:03:37.720 | And I think we don't have any other parameter.
02:03:41.160 | And finally, we transfer the model to our device.
02:03:47.520 | We also start TensorBoard. TensorBoard allows us to visualize the loss, the graphics, the charts.
02:04:06.320 | Let's also import TensorBoard.
02:04:28.920 | Let's go back.
02:04:30.240 | Let's also create the optimizer.
02:04:33.280 | I will be using the Adam optimizer.
02:04:48.000 | Okay, since we also have the configuration that allow us to resume the training in case
02:05:01.000 | the model crashes or something crashes, let's implement that one.
02:05:05.480 | And that will allow us to restore the state of the model and the state of the optimizer.
02:05:31.160 | Let's import this method we defined in the data set.
02:05:58.720 | We load the file.
02:06:23.120 | And we run it.
02:06:51.760 | Here we have a typo.
02:06:57.480 | Okay, the loss function we will be using is the cross entropy loss.
02:07:07.880 | We need to tell him what is the ignore index.
02:07:10.160 | So we want him to ignore the padding token basically.
02:07:14.080 | We don't want the padding token to contribute to the loss.
02:07:34.480 | And we also will be using label smoothing.
02:07:38.160 | Label smoothing basically allows our model to be less confident about its decision.
02:07:45.040 | So how to say, imagine our model is telling us to choose the word number three and with
02:07:52.280 | a very high probability.
02:07:53.600 | So what we will do with label smoothing is take a little percentage of that probability
02:07:57.320 | and distribute to the other tokens so that our model becomes less sure of its choices.
02:08:04.280 | So kind of less over fit and this actually improves the accuracy of the model.
02:08:12.020 | So we will use a label smoothing of 0.1 which means from every highest probability token
02:08:19.240 | take 0.1 of its probability score and give it to the others.
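A short sketch of this loss function (tokenizer_src and device come from the surrounding training code):

    loss_fn = nn.CrossEntropyLoss(
        ignore_index=tokenizer_src.token_to_id('[PAD]'),  # padding does not contribute to the loss
        label_smoothing=0.1,                              # spread 0.1 of the probability mass
    ).to(device)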
02:08:28.240 | Okay let's build finally the training loop, we tell the model to train.
02:08:47.320 | I build a batch iterator for the data loader using tqdm which will show a very nice progress bar.
02:09:18.400 | And we need to import tqdm.
02:09:34.280 | Okay finally we get the tensors, the encoder input.
02:09:49.000 | What is the size of this tensor?
02:09:51.280 | It's batch to sequence length.
02:09:54.880 | The decoder input is batch of decoder input and we also move it to our device, batch to
02:10:06.160 | sequence length, we get the two masks also.
02:10:25.040 | This is the size and then the decoder mask.
02:10:41.320 | Okay why these two masks are different?
02:10:44.240 | Because in the one case we are only telling him to hide only the padding tokens, in the
02:10:51.240 | other case we are also telling him to hide all these subsequent words, for each word
02:10:57.120 | to hide all the subsequent words to mask them out.
02:11:02.440 | Okay now we run the tensors through the transformer.
02:11:10.880 | So first we calculate the output of the encoder and we encode using what the encoder input
02:11:20.480 | and the mask of the encoder.
02:11:24.360 | Then we calculate the decoder output using the encoder output, the source mask, so the
02:11:36.920 | mask of the encoder, then the decoder input and the decoder mask.
02:11:46.800 | Okay as we know this the result of this so the output of the model.encode will be a batch
02:11:54.840 | sequence length d model.
02:11:59.800 | Also the output of the decoder will be batch sequence length d model.
02:12:08.120 | But we want to map it back to the vocabulary so we need the projection.
02:12:11.620 | So let's get the projection output.
02:12:19.960 | And this will produce a B so batch sequence length and target vocabulary size.
02:12:29.440 | Okay now that we have the output of the model we want to compare it with our label.
02:12:34.360 | So first let's extract the label from the batch.
02:12:42.580 | And we also put it on our device.
02:12:45.180 | So what is the label? It's (B, sequence length), so batch by sequence length, in which each
02:12:52.700 | position, so for each B and sequence length, tells us what is the
02:13:00.140 | position in the vocabulary of that particular word, and we want these two to be comparable,
02:13:08.660 | so to compute the loss we first reshape the projection output, I show you now: projection output view of
02:13:18.180 | minus one.
02:13:28.620 | Okay, what does this do? It basically transforms, I show you here, this size into this size:
02:13:40.600 | B multiplied by sequence length, and then target vocabulary size.
02:13:49.140 | Okay because we want to compare it with this.
02:13:52.600 | This is how the cross entropy wants the tensors to be.
02:14:00.300 | And also the label.
02:14:04.660 | Okay, now that we have calculated the loss, we can update our progress bar, this one, with
02:14:11.620 | the loss we have calculated.
02:14:39.560 | This will show the loss on our progress bar.
02:14:43.340 | We can also log it on TensorBoard.
02:14:52.380 | Let's also flush it.
02:15:07.260 | Okay now we can back propagate the loss so loss.backward and finally we update the weights
02:15:17.620 | of the model so that is the job of the optimizer and finally we can zero out the grad and we
02:15:28.040 | move the global step by one the global step is being used mostly for TensorBoard to keep
02:15:32.820 | track of the loss we can save the model every epoch okay model file name which we get from
02:15:45.700 | our special methods this one we tell him the configuration we have and the name of the
02:15:54.140 | file which is the epoch but with zeros in front and we save our model.
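Putting the inner training step just described together, a hedged sketch (variable names such as batch, batch_iterator, writer, loss_fn and tokenizer_tgt follow the surrounding description and are assumptions):

    encoder_input = batch['encoder_input'].to(device)    # (B, seq_len)
    decoder_input = batch['decoder_input'].to(device)    # (B, seq_len)
    encoder_mask = batch['encoder_mask'].to(device)      # (B, 1, 1, seq_len)
    decoder_mask = batch['decoder_mask'].to(device)      # (B, 1, seq_len, seq_len)

    # Run the tensors through the transformer.
    encoder_output = model.encode(encoder_input, encoder_mask)                               # (B, seq_len, d_model)
    decoder_output = model.decode(encoder_output, encoder_mask, decoder_input, decoder_mask) # (B, seq_len, d_model)
    proj_output = model.project(decoder_output)                                              # (B, seq_len, tgt_vocab_size)

    label = batch['label'].to(device)                    # (B, seq_len)
    # (B, seq_len, tgt_vocab_size) -> (B * seq_len, tgt_vocab_size), as cross entropy expects.
    loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))

    batch_iterator.set_postfix({'loss': f'{loss.item():6.3f}'})   # show the loss on the progress bar
    writer.add_scalar('train loss', loss.item(), global_step)     # log it on TensorBoard
    writer.flush()

    loss.backward()          # backpropagate
    optimizer.step()         # update the weights
    optimizer.zero_grad()    # zero out the gradients
    global_step += 1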
02:16:06.860 | It is very good idea when we want to be able to resume the training to also save not only
02:16:12.740 | the state of the model but also the state of the optimizer because the optimizer also
02:16:18.340 | keep tracks of some statistics one for each weight to understand how to move each weight
02:16:24.820 | independently and usually actually I saw that the optimizer dictionary is quite big so even
02:16:35.860 | if it's big if you want your training to be resumable you need to save it otherwise the
02:16:40.800 | optimizer will always start from zero and we'll have to figure out from zero even if
02:16:45.980 | you start from a previous epoch how to move each weight so every time we save some snapshot
02:16:52.660 | I always include it.
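What such a snapshot might look like, using the getWeightsFilePath helper sketched with the config earlier (names are assumptions):

    model_filename = get_weights_file_path(config, f'{epoch:02d}')
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),          # all the weights of the model
        'optimizer_state_dict': optimizer.state_dict(),  # needed to resume training properly
        'global_step': global_step,
    }, model_filename)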
02:16:58.660 | State of the model this is all the weights of the model we also want to save the optimizer
02:17:09.020 | let's do also the global step and we want to save all this into the file
02:17:28.240 | name, so model file name, and that's it. Now let's build the code to run this, so the if name equals main block.
02:17:44.280 | I really find the warnings frustrating so I want to filter them out because I have some
02:17:50.260 | a lot of libraries especially CUDA I already know what's the content and so I don't want
02:17:57.040 | to visualize them every time but for sure for you guys I suggest watching them at least
02:18:02.680 | once to understand if there is any big problem otherwise they're just complaining from CUDA
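A sketch of that entry point:

import warnings

if __name__ == '__main__':
    warnings.filterwarnings('ignore')  # hide the (mostly CUDA-related) warnings discussed above
    config = get_config()
    train_model(config)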
02:18:22.440 | okay let's try to run this code and see if everything is working fine we should what
02:18:29.760 | we expect is that the code should download the data set the first time then it should
02:18:35.240 | create the tokenizer and save it into its file and it should also start training the
02:18:42.960 | model for 20 epochs of course it will never finish but let's do it let me check again
02:18:48.760 | the configuration tokenizer okay let's run it
02:19:13.520 | okay it's building the tokenizer and we have some problem here sequence length okay finally
02:19:20.040 | the model is training I show you recap you guys what I had mistaken first of all the
02:19:26.920 | sequence length was written incorrectly there was a capital L here and also in the data
02:19:32.720 | set I forgot to save it here and here I had it also written capitalized so L was capital
02:19:41.120 | and now the training is going on and as you can see the training is quite fast or at least
02:19:48.440 | on my computer actually not so fast but because I chose a batch size of 8 I could try to increase
02:19:55.680 | it and it's happening on CUDA the loss is decreasing and the weights will be saved here
02:20:03.080 | so if we reach the end of the epoch it will create the first weight here so let's wait
02:20:07.720 | until the end of the epoch and see if the weight is actually created before actually
02:20:12.800 | finishing the training of the model let's do another thing we also would like to visualize
02:20:19.000 | the output of the model while we are training and this is called validation so we want to
02:20:23.960 | check how our model is evolving while it is getting trained so what we want to build is
02:20:31.160 | a validation loop which will allow us to evaluate the model which also means that we want to
02:20:36.940 | inference from this model and check some sample sentences and see if how they get translated
02:20:43.520 | so let's start building the validation loop the first thing we do is we build a new method
02:20:48.280 | called run validation and this method will accept some parameters that we will use for
02:21:01.920 | now I just write all of them and later I explain how they will be used
02:21:05.500 | so, we have a new method called run validation and this method will accept some parameters, we will use them later.
02:21:12.500 | [typing]
02:21:38.500 | okay the first thing we do to run the validation is we put our model into evaluation mode so
02:21:46.260 | we do model.eval and this means that this tells PyTorch that we are going to evaluate
02:21:52.100 | our model and then what we will do we will inference two sentences and see what is the
02:22:01.740 | output of the model. [typing]
02:22:30.740 | so with torch.no_grad we are disabling the gradient calculation for every tensor
02:22:47.400 | that we will run inside this with block and this is exactly what we want we just want
02:22:52.100 | to inference from the model we don't want to train it during this loop so let's get
02:22:58.700 | a batch from the validation data set because we want to inference only two so we keep a
02:23:05.380 | count of how many we have already processed and we get the input from this current batch
02:23:13.340 | I want to remind you that for the validation ds we only have a batch size of 1 [typing]
02:23:28.120 | this is the encoder input and we can also get the encoder mask
02:23:43.100 | let's just verify that the size of the batch is actually 1 [typing]
02:24:01.040 | and now let's go to the interesting part so as you remember when we calculate the when
02:24:09.360 | we want to inference the model we need to calculate the encoder output only once and
02:24:14.440 | reuse it for every token that the model will output from the decoder so let's create another
02:24:20.600 | function that will run the greedy decoding on our model and we will see that it will
02:24:26.520 | run the encoder only once so let's call this function greedy decode [typing]
02:24:54.080 | okay let's create some tokens that we will need so the SOS token which is the start of
02:25:01.140 | sentence, we can get it from either tokenizer, it doesn't matter if it's the source or the
02:25:19.660 | target, and then the EOS. Okay, and then what we do is we pre-compute the encoder output and reuse
02:25:35.740 | it for every token we get from the decoder so
02:25:42.180 | we just give the source and the source mask which is the encoder input and the encoder
02:25:57.340 | mask, we can also call it encoder input and encoder mask. Okay,
02:26:06.800 | how do we do the inferencing the first thing we do is we give to the decoder the start
02:26:11.820 | of sentence token so that the decoder will output the first token of the sentence of
02:26:18.060 | the translated sentence then at every iteration just like we saw in my slides at every iteration
02:26:24.540 | we add the previous token to the decoder input, so that the decoder can output the
02:26:31.660 | next token. Then we take the next token, we append it again to the input of the
02:26:36.620 | decoder, and we get the successive token. So let's build a decoder input for the first
02:26:42.780 | iteration which is only the start of sentence token
02:27:06.780 | we fill this one with the start of sentence token
02:27:17.280 | and it has the same type as the encoder input. Okay, now we will keep asking the decoder
02:27:26.740 | to output the next token until we reach either the end of sentence token or the max length
02:27:32.420 | we have defined here so we can do a while true and then our first stopping condition
02:27:39.260 | is if the decoder input, which grows as we append the decoder's output at every step, becomes as long
02:27:46.760 | as max length
02:27:58.140 | here why do we have two dimensions one is for the batch and one is for the tokens of
02:28:02.620 | the decoder input
02:28:09.740 | now we also need to create a mask for this
02:28:24.820 | we can use our function causal mask to say that we don't want the input to watch future
02:28:31.580 | words
02:28:39.780 | and we don't need the padding mask, because here we don't have any padding tokens, as you can see.
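As a reminder, the causal_mask helper from the dataset file looks roughly like this. This is a sketch: the exact name and signature should match whatever you wrote earlier.

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Ones above the diagonal mark "future" positions; comparing with 0 turns this
    # into a boolean mask that is True only where attending is allowed
    # (the current position and everything before it).
    mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
    return mask == 0
```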
02:28:52.900 | now we calculate the output
02:29:07.820 | we reuse the output of the encoder for every iteration of the loop we reuse the source
02:29:15.260 | mask, so the mask of the encoder input, then we give the decoder input along with
02:29:21.220 | its mask the decoder mask and then we get the next token
02:29:31.860 | so we get the probabilities of the next token using the projection layer
02:29:39.860 | but we only want the projection of the last token so the next token after the last we
02:29:47.260 | have given to the decoder
02:29:51.980 | now we can use the max
02:29:59.460 | so we get the token with the maximum probability this is the greedy search
02:30:18.620 | and then we get this word and we append it back to this one because it will become the
02:30:24.740 | input of the next iteration
02:30:30.140 | and we concat
02:30:33.700 | so we take the decoder input and we append the next token so we create another tensor
02:30:39.900 | for that
02:31:08.900 | yeah, it should be correct. Okay, if the next token, so if the next word or token, is equal
02:31:25.700 | to the end of sentence token then we also stop the loop
02:31:31.260 | and this is our greedy search now we can just return the output so the output is basically
02:31:37.660 | the decoder input because every time we are appending the next token to it and we remove
02:31:42.540 | the batch dimension so we squeeze it
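Putting the pieces together, the greedy decoding just described might look like the sketch below. The encode, decode and project methods on the model, the '[SOS]'/'[EOS]' token names, and the causal_mask helper are assumptions carried over from the code written earlier in the video; adjust them if your names differ.

```python
import torch

def greedy_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device):
    sos_idx = tokenizer_tgt.token_to_id('[SOS]')
    eos_idx = tokenizer_tgt.token_to_id('[EOS]')

    # Pre-compute the encoder output once and reuse it for every decoding step.
    encoder_output = model.encode(source, source_mask)
    # The decoder input starts with the start-of-sentence token only.
    decoder_input = torch.empty(1, 1).fill_(sos_idx).type_as(source).to(device)

    while True:
        # Stop once the generated sequence reaches the maximum length.
        if decoder_input.size(1) == max_len:
            break

        # Causal mask so the decoder cannot attend to future positions.
        decoder_mask = causal_mask(decoder_input.size(1)).type_as(source_mask).to(device)

        # Reuse the cached encoder output at every step.
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)

        # Project only the last position and pick the most probable token (greedy search).
        prob = model.project(out[:, -1])
        _, next_word = torch.max(prob, dim=1)

        # Append the predicted token so it becomes part of the next decoder input.
        decoder_input = torch.cat(
            [decoder_input,
             torch.empty(1, 1).type_as(source).fill_(next_word.item()).to(device)],
            dim=1,
        )

        if next_word == eos_idx:
            break

    # Remove the batch dimension before returning the generated token ids.
    return decoder_input.squeeze(0)
```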
02:31:47.660 | and that's our greedy decoding now we can use it here in this function so in the validation
02:31:54.060 | function so we can finally get the model output is equal to greedy decode in which we give
02:32:02.300 | it all the parameters
02:32:19.620 | and then we want to compare this model output with what we expected so with the label so
02:32:26.900 | let's collect all of these: the input we gave to the model, what the
02:32:32.820 | model produced as output, so the predicted text, and what we expected as output. We
02:32:38.500 | save all of this in these lists, and then at the end of the loop we will print them on the console.
02:32:57.780 | to get the text of the output of the model we need to use the tokenizer again to convert
02:33:18.500 | the tokens back into text and we use of course the target tokenizer because this is the target
02:33:25.940 | language
02:33:40.460 | okay and now we save all of this into their respective lists
02:34:06.620 | and we can also print it on the console
02:34:13.820 | Why are we using this function called print_msg, and why not
02:34:18.540 | just use Python's print? Because here, in the main training
02:34:23.940 | loop, we are using tqdm, which is our really nice looking progress bar, but it is not suggested
02:34:32.260 | to print directly on the console while this progress bar is running. So to print on the
02:34:38.460 | console there is a method called write provided by tqdm, and we will give this method
02:34:45.740 | to this function so that the output does not interfere with the progress bar printing
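Here is a tiny, self-contained illustration of why tqdm's write method is used instead of plain print: messages printed through tqdm appear above the bar without breaking it. The training code itself wraps this in a print_msg lambda, as described in a moment.

```python
import time
from tqdm import tqdm

for i in tqdm(range(10), desc='Processing'):
    time.sleep(0.05)
    if i == 5:
        # Printing through tqdm keeps the progress bar intact,
        # whereas a plain print() would leave broken bar fragments behind.
        tqdm.write('Halfway there: this line does not corrupt the progress bar')
```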
02:34:58.900 | so we print some bars
02:35:04.420 | and then we print all the messages
02:35:05.860 | okay, so we print the source text here, the expected target here, and the predicted text here.
02:35:34.300 | and if we have already processed the number of examples we wanted, then we just break.
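The whole validation routine just described could therefore look roughly like this, using the greedy_decode sketch from above. It is only a sketch: the batch keys ('encoder_input', 'encoder_mask', 'src_text', 'tgt_text') and the exact parameter list are assumptions based on the dataset and decoding code written earlier.

```python
import torch

def run_validation(model, validation_ds, tokenizer_src, tokenizer_tgt, max_len,
                   device, print_msg, global_step, writer, num_examples=2):
    model.eval()  # switch the model to evaluation mode
    count = 0

    source_texts = []
    expected = []
    predicted = []

    with torch.no_grad():  # we only run inference here, no gradients needed
        for batch in validation_ds:
            count += 1
            encoder_input = batch['encoder_input'].to(device)  # (1, seq_len)
            encoder_mask = batch['encoder_mask'].to(device)    # (1, 1, 1, seq_len)

            # The validation DataLoader was built with batch_size=1.
            assert encoder_input.size(0) == 1, 'Batch size must be 1 for validation'

            model_out = greedy_decode(model, encoder_input, encoder_mask,
                                      tokenizer_src, tokenizer_tgt, max_len, device)

            source_text = batch['src_text'][0]
            target_text = batch['tgt_text'][0]
            # Convert the generated ids back to text with the target tokenizer.
            model_out_text = tokenizer_tgt.decode(model_out.detach().cpu().numpy())

            source_texts.append(source_text)
            expected.append(target_text)
            predicted.append(model_out_text)

            # Print through tqdm so we do not interfere with the progress bar.
            print_msg('-' * 80)
            print_msg(f'SOURCE:    {source_text}')
            print_msg(f'TARGET:    {target_text}')
            print_msg(f'PREDICTED: {model_out_text}')

            if count == num_examples:
                break
```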
02:35:48.260 | So why have we created these lists? Actually, we can also send all of this to TensorBoard,
02:35:58.700 | so for example if we have TensorBoard enabled we can send all of this to TensorBoard,
02:36:04.280 | and to do that we actually need another library that allows us to calculate some metrics. I
02:36:10.900 | think we can skip this part, but if you are really interested, in the code I published
02:36:17.620 | on GitHub you will find that I use this library called torchmetrics, which allows us to
02:36:24.020 | calculate the character error rate and the BLEU metric, which is really useful for translation
02:36:31.820 | tasks, and the word error rate. So if you are really interested, you can find the code on GitHub.
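If you do want those metrics, a minimal sketch with torchmetrics could look like the following, assuming a recent torchmetrics version where the text metrics live under torchmetrics.text; the example strings are made up.

```python
from torchmetrics.text import BLEUScore, CharErrorRate, WordErrorRate

predicted = ['il gatto è sul tappeto']        # hypothetical model output
expected = ['il gatto è sopra il tappeto']    # hypothetical reference translation

cer = CharErrorRate()(predicted, expected)    # character error rate
wer = WordErrorRate()(predicted, expected)    # word error rate
# BLEU expects a list of reference sentences per prediction.
bleu = BLEUScore()(predicted, [[ref] for ref in expected])
print(f'CER: {cer.item():.3f}  WER: {wer.item():.3f}  BLEU: {bleu.item():.3f}')
```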
02:36:39.420 | but for our demonstration I think it's not necessary, and actually we can also
02:36:47.220 | remove this, given that we are not doing this part. Okay, so now that we have our run_validation
02:36:55.140 | method we can just call it okay what I usually do is I run the validation at every few steps
02:37:05.700 | but because we want to see it as soon as possible what we will do is we will first run it at
02:37:13.880 | every iteration and we also put this model.train inside of this loop so that every time after
02:37:22.100 | we run the validation the model is back into its training mode so now we can just run validation
02:37:29.500 | and we give it all the parameters that it needs to run the validation: so, give it the model, okay,
02:37:57.140 | for the print message, are we printing any message? We are, so let's create a lambda that just
02:38:09.020 | writes the message with tqdm, then we need to give the global step
02:38:25.060 | and the writer, which we will not use, but okay, now I think we can run the training again
02:38:33.660 | and see if the validation works
02:38:58.660 | all right looks like it is working so the model is okay it's running the validation
02:39:05.780 | at every step which is not desirable at all but at least we know that the greedy search
02:39:10.820 | is working, or at least it looks like it is working, and the model is not predicting
02:39:17.060 | anything useful actually, it's just predicting a bunch of commas, because it has barely trained
02:39:23.980 | at all. But if we train the model for a while, we should see that after a few epochs the
02:39:29.940 | model should become better and better and better so let's stop this training and let's
02:39:36.980 | put this one back to where it belongs so at the end of every epoch here and this one we
02:39:43.880 | can keep it here, no problem. Yeah, okay, I will now fast forward to a model that has
02:39:52.860 | been pre-trained I pre-trained it for a few hours so that we can inference it and we can
02:39:58.980 | visualize the attention I have copied the pre-trained weights that I pre-calculated
02:40:05.660 | and I also created this notebook reusing the functions that we have defined before in the
02:40:10.900 | train file the code is very simple actually I just copy and pasted the code from the train
02:40:15.980 | file I just load the model and run the validation the same method that we just wrote and then
02:40:22.460 | I ran the validation on the pre-trained model. Let's run it again, for example, and as you can see
02:40:28.860 | the model is inferencing 10 example sentences and the result is not bad. I mean, we can see
02:40:35.020 | that "Levin smiled", "Levin sorrise", "Levin sorrise": it's matching, and most of them are matching actually,
02:40:40.420 | we could also say that it's nearly overfit for this particular data, but this is the power
02:40:47.740 | of the transformer I didn't train it for many days I just trained it for a few hours if
02:40:53.020 | I remember correctly, and the results are really really good. And now let's make
02:40:58.760 | the notebook that we will use to visualize the attention of this pre-trained model given
02:41:06.080 | the file that we built before so train.py you can also train your own model choosing
02:41:10.820 | the language of your choice which I highly recommend that you change the language and
02:41:15.060 | try to see how the model is performing, and try to diagnose why the model is performing
02:41:21.240 | badly, if it is, or, if it's performing well, try to understand how you can improve
02:41:26.620 | it further. So let's try to visualize the attention. So let's create a new notebook, let's call
02:41:33.660 | it let's say attention visualization okay so the first thing we do we import all the
02:41:47.740 | libraries we will need.
02:42:14.060 | I will also be using this library called Altair it's a visualization library for charts it's
02:42:25.180 | nothing related to deep learning actually, it's just a visualization library, and in
02:42:30.420 | particular the visualization function I use I found online, it's not written by me, just
02:42:34.660 | like most of the visualization functions you can find easily on the internet if you want
02:42:38.240 | to build a chart or if you want to build a histogram etc so I am using this library mostly
02:42:43.920 | because I copied the code from the internet to visualize it but all the rest is my own
02:42:48.700 | code so let's import it okay let's import all of this and of course you will have to
02:43:16.780 | install this particular library when you run the code on your computer let's also define
02:43:22.280 | the device you can just copy the code from here
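For completeness, the device selection copied from the training script is just the usual PyTorch pattern:

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device {device}')
```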
02:43:38.760 | and then we load the model which we can copy from here like this okay let's paste it here
02:43:51.640 | and this one becomes vocabulary source and vocabulary target
02:44:18.320 | okay now let's make a function to load the batch
02:44:22.560 | okay so we have a function called load_batch which is a function that loads the batch and
02:44:51.000 | oops.
02:44:58.000 | I will convert the batch into tokens.
02:45:27.960 | I will convert the tokens now using the tokenizer
02:45:29.600 | (keyboard clicking)
02:46:08.760 | And of course, for the decoder,
02:46:10.000 | we use the target vocabulary.
02:46:11.760 | So the target tokenizer.
02:46:15.560 | (keyboard clicking)
02:46:19.560 | So let's just infer
02:46:30.800 | using our greedy decode algorithm.
02:46:36.800 | So we provide the model.
02:46:38.520 | (keyboard clicking)
02:47:05.000 | We return all this information.
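A sketch of what this notebook helper might look like. The dataloader variable, the batch keys, and the id_to_token conversion are assumptions based on the dataset code, and greedy_decode is the function from the training file (sketched earlier).

```python
def load_next_batch(model, val_dataloader, tokenizer_src, tokenizer_tgt, max_len, device):
    # Grab one batch from the validation dataloader (batch size 1).
    batch = next(iter(val_dataloader))
    encoder_input = batch['encoder_input'].to(device)
    encoder_mask = batch['encoder_mask'].to(device)
    decoder_input = batch['decoder_input'].to(device)

    # Convert the input ids to human-readable tokens for the chart axes.
    encoder_input_tokens = [tokenizer_src.id_to_token(idx) for idx in encoder_input[0].cpu().numpy()]
    decoder_input_tokens = [tokenizer_tgt.id_to_token(idx) for idx in decoder_input[0].cpu().numpy()]

    # Run greedy decoding so the attention scores are populated for this sentence.
    model_out = greedy_decode(model, encoder_input, encoder_mask,
                              tokenizer_src, tokenizer_tgt, max_len, device)

    return batch, encoder_input_tokens, decoder_input_tokens, model_out
```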
02:47:07.520 | (keyboard clicking)
02:47:10.520 | Okay, now I will build the necessary functions
02:47:18.520 | to visualize the attention.
02:47:23.440 | I will copy some functions from another file
02:47:25.480 | because actually what we are going to build
02:47:27.520 | is nothing interesting from a learning point of view
02:47:31.480 | with regards to the deep learning.
02:47:33.720 | It's mostly functions to visualize the data.
02:47:36.240 | So I will copy it because it's quite long to write
02:47:39.040 | and the salient part I will explain, of course.
02:47:42.280 | And this is the function.
02:47:44.160 | Okay, what does this function do?
02:47:46.600 | Basically, we have the attention
02:47:48.680 | that we will get from the encoder.
02:47:50.440 | How to get the attention from the encoder?
02:47:53.120 | For example, the attention we have in three positions.
02:47:55.800 | First is in the encoder.
02:47:57.120 | The second one is in the decoder
02:47:59.200 | at the beginning of the decoder,
02:48:00.560 | so the self-attention of the decoder.
02:48:02.400 | And then we have the cross-attention
02:48:04.160 | between the encoder and the decoder.
02:48:06.360 | So we can visualize three type of attention.
02:48:09.000 | How to get the information about the attention?
02:48:11.560 | Well, we load the other model.
02:48:13.600 | We have the encoder.
02:48:14.800 | We choose which layer we want to get the attention from.
02:48:18.400 | And then from each layer,
02:48:19.400 | we can get the self-attention block
02:48:21.880 | and then its attention scores.
02:48:23.720 | So, where does this variable come from?
02:48:28.000 | If you remember when we defined
02:48:29.720 | the attention calculation here,
02:48:32.240 | here, when we calculate the attention,
02:48:36.960 | we not only return the output to the next layer,
02:48:40.680 | we also give this attention scores,
02:48:42.680 | which is the output of the softmax.
02:48:45.640 | And we save it here in this variable,
02:48:49.560 | self.attention_scores.
02:48:51.200 | Now we can just retrieve it and visualize it.
02:48:55.720 | So this function will,
02:48:58.080 | based on which attention we want to get
02:49:00.320 | from which layer and from which head,
02:49:02.120 | will select the matrix, the correct matrix.
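A sketch of how the attention matrix for a given position, layer and head can be pulled out of the model. The attribute names (encoder.layers, self_attention_block, cross_attention_block, attention_scores) are the ones used when the model was built earlier; adjust them if yours differ.

```python
import torch

def get_attn_map(model, attn_type: str, layer: int, head: int) -> torch.Tensor:
    # attention_scores is saved by the multi-head attention block during the
    # last forward pass and has shape (batch, h, seq_len, seq_len).
    if attn_type == 'encoder':
        block = model.encoder.layers[layer].self_attention_block
    elif attn_type == 'decoder':
        block = model.decoder.layers[layer].self_attention_block
    elif attn_type == 'encoder-decoder':
        block = model.decoder.layers[layer].cross_attention_block
    else:
        raise ValueError(f'Unknown attention type: {attn_type}')
    # Take the first (and only) batch element and the requested head.
    return block.attention_scores[0, head].data
```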
02:49:06.440 | This function builds a data frame
02:49:09.400 | to visualize the information.
02:49:11.400 | So the tokens and the score
02:49:14.400 | extracted from this matrix here.
02:49:16.640 | So from this matrix
02:49:18.280 | we extract the rows and the columns.
02:49:21.320 | And then we also build the chart.
02:49:24.040 | The chart is built with Altair.
02:49:27.600 | And what we will build, actually,
02:49:29.760 | is we will get the attention for all of them:
02:49:33.920 | I built this method to get the attention
02:49:37.560 | for all the heads and all the layers
02:49:40.000 | that we pass to this function as input.
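The data-frame and chart helpers are roughly as follows. This is a sketch of the idea rather than the exact copied code, and the column names are my own choice.

```python
import altair as alt
import pandas as pd

def attn_to_dataframe(attn, row_tokens, col_tokens, max_len):
    # Flatten the (seq_len, seq_len) score matrix into "long" format for Altair.
    records = []
    for r in range(min(max_len, len(row_tokens))):
        for c in range(min(max_len, len(col_tokens))):
            records.append({
                'row': f'{r:03d} {row_tokens[r]}',
                'column': f'{c:03d} {col_tokens[c]}',
                'value': float(attn[r, c]),
            })
    return pd.DataFrame(records)

def attn_chart(df, title):
    # One heat-map cell per (row token, column token) pair, with the score on hover.
    return (
        alt.Chart(df, title=title)
        .mark_rect()
        .encode(x='column:N', y='row:N', color='value:Q',
                tooltip=['row', 'column', 'value'])
        .properties(width=300, height=300)
    )
```

In the notebook, several of these charts are then arranged in a grid (for example with alt.hconcat and alt.vconcat) so that one chart per layer and head can be shown at once.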
02:49:43.840 | So let me run this cell now.
02:49:46.120 | Okay, let's create a new cell.
02:49:49.520 | And then let's just run it.
02:49:51.320 | Okay, first we want to visualize
02:49:53.200 | the sentence that we are dealing with.
02:49:55.320 | So the batch.
02:49:57.200 | Order input tokens.
02:50:04.440 | So we load a batch.
02:50:10.680 | And then we visualize what is the source and the target.
02:50:14.640 | (keys clacking)
02:50:17.320 | (keys clacking)
02:50:46.560 | And then also the target.
02:50:48.640 | And finally we calculate also the length.
02:50:57.680 | What is the length?
02:51:08.040 | Okay, it's basically all the characters
02:51:10.400 | that come before the padding character.
02:51:12.440 | So the first occurrence of the padding character.
02:51:15.000 | Because this is the batch taken from the dataset,
02:51:17.320 | which is already the tensor built for training,
02:51:19.680 | so they already include the padding.
02:51:21.400 | In our case, we just want to retrieve
02:51:23.000 | the number of actual characters in our sentence.
02:51:26.880 | So, to get
02:51:28.280 | the number of actual words in our sentence,
02:51:30.600 | we can count the number of words
02:51:32.560 | that come before the padding.
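As a small sketch (the '[PAD]' token name is the one used when building the dataset; the helper name is my own):

```python
def length_before_padding(tokens, pad_token='[PAD]'):
    # Number of tokens that appear before the first padding token.
    return tokens.index(pad_token) if pad_token in tokens else len(tokens)
```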
02:51:34.000 | So let's run this one.
02:51:37.120 | And there is some problem.
02:51:39.120 | (keys clacking)
02:51:41.800 | Ah, here I forgot to,
02:51:48.680 | this function was wrong, so now it should work.
02:51:54.120 | Okay, this sentence is too small, let's get a longer one.
02:51:57.560 | Okay, let me check the quality.
02:52:00.920 | You cannot remain as you are, especially you.
02:52:03.240 | (speaking in foreign language)
02:52:06.240 | Okay, looks not bad.
02:52:08.920 | Okay, let's print the attention for the layers,
02:52:11.800 | let's say, zero, one, and two.
02:52:14.680 | Because we have six of them,
02:52:15.960 | if you remember, the parameter is n is equal to six.
02:52:19.520 | So we will just visualize three layers.
02:52:22.440 | And we will visualize all the heads.
02:52:24.840 | We have eight of them for each layer.
02:52:27.080 | So the head number zero,
02:52:28.320 | one, two, three, four, five, six, and seven.
02:52:31.860 | Okay, let's first visualize the encoder self-attention.
02:52:38.800 | And we do get all attention maps.
02:52:41.440 | Which one we want?
02:52:42.400 | So the encoder one.
02:52:44.520 | And we want these layers and these heads.
02:52:48.160 | And what are the row tokens?
02:52:49.840 | The encoder input tokens.
02:52:52.600 | And what do we want in the column?
02:52:55.580 | Because we are gonna build a grid.
02:52:57.320 | So as you know, the attention is a grid
02:53:01.680 | that correlates rows with columns.
02:53:04.000 | In our case, we are talking about
02:53:05.600 | the self-attention of the encoder.
02:53:07.040 | So it's the same sentence that is attending itself.
02:53:11.160 | So we need to provide the input sentence of the encoder
02:53:14.760 | on both the rows and the columns.
02:53:17.360 | And what is the maximum number of length
02:53:19.380 | that we want to visualize?
02:53:21.120 | Okay, let's say we want to visualize no more than 20.
02:53:24.000 | So the minimum of 20 and sentence length.
02:53:27.240 | Okay, this is our visualization.
02:53:32.780 | We can see, and as we expected, actually,
02:53:36.920 | when we visualize the attention,
02:53:38.260 | we expect the values along the diagonals to be high
02:53:42.280 | because it's the dot product of each token with itself.
02:53:45.880 | And we can see also that there are
02:53:48.800 | other interesting relationship.
02:53:50.160 | For example, we see that the start of sentence token
02:53:53.520 | and the end of sentence token,
02:53:54.840 | at least for the head zero and the layer zero,
02:53:57.640 | they are not related to other words,
02:53:59.880 | like I would expect, actually.
02:54:02.880 | But other heads, they do learn some very small mapping.
02:54:07.640 | If we hover over each of the grid cells,
02:54:10.840 | we can see the actual value of the self-attention,
02:54:14.320 | so the score of the self-attention.
02:54:16.360 | For example, we can see the attention is very strong here,
02:54:19.480 | so the word "especially" and "especially" are related,
02:54:23.100 | so it's the same word with itself,
02:54:24.980 | but also especially and now.
02:54:28.280 | And we can visualize this kind of attention
02:54:31.560 | for all the layers.
02:54:32.520 | So because each head will watch a different aspect
02:54:37.520 | of each word, because we are distributing
02:54:39.760 | the word embedding among the heads equally,
02:54:42.760 | so each head will see a different part
02:54:45.720 | of the embedding of the word.
02:54:47.360 | We also hope that they learn different kind of mapping
02:54:51.300 | between the words.
02:54:52.160 | And this is actually the case.
02:54:54.760 | And between one layer and the next,
02:54:57.360 | we also have different WQ, WK, and WV matrices.
02:55:01.880 | So they should also learn different relationships.
02:55:06.560 | Now we may also want to visualize
02:55:09.520 | the attention of the decoder.
02:55:12.260 | So let's do it.
02:55:13.960 | Let me just copy the code and just change the parameters.
02:55:31.520 | Okay, here we want the decoder one,
02:55:34.080 | we want the same layers, et cetera,
02:55:36.100 | but the tokens that will be on the rows and the columns
02:55:41.040 | are the decoder tokens.
02:55:43.160 | So decoder input tokens and decoder input tokens.
02:55:46.480 | Let's visualize.
02:55:48.120 | And also we should see Italian language now
02:55:50.600 | because we are using the decoder self-attention.
02:55:53.480 | And it is.
02:55:55.160 | So here we see a different kind of attention
02:55:57.840 | on the decoder side.
02:55:59.720 | And also here we have multiple heads
02:56:02.560 | that should learn different mapping
02:56:04.320 | and also different layers should learn
02:56:06.020 | different mappings between words.
02:56:07.720 | The one I find most interesting is the cross attention.
02:56:11.840 | So let's have a look at that.
02:56:13.320 | Okay, let me just copy the code and run it again.
02:56:26.080 | Okay, so if you remember the method,
02:56:29.120 | it's encoder, decoder, same layer.
02:56:32.520 | So here on the rows we will show the encoder input
02:56:36.960 | and on the columns we will show the decoder input tokens
02:56:40.120 | because it's a cross attention
02:56:41.520 | between the encoder and the decoder.
02:56:43.280 | Okay, this is more or less how the interaction
02:56:50.880 | between the encoder and the decoder works
02:56:55.240 | and how it happens.
02:56:56.960 | So this is where we find the cross attention calculated
02:57:01.960 | using the keys and the values coming from the encoder
02:57:07.160 | while the query is coming from the decoder.
02:57:09.600 | So this is actually where the translation task happens.
02:57:14.480 | And this is how the model learns to relate
02:57:19.480 | these two sentences to each other
02:57:21.840 | to actually calculate the translation.
02:57:24.880 | So I invite you guys to run the code by yourself.
02:57:29.520 | So the first suggestion I give you
02:57:31.160 | is to write the code along with me with the video.
02:57:35.040 | You can pause the video, you can write the code by yourself.
02:57:39.800 | Okay, let me give you some practical examples.
02:57:42.400 | For example, when I'm writing the model code,
02:57:45.120 | I suggest you watch me write the code
02:57:48.440 | for one particular layer and then stop the video,
02:57:52.680 | write it by yourself, take some time.
02:57:55.040 | Don't watch the solution right away.
02:57:57.440 | Try to figure out what is going wrong.
02:57:59.920 | And if you really cannot, after one, two minutes,
02:58:02.440 | you cannot really figure out what is the problem,
02:58:04.440 | you can have a glimpse at the video.
02:58:06.160 | But try to do it by yourself.
02:58:08.200 | Some things, of course, you cannot come up by yourself.
02:58:11.120 | So for example, for the positional encoding
02:58:13.000 | and all this calculation,
02:58:14.200 | it's basically just an application of formulas.
02:58:17.920 | But the point is you should at least be able
02:58:20.480 | to come with a structure by yourself.
02:58:22.320 | So how all the layers are interacting with each other.
02:58:26.360 | This is my first recommendation.
02:58:28.280 | And as for the training loop,
02:58:30.440 | the training part actually is quite standard.
02:58:33.800 | So it's very similar to other training loops
02:58:37.440 | that you may have seen.
02:58:38.840 | The interesting part is how we calculate the loss
02:58:43.160 | and how we use the transformer model.
02:58:46.120 | And the last thing that is really important
02:58:48.640 | is how we inference the model,
02:58:50.480 | which is in this greedy decode.
02:58:52.800 | So thank you everyone for watching the video
02:58:55.280 | and for staying with me for so long.
02:58:57.480 | I can assure you that it was worth it.
02:58:59.760 | And I hope in the next videos to make more examples
02:59:04.680 | of transformers and other models that I am familiar with
02:59:09.160 | and I also want to explore with you guys.
02:59:11.640 | So let me know if there is something
02:59:13.320 | that you don't understand or you want me to explain better.
02:59:16.960 | I will also for sure follow the comment section.
02:59:20.280 | and please write to me.
02:59:21.800 | Thank you and have a nice day.