Coding a Transformer from scratch in PyTorch, with full explanation, training and inference.
Chapters
0:00 Introduction
1:20 Input Embeddings
4:56 Positional Encodings
13:30 Layer Normalization
18:12 Feed Forward
21:43 Multi-Head Attention
42:41 Residual Connection
44:50 Encoder
51:52 Decoder
59:20 Linear Layer
1:01:25 Transformer
1:17:00 Task overview
1:18:42 Tokenizer
1:31:35 Dataset
1:55:25 Training loop
2:20:05 Validation loop
2:41:30 Attention visualization
00:00:00.000 |
Hello guys, welcome to another episode about the transformer. In this episode we will be building the transformer from scratch using PyTorch 00:00:07.880 |
so coding it from zero. We will be building the model and we will also build the code for training it for inferencing and for visualizing the attention scores 00:00:16.780 |
stick with me because it's gonna be a long video but I assure you that by the end of the video you will have a deep knowledge of the transformer model 00:00:24.800 |
not only from a conceptual point of view but also from a practical point of view 00:00:29.120 |
we will be building a translation model which means that our model will be able to translate from one language to another 00:00:36.300 |
I chose a dataset called Opus Books; it is a collection of sentences taken from famous books 00:00:43.580 |
I chose English to Italian because I'm Italian, so I can tell if the translation is good or not 00:00:51.520 |
but I will show you at which point you can change the language, so you can test the same model with the language of your choice 00:00:58.800 |
let's get started! Let's open the IDE of our choice, in my case I really love Visual Studio Code 00:01:05.100 |
and let's create our first file which is the model of the transformer 00:01:10.100 |
okay, let's go have a look at the transformer model first so we know which part we are going to build first 00:01:21.100 |
the first part that we will be building is the input embeddings 00:01:24.600 |
as you can see, the input embeddings take the input and convert it into an embedding 00:01:30.600 |
as you remember from my previous video, the input embeddings allow us to convert the original sentence into a vector of 512 dimensions 00:01:40.100 |
for example in this sentence "your cat is a lovely cat" first we convert the sentence into a list of input IDs 00:01:48.300 |
that is numbers that correspond to the position of each word inside the vocabulary 00:01:53.900 |
and then each of these numbers corresponds to an embedding, which is a vector of size 512 00:02:02.200 |
the first thing we need to do is to import Torch 00:02:15.500 |
this is the constructor we will need to tell him what is the dimension of the model 00:02:28.500 |
so the dimension of the vector in the paper this is called D model 00:02:33.500 |
and we also need to tell him what is the vocabulary size 00:02:38.500 |
so how many words there are in the vocabulary 00:03:00.300 |
save these two values and now we can create the actual embedding 00:03:04.300 |
actually PyTorch already provides us with a layer that does exactly what we want to do 00:03:10.300 |
that is given a number it will provide you with the same vector every time 00:03:17.300 |
it's just a mapping between numbers and a vector of size 512 00:03:25.300 |
so this is done by the embedding layer, nn.Embedding 00:03:35.300 |
let me check why my autocomplete is not working 00:03:46.300 |
okay so now let's implement the forward method 00:03:54.300 |
what we do in the embedding is that we just use the embedding layer provided by PyTorch to do this mapping 00:04:06.300 |
now actually there is a little detail that is written on the paper 00:04:09.300 |
that is let's have a look at the paper actually 00:04:11.300 |
let's go here and if we check the embedding and softmax 00:04:14.300 |
we will see that in this sentence in the embedding layer 00:04:17.300 |
we multiply the weights of the embedding by square root of D model 00:04:21.300 |
so what the authors do they take the embedding given by this embedding layer 00:04:28.300 |
which I remind you is just a dictionary kind of layer 00:04:31.300 |
that just maps numbers to the same vector every time 00:04:38.300 |
so we just multiply this by math.sqrt of D model 00:04:56.300 |
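Here is a minimal sketch of this module put together (class and attribute names are my own choice, following the paper's notation):

```python
import math
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # nn.Embedding is a lookup table: the same id always maps to the same (learned) vector
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # (batch, seq_len) -> (batch, seq_len, d_model), scaled by sqrt(d_model) as in the paper
        return self.embedding(x) * math.sqrt(self.d_model)
```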
the next module we are going to build is the positional encoding 00:04:59.300 |
let's have also a look at what are the positional encoding very fast 00:05:03.300 |
so we saw before that our original sentence gets mapped to a list of vectors 00:05:14.300 |
now we want to convey to the model the information about the position of each word inside the sentence 00:05:21.300 |
and this is done by adding another vector of the same size as the embedding 00:05:27.300 |
that includes some special values given by a formula that I will show later 00:05:31.300 |
that tells the model that this particular word occupies this position in the sentence 00:05:36.300 |
so we will create these vectors called the position embedding 00:05:43.300 |
okay let's define the class positional encoding 00:05:55.300 |
okay what we need to give to the constructor is for sure the D model 00:06:00.300 |
because this is the size of the vector that the positional encoding should be 00:06:04.300 |
and the sequence length this is the maximum length of the sentence 00:06:09.300 |
and because we need to create one vector for each position 00:06:37.300 |
okay let's actually build a positional encoding 00:06:41.300 |
okay first of all the positional encoding is a 00:06:44.300 |
we will build a matrix of shape sequence length to D model 00:06:49.300 |
because we need vectors of D model size so 512 00:06:57.300 |
because the maximum length of the sentence is sequence length 00:07:15.300 |
okay before we create the matrix and we know how to create the matrix 00:07:20.300 |
let's have a look at the formula used to create the positional encoding 00:07:23.300 |
so let's go have a look at the formula used to create the positional encoding 00:07:30.300 |
and let's have a look at how to build the vectors 00:07:33.300 |
so as you remember we have a sentence let's say in this case we have three words 00:07:36.300 |
we use these two formulas taken from the paper 00:07:43.300 |
and one for each possible position so up to sequence length 00:07:48.300 |
and in the even positions we apply the first formula 00:07:52.300 |
in the odd positions of the vector we apply the second formula 00:07:56.300 |
in this case I will actually simplify the calculation 00:08:00.300 |
because I saw online it has been simplified also 00:08:03.300 |
so we will do a slightly modified calculation using log space 00:08:10.300 |
so when you apply the exponential and then the log of something inside the exponential 00:08:15.300 |
the result is the same number but it's more numerically stable 00:08:22.300 |
that will represent the position of the word inside the sentence 00:08:26.300 |
and this vector can go from 0 to sequence length -1 00:08:53.300 |
so actually we are creating a tensor of shape sequence length to 1 00:09:10.300 |
okay now we create the denominator of the formula 00:09:35.300 |
and these are the two terms we see inside the formula 00:09:40.300 |
so the first tensor that we build that's called position 00:09:44.300 |
it's this pos here, and the second tensor that we build is the denominator here 00:09:48.300 |
but we calculated it in log space for numerical stability 00:09:52.300 |
the value actually will be slightly different but the result will be the same 00:09:56.300 |
the model will learn this positional encoding 00:09:58.300 |
don't worry if you don't fully understand this part 00:10:01.300 |
it's just very special let's say functions that convey this positional information to the model 00:10:07.300 |
and if you watched my previous video you will also understand why 00:10:11.300 |
now we apply this position and this denominator inside the sine and the cosine 00:10:15.300 |
as you remember the sine is only used for the even positions 00:10:32.300 |
so every position will have the sine but only 00:10:37.300 |
so every word will have the sine but only the even dimensions 00:10:41.300 |
so starting from 0 up to the end and going forward by 2 means 00:10:46.300 |
every from 0 then the number 2 then the number 4 etc etc 00:11:06.300 |
in this case we start from 1 and go forward by 2 00:11:18.300 |
and then we need to add the batch dimension to this tensor 00:11:22.300 |
so that we can apply it to the whole sentences 00:11:27.300 |
because now the shape is sequence length by d_model 00:11:31.300 |
so what we do is we add a new dimension to this PE 00:11:41.300 |
so it will become a tensor of shape 1 by sequence length by d_model 00:11:48.300 |
and finally we can register this tensor in the buffer of this module 00:12:02.300 |
so basically when you have a tensor that you want to keep inside the module 00:12:09.300 |
but you want it to be saved when you save the file of the module 00:12:15.300 |
this way the tensor will be saved in the file along with the state of the module 00:12:28.300 |
we need to add this positional encoding to every word inside the sentence 00:12:37.300 |
plus the positional encoding for this particular sentence 00:12:51.300 |
and we also tell the module that we don't want to learn this positional encoding 00:13:00.300 |
they are not learned along the training process 00:13:06.300 |
this will make this particular tensor not learned 00:13:16.300 |
and that's it, this is the positional encoding 00:13:20.300 |
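A sketch of the whole module as just described; the dropout applied to the sum follows the paper's regularization of embeddings plus positional encodings, and the exact variable names are my own:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        pe = torch.zeros(seq_len, d_model)                                   # (seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
        # the denominator term, computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)                         # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                         # odd dimensions
        pe = pe.unsqueeze(0)                                                 # (1, seq_len, d_model)
        self.register_buffer('pe', pe)                                       # saved with the module, but not a parameter

    def forward(self, x):
        # add the (fixed, non-learned) positional encoding to every word of the sentence
        x = x + self.pe[:, :x.shape[1], :]
        return self.dropout(x)
```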
first we will build the encoder part of the transformer 00:13:26.300 |
and we still have to build the multi-head attention, the add & norm and the feed forward 00:13:31.300 |
and actually there is another component, the residual (skip) connection, which connects all these sublayers 00:13:47.300 |
layer normalization basically means that if you have a batch of n items 00:13:57.300 |
and each sentence is made up of many words, each represented by numbers 00:14:03.300 |
and layer normalization means that for each item in this batch we calculate a mean and a variance 00:14:09.300 |
independently from the other items of the batch 00:14:12.300 |
and then we calculate the new values for each of them using their own mean and their own variance 00:14:18.300 |
in the layer normalization usually we also introduce some parameters 00:14:30.300 |
one is multiplicative, so it's multiplied by each of these x 00:14:33.300 |
and one is additive, so it's added to each one of these x 00:14:38.300 |
because we want the model to have the possibility to amplify these values 00:14:46.300 |
so the model will learn to multiply this gamma by these values 00:14:51.300 |
in such a way to amplify the values that it wants to be amplified 00:14:54.300 |
ok, let's go to build the code for this layer 00:15:09.300 |
in this case we don't need any parameter except for one 00:15:21.300 |
which is a very small number that you need to give to the model 00:15:24.300 |
and I will also show you why we need this number 00:15:33.300 |
ok, this epsilon is needed because if we look at the slide 00:15:37.300 |
we have this epsilon here in the denominator of this formula here 00:15:45.300 |
divided by the square root of sigma square plus epsilon 00:16:01.300 |
as we know that the CPU or the GPU can only represent numbers 00:16:08.300 |
so we don't want very big numbers or very small numbers 00:16:11.300 |
so usually for numerical stability we use this epsilon 00:17:04.300 |
we need to calculate the mean and the standard deviation 00:17:27.300 |
usually the mean cancels the dimension to which it is applied 00:17:33.300 |
and then we just apply the formula that we saw on the slide 00:18:09.300 |
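A sketch of this layer with the multiplicative (gamma/alpha) and additive (beta/bias) parameters and the epsilon in the denominator; note that adding epsilon to the standard deviation instead of putting it inside the square root is a small simplification with the same stabilizing effect:

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                               # keeps the denominator away from zero
        self.alpha = nn.Parameter(torch.ones(1))     # multiplicative parameter (gamma)
        self.bias = nn.Parameter(torch.zeros(1))     # additive parameter (beta)

    def forward(self, x):
        # mean and std over the last dimension, computed per item independently of the batch
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias
```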
okay let's go have a look at the next layer we are going to build 00:18:12.300 |
the next layer we are going to build is the feed forward 00:18:16.300 |
and the feed forward is basically a fully connected layer 00:18:21.300 |
that the model uses both in the encoder and in the decoder 00:18:28.300 |
what are the details of this feed forward layer 00:18:32.300 |
the feed forward layer is basically two matrices 00:18:39.300 |
one after another with a relu in between and with a bias 00:18:43.300 |
we can do this in PyTorch using a linear layer 00:18:58.300 |
in the paper we can also see the dimensions of these matrices 00:19:24.300 |
and in the constructor we need to define these two values 00:20:04.300 |
and then we define the second matrix w2 and b2 00:20:29.300 |
because actually as you can see here bias is by default it's true 00:20:34.300 |
so it's already defining a bias matrix for us 00:20:51.300 |
the input is a tensor with dimensions (batch, sequence length, d_model) 00:21:00.300 |
which we convert into another tensor of (batch, sequence length, d_ff) 00:21:07.300 |
because if we apply this linear it will convert the d model into dff 00:21:11.300 |
and then we apply the linear 2 which will convert it back to d model 00:21:42.300 |
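A sketch of this block: two linear layers (both with bias, since bias=True is the default) with a ReLU in between; the dropout between them is an assumption consistent with the rest of the model:

```python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)     # W1 and b1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)     # W2 and b2

    def forward(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
```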
our next block is the most important and most interesting one 00:21:48.300 |
we saw briefly in the last video how the multi-head attention works 00:21:54.300 |
so I will open now the slide again to show to rehearse how it actually works 00:22:03.300 |
as you remember in the encoder we have the multi-head attention 00:22:06.300 |
that takes the input of the encoder and uses it three times 00:22:11.300 |
one time it's called query, one time it's called key and one time it's called values 00:22:16.300 |
you can also think it like a duplication of the input three times 00:22:20.300 |
or you can just say that it's the same input applied three times 00:22:23.300 |
and the multi-head attention basically works like this 00:22:26.300 |
we have our input sequence which is sequence length by d model 00:22:35.300 |
which are exactly the same as the input in this case 00:22:40.300 |
you see that in the decoder it's a slightly different 00:22:42.300 |
and then we multiply this by matrices called W_q, W_k and W_v 00:22:49.300 |
and this results in a new matrix of dimension sequence by d model 00:22:54.300 |
we then split these matrices into h matrices, smaller matrices 00:22:59.300 |
why h? because it's the number of head we want for this multi-head attention 00:23:03.300 |
and we split these matrices along the embedding dimension 00:23:08.300 |
which means that each head we will have access to the full sentence 00:23:12.300 |
but a different part of the embedding of each word 00:23:16.300 |
we apply the attention to each of these smaller matrices using this formula 00:23:21.300 |
which will give us smaller matrices as a result 00:23:26.300 |
so we concatenate them back just like the paper says 00:23:33.300 |
and finally we multiply it by W_o to get the multi-head attention output 00:23:38.300 |
which again is a matrix that has the same dimension as the input matrix 00:23:43.300 |
as you can see, the output of the multi-head attention is also (sequence length, d_model) 00:23:48.300 |
in this slide actually I didn't show the batch dimension 00:23:55.300 |
we don't work only with one sentence but with multiple sentences 00:23:58.300 |
so we need to think that we have another dimension here which is the batch 00:24:03.300 |
okay let's go to code this multi-head attention 00:24:09.300 |
so we can see in detail everything how it's done 00:24:13.300 |
but I really wanted you to have an overview again of how it works 00:24:33.300 |
and what we need to give to this multi-head attention as parameter 00:24:37.300 |
for sure the d model of the model which is in our case 512 00:24:42.300 |
the number of heads which we call h just like in the paper 00:24:46.300 |
so h indicates the number of heads we want and then the dropout value 00:25:00.300 |
as you can see we need to divide this embedding vector into h heads 00:25:05.300 |
which means that this d model should be divisible by h 00:25:08.300 |
otherwise we cannot divide equally the same vector 00:25:13.300 |
representing the embedding into equal matrices for each head 00:25:17.300 |
so we make sure that d model is divisible by h basically 00:25:35.300 |
if we watch again my slide we can see that the value d model divided by h is called dk 00:25:41.300 |
as we can see here if we divide the d model by h heads 00:25:50.300 |
and to be aligned with the nomenclature used in the paper 00:26:08.300 |
okay let's also define the matrices by which we will multiply the query the key and the values 00:26:21.300 |
this again is a linear so from d model to d model 00:26:25.300 |
why from d model to d model because as you can see from my slides 00:26:33.300 |
so that the output will be (sequence length, d_model) 00:27:03.300 |
finally we also have the output matrix, which is called W_o here 00:27:21.300 |
because this head is actually the result this head comes from this multiplication 00:27:36.300 |
so our W_o is also a matrix that is d_model by d_model 00:28:04.300 |
and let's see how the multi head attention works in detail during the coding process 00:28:20.300 |
the mask is basically if we want some words to not interact with other words 00:28:28.300 |
and we saw in my previous video but now let's go back to those slides 00:28:33.300 |
as you remember when we calculate the attention 00:29:00.300 |
and if we don't want some words to interact with other words 00:29:15.300 |
because as you remember the softmax on the numerator has e to the power of x 00:29:28.300 |
so basically we hide the attention for those two words 00:30:31.300 |
which has the same dimension as the initial matrix 00:31:20.300 |
we want to split it into two smaller dimensions 00:32:51.300 |
we do the same thing for the key and the value 00:34:06.300 |
let's create a function to calculate the attention 00:34:50.300 |
it's the last dimension of the query, key, and the value 00:35:02.300 |
so that you can understand how we will use it 00:35:21.300 |
so we give it the query, the key, the value, the mask 00:35:36.300 |
that is the query multiplied by the transpose of the key 00:35:52.300 |
so this @ sign means matrix multiplication in PyTorch 00:36:06.300 |
the last two dimensions are sequence length by d_k 00:36:24.300 |
so we want to hide some interaction between words 00:36:30.300 |
so the softmax will take care of the values that we replaced 00:37:04.300 |
replace all the values for which this statement is true 00:37:18.300 |
later we will see also how we will build the mask 00:37:23.300 |
that these are all the values that we don't want 00:37:39.300 |
because they are just filler words to reach the sequence length 00:37:49.300 |
which is a very big number in the negative range 00:39:22.300 |
given by the model for that particular interaction 00:40:19.300 |
concat, just like the formula says from the paper 00:40:35.300 |
before we transformed the matrix into sequence length 00:40:39.300 |
we had the sequence length as the third dimension 00:40:48.300 |
we want the sequence length to be in the second position 00:42:55.300 |
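Putting the whole block together, here is a sketch of the multi-head attention as just described: project with W_q, W_k, W_v, split into h heads along the embedding dimension, apply the scaled dot-product attention with the optional mask, concatenate the heads back and project with W_o (class and variable names are my own, following the paper):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)   # Wq
        self.w_k = nn.Linear(d_model, d_model)   # Wk
        self.w_v = nn.Linear(d_model, d_model)   # Wv
        self.w_o = nn.Linear(d_model, d_model)   # Wo
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        # (batch, h, seq_len, d_k) @ (batch, h, d_k, seq_len) -> (batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # hide the interactions we don't want (padding, future words) before the softmax
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        query, key, value = self.w_q(q), self.w_k(k), self.w_v(v)   # (batch, seq_len, d_model)
        # split d_model into h heads of size d_k: (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        # concatenate the heads back: (batch, h, seq_len, d_k) -> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
        return self.w_o(x)
```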
then the output of this is sent to the addNorm 00:45:11.300 |
and the output of the last one is sent to the decoder 00:47:45.300 |
to check the slide so we can understand what we 00:48:37.300 |
self-attention, because the role of the query 00:48:53.300 |
in the decoder it's different because we have 00:51:33.300 |
that are the skip connection, the skip connection 00:52:11.300 |
the positional encodings, we can use the same 00:52:27.300 |
multihead attention with another skip connection 00:54:55.300 |
also the residual connection, in this case we have 00:55:19.300 |
the forward method which is very similar to the 00:56:01.300 |
two masks, one is the one coming from the encoder 00:56:23.300 |
and just like before we calculate the self-attention 00:56:39.300 |
because this is the self-attention block of the 00:56:47.300 |
combine, we need to calculate the cross-attention 00:57:53.300 |
also in this case we will provide with many layers 01:01:57.300 |
and a target embedding, because we are dealing with 01:06:27.300 |
in this case we are talking about translation 01:07:21.300 |
what is the source sequence length and the target sequence length 01:07:43.300 |
that is dealing with two very different languages 01:07:51.300 |
are much higher or much lower than the other ones 01:08:05.300 |
because we want to keep the same values as the paper 01:11:23.300 |
and finally we tell him how much is the dropout 01:12:01.860 |
We also have the cross attention for the decoder block. 01:12:16.880 |
We also have the feedforward, just like the encoder. 01:12:34.280 |
Then we define the decoder block itself, which is decoder block, cross attention and finally 01:13:06.600 |
We now can create the encoder and the decoder. 01:13:24.760 |
We give him all his blocks, which are n and then also the decoder. 01:13:37.560 |
And we create the projection layer, which will convert d_model into the vocabulary size. 01:13:50.920 |
Of course the target, because we want to take from the source language to the target language. 01:13:55.460 |
So we want to project our output into the target vocabulary. 01:14:13.560 |
An encoder, a decoder, source embedding, target embedding, then source positional encoding, 01:14:29.520 |
target positional encoding, and finally the projection layer. 01:14:39.640 |
Now we can just initialize the parameters using the Xavier uniform. 01:14:45.640 |
This is a way to initialize the parameters to make the training faster so they don't 01:14:56.920 |
I saw many implementations using Xavier, so I think it's a quite good start for the model 01:15:26.440 |
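A sketch of that initialization, applied at the end of the build function to every parameter with more than one dimension (a common convention, assumed here; transformer is the instance being assembled):

```python
import torch.nn as nn

# initialize the weight matrices so the training converges faster
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
```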
And now that we have built the model, we will go further to use it. 01:15:30.040 |
So we will first have a look at the dataset, then we will build the training loop. 01:15:37.800 |
After the training loop, we will also build the inferencing part and the code for visualizing 01:15:46.240 |
So hold on and take some coffee, take some tea, because it's going to be a little long, 01:15:54.960 |
Now that we have built the code for the model, our next step is to build the training code. 01:16:01.060 |
But before we do that, let's recheck the code, because we may have some typos. 01:16:18.280 |
So there is a typo in how we wrote "feedForward" here. 01:16:23.920 |
And so the same problem is also present in every reference to "feedForward". 01:16:29.520 |
And also here, when we are building the decoder block. 01:16:34.360 |
And the other problem is that here, when we build the decoder block, we just wrote "nn.module". 01:16:42.680 |
And then the "feedForward" should be also fixed here and here in the buildTransformer 01:16:49.000 |
Now, I can delete the old one, so we don't need it anymore. 01:17:04.840 |
But before we build the training code, we have to look at the data. 01:17:10.780 |
So as I said before, we are dealing with a translation task. 01:17:14.120 |
And I have chosen this dataset called "opus_books", which we can find on HuggingFace. 01:17:19.620 |
And we will also use the library from HuggingFace to download this dataset for us. 01:17:24.280 |
And this is the only library we will be using beside PyTorch. 01:17:28.200 |
Because of course we cannot reinvent the dataset by ourselves, so we will use this dataset. 01:17:33.880 |
And we will also use the HuggingFace tokenizer library to transform this text into vocabulary. 01:17:42.200 |
Because our goal is to build the transformer, so not to reinvent the wheel about everything. 01:17:48.040 |
So we will be only focusing on building and training the transformer. 01:17:52.800 |
And in my particular case, I will be using the subset "English to Italian", but we will 01:17:57.440 |
build the code in such a way that you can choose the language and the code will act 01:18:03.680 |
If we look at the data, we can see that each data item is a pair of sentences in English 01:18:12.480 |
For example, there was no possibility of taking a walk that day, which in Italian means "In 01:18:20.000 |
So we will train our transformer to translate from the source language, which is English, 01:18:32.240 |
So first we will make the code to download this dataset and to create the tokenizer. 01:18:40.080 |
Let's go back to the slides to just have a brief overview of what we are going to do 01:18:45.440 |
The tokenizer is what comes before the input embeddings. 01:18:51.440 |
So for example, "Your cat is a lovely cat", but this sentence will come from our dataset. 01:18:56.280 |
The goal of the tokenizer is to create this token. 01:18:59.680 |
So split this sentence into single words, which has many strategies. 01:19:04.480 |
As you can see here, we have a sentence, which is "Your cat is a lovely cat". 01:19:09.180 |
And the goal of the tokenizer is to split this sentence into single words, which can 01:19:16.320 |
There is the BPE tokenizer, there is the word-level tokenizer, there is the sub-word-level, word-part 01:19:24.220 |
The one we will be using is the simplest one called the word-level tokenizer. 01:19:27.980 |
So the word-level tokenizer basically will split this sentence, let's say by space. 01:19:32.420 |
So each space defines the boundary of a word, and so into the single words, and each word 01:19:41.420 |
So this is the job of the tokenizer, to build the vocabulary of these numbers and to map 01:19:51.100 |
When we build the tokenizer, we can also create special tokens, which we will use for the 01:19:56.620 |
For example, the tokens called padding, the token called the start-of-sentence, end-of-sentence, 01:20:02.540 |
which are necessary for training the transformer. 01:20:07.540 |
So let's build first the code for building the tokenizer and to download the dataset. 01:20:27.380 |
And we also, because we are using a library from HuggingFace, we also need to import these 01:20:34.940 |
We will be using the datasets library, which you can install using pip. 01:20:50.060 |
And we will also be using the tokenizers library also from HuggingFace, which you can install 01:21:02.600 |
We also need to import the tokenizer model that we need, so we will use the word-level tokenizer. 01:21:19.800 |
And there is also the trainers, so the tokenizer, the class that will train the tokenizer. 01:21:27.520 |
So that will create the vocabulary given the list of sentences. 01:21:40.800 |
And we will split the word according to the white space. 01:21:47.460 |
So I will build first the methods to create the tokenizer, and I will describe each parameter. 01:21:55.240 |
For now, you will not have the bigger picture, but later when we combine all these methods 01:22:01.840 |
So let's first make the method that builds the tokenizer. 01:22:13.380 |
And this method takes the configuration, which is the configuration of our model. 01:22:18.760 |
The dataset and the language for which we are going to build the tokenizer. 01:22:25.920 |
We define the tokenizer path, so the file where we will save this tokenizer. 01:22:46.800 |
First of all, this path is coming from the pathlib, so from pathlib. 01:22:53.580 |
This is a library that allows you to create absolute path given relative paths. 01:22:58.420 |
And we pretend that we have a configuration called the tokenizer file, which is the path 01:23:05.360 |
And this path is formattable using the language. 01:23:08.280 |
So for example, we can have something like this, for example, something like this. 01:23:23.360 |
And this will be, given the language, it will create a tokenizer English or tokenizer Italian, 01:23:32.140 |
So if the tokenizer doesn't exist, we create it. 01:23:45.200 |
I took all this code actually from HuggingFace. 01:23:49.580 |
I just took it from the quick tour of their tokenizers library. 01:23:54.500 |
And it's really easy to use it, and saves you a lot of time. 01:23:58.680 |
Because building a tokenizer from scratch would really be reinventing the wheel. 01:24:11.060 |
And we will also introduce the unknown word, unknown. 01:24:16.800 |
If our tokenizer sees a word that it doesn't recognize in its vocabulary, it will replace 01:24:24.120 |
It will map it to the number corresponding to this word, unknown. 01:24:31.580 |
The pre-tokenizer means basically that we split by whitespace. 01:24:36.320 |
And then we train, we build the trainer to train our tokenizer. 01:25:24.340 |
So it will split words using the whitespace and using the single words. 01:25:31.340 |
One is unknown, which means that if you cannot find that particular word in the vocabulary, 01:25:39.700 |
It will also have the padding, which we will use to train the transformer, the start of 01:25:45.180 |
sentence and the end of sentence special tokens. 01:25:47.820 |
Min frequency means that for a word to appear in our vocabulary, it has to have at least that number of occurrences. 01:26:04.020 |
We use this method, which means we build first a method that gives all the sentences from 01:26:38.220 |
Okay, so let's build also this method called getAllSentence so that we can iterate through 01:26:59.140 |
the data set to get all the sentences corresponding to the particular language for which we are 01:27:17.200 |
As you remember, each item in the data set, it's a pair of sentences, one in English, 01:27:22.680 |
We just want to extract one particular language. 01:27:35.400 |
And from this pair, we extract only the one language that we want. 01:27:43.520 |
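Here is a sketch of these two helpers, the sentence generator and the tokenizer builder, using the HuggingFace tokenizers classes mentioned above; the function names, the config key tokenizer_file and the min_frequency value of 2 are my own assumptions:

```python
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

def get_all_sentences(ds, lang):
    # each item is a pair of translations; yield only the language we want
    for item in ds:
        yield item['translation'][lang]

def get_or_build_tokenizer(config, ds, lang):
    # e.g. config['tokenizer_file'] = "tokenizer_{0}.json" -> tokenizer_en.json, tokenizer_it.json
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not tokenizer_path.exists():
        tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))        # unknown words map to [UNK]
        tokenizer.pre_tokenizer = Whitespace()                     # split by whitespace
        trainer = WordLevelTrainer(
            special_tokens=['[UNK]', '[PAD]', '[SOS]', '[EOS]'],   # unknown, padding, start/end of sentence
            min_frequency=2)                                       # keep words appearing at least twice
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer
```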
Now let's write the code to load the data set and then to build the tokenizer. 01:27:49.360 |
We will call this method getDataset and which also takes the configuration of the model, 01:28:04.800 |
Okay, HuggingFace allows us to download its data sets very easily. 01:28:12.680 |
We just need to tell him what is the name of the data set. 01:28:17.920 |
And then tell him what is the subset we want. 01:28:20.920 |
We want the subset that is English to Italian, but we want to also make it configurable for 01:28:41.040 |
We will have two parameters in the configuration. 01:28:44.140 |
One is called languageSource and one is called languageTarget. 01:28:57.040 |
Later we can also define what split we want of this data set. 01:29:00.960 |
In our case, there is only the training split in the original data set from HuggingFace, 01:29:07.640 |
but we will split by ourself into the validation and the training data. 01:29:27.080 |
This is the raw data set and we also have the target. 01:29:46.240 |
Okay, now, because we only have the training split from HuggingFace, we can split it by 01:29:52.760 |
by ourself into a training and the validation. 01:29:55.520 |
We keep 90% of the data for training and 10% for validation. 01:30:45.880 |
The method random_split, a method from PyTorch, allows us to split a data set 01:31:02.520 |
So in this case, it means split this data set into this two smaller data set, one of 01:31:23.280 |
Let's also import the ones that we will need later: DataLoader and random_split. 01:31:39.100 |
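A sketch of the loading-and-splitting part just described; the config keys lang_src and lang_tgt stand in for the "language source" and "language target" parameters mentioned above and are my own naming:

```python
from datasets import load_dataset
from torch.utils.data import random_split

def get_ds(config):
    # only a 'train' split exists for opus_books; the subset is e.g. "en-it"
    ds_raw = load_dataset('opus_books', f"{config['lang_src']}-{config['lang_tgt']}", split='train')

    # one tokenizer per language
    tokenizer_src = get_or_build_tokenizer(config, ds_raw, config['lang_src'])
    tokenizer_tgt = get_or_build_tokenizer(config, ds_raw, config['lang_tgt'])

    # keep 90% of the data for training and 10% for validation
    train_ds_size = int(0.9 * len(ds_raw))
    val_ds_size = len(ds_raw) - train_ds_size
    train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])
    # ... next these are wrapped into the bilingual dataset and the data loaders
    return train_ds_raw, val_ds_raw, tokenizer_src, tokenizer_tgt
```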
The data set that our model will use to access the tensors directly, because now we just 01:31:44.560 |
created the tokenizer and we just loaded the data, but we need to create the tensors that 01:31:54.080 |
Let's call it bilingual data set and for that we create a new file. 01:32:31.440 |
We will call the data set, we will call it bilingual data set. 01:32:41.620 |
Okay as usual we define the constructor and in this constructor we need to give him the 01:32:49.920 |
data set downloaded from HuggingFace, the tokenizer of the source language, the tokenizer 01:32:55.880 |
of the target language, the source language, the name of the source language, the name 01:33:00.920 |
of the target language and the sequence length that we will use. 01:33:37.320 |
We can also save the tokens, the particular tokens that we will use to create the tensors 01:33:46.840 |
So we need the start of sentence, end of sentence and the padding token. 01:33:50.640 |
So how do we convert the token start of sentence into a number, into the input ID? 01:33:57.840 |
There is a special method of the tokenizer to do that, so let's do it. 01:34:02.920 |
So this is the start of sentence token, we want to build it into a tensor. 01:34:10.680 |
This tensor will contain only one number which is given by, we can use this tokenizer from 01:34:17.480 |
the source or the target, it doesn't matter because they both contain these particular 01:34:24.640 |
This is the method to convert the token into a number, so start of sentence and the type 01:34:33.540 |
of this token, of this tensor is, we want it long because the vocabulary can be more 01:34:46.600 |
than 32-bit long, the vocabulary size, so we usually use the long 64-bit. 01:34:55.160 |
And we do the same for the end of sentence and the padding token. 01:35:23.100 |
We also need to define the length method of this dataset, which tells the length of the 01:35:33.260 |
dataset itself, so basically just the length of the dataset from hugging face, and then 01:35:55.540 |
First of all we will extract the original pair from the hugging face dataset, then we 01:36:38.500 |
And finally we convert each text into tokens, and then into input IDs. 01:36:48.220 |
We will first, the tokenizer will first split the sentence into single words, and then will 01:36:54.020 |
map each word into its corresponding number in the vocabulary, and it will do it in one 01:36:58.700 |
pass only; this is done by the encode method, and .ids gives us the input IDs, so the numbers 01:37:18.600 |
corresponding to each word in the original sentence, and it will be given as an array. 01:37:36.260 |
Now as you remember, we also need to pad the sentence to reach the sequence length. 01:37:43.420 |
This is really important because we want our model to always work, I mean the model always 01:37:49.220 |
works with a fixed length, sequence length, but we don't have enough words in every sentence, 01:37:54.960 |
so we use the padding token, so this PAD here, as the padding token to fill the sentence 01:38:04.020 |
So we calculate how many padding tokens we need to add for the encoder side and for the 01:38:07.940 |
decoder side, which is basically how many we need to reach the sequence length. 01:38:24.260 |
So we already have this amount of tokens, we need to reach this one, but we will add 01:38:28.780 |
also the start of sentence token and the end of sentence token to the encoder side, so 01:38:50.900 |
If you remember my previous video, when we do the training, we add only the start of 01:38:57.420 |
sentence token to the decoder side, and then in the label we only add the end of sentence 01:39:04.660 |
So in this case we only need to add one token, special token to the sentence. 01:39:09.160 |
We also make sure that this sequence length that we have chosen is enough to represent 01:39:15.500 |
all the sentences in our dataset, and if we chose too small one, we want to raise an exception. 01:39:25.500 |
So basically this number of padding tokens should never become negative. 01:39:42.340 |
Okay, now let's build the two tensors for the encoder input and for the decoder input, 01:40:02.820 |
So one sentence will be sent to the input of the encoder, one sentence will be sent 01:40:08.000 |
to the input of the decoder, and one sentence is the one that we expect as the output of 01:40:28.360 |
We concatenate tensors with torch.cat; okay, for the encoder input we concatenate the following. 01:40:35.460 |
First is the start of sentence token, then the tokens of the source text, then the end 01:40:56.660 |
of sentence token, and then enough padding tokens to reach the sequence length. 01:41:05.260 |
We already calculated how many padding tokens we need to add to this sentence, so let's 01:41:36.160 |
And this is the encoder input, so let me write some comment here. 01:41:50.780 |
Then we build the decoder input, which is also a concatenation of tokens. 01:42:01.680 |
In this case we don't have the end of sentence token, we just have the start of sentence. 01:42:26.640 |
And finally we add enough padding tokens to reach the sequence length. 01:42:33.560 |
We already calculated how many we need, just use this value now. 01:43:00.120 |
In the label we only add the end of sentence token. 01:43:16.400 |
Because we need the same number of padding tokens as for the decoder input. 01:43:25.480 |
Just for debugging, let's double check that we actually reach the sequence length. 01:43:37.160 |
Ok, now that we have made this check, let me also write some comments here. 01:44:01.240 |
Here we are only adding SOS to the decoder input. 01:44:09.680 |
And here is add EOS to the label, what we expect as output from the decoder. 01:44:25.280 |
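Here is a sketch of those three concatenations inside __getitem__; enc_input_tokens / dec_input_tokens are the token id lists produced by the tokenizers, the two padding counts were computed just above, and self.pad_token_id (an integer id) plus the SOS/EOS one-element tensors are assumed to have been prepared in the constructor:

```python
import torch

# encoder input: SOS + source tokens + EOS + padding
encoder_input = torch.cat([
    self.sos_token,
    torch.tensor(enc_input_tokens, dtype=torch.int64),
    self.eos_token,
    torch.tensor([self.pad_token_id] * enc_num_padding_tokens, dtype=torch.int64),
])

# decoder input: SOS + target tokens + padding (no EOS here)
decoder_input = torch.cat([
    self.sos_token,
    torch.tensor(dec_input_tokens, dtype=torch.int64),
    torch.tensor([self.pad_token_id] * dec_num_padding_tokens, dtype=torch.int64),
])

# label: target tokens + EOS + padding (what we expect the decoder to produce)
label = torch.cat([
    torch.tensor(dec_input_tokens, dtype=torch.int64),
    self.eos_token,
    torch.tensor([self.pad_token_id] * dec_num_padding_tokens, dtype=torch.int64),
])

# double check that we really reached the sequence length
assert encoder_input.size(0) == self.seq_len
assert decoder_input.size(0) == self.seq_len
assert label.size(0) == self.seq_len
```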
Now we can return all these tensors so that our training can use them. 01:44:32.360 |
We return a dictionary comprised of encoder input. 01:44:46.880 |
Then we have the decoder input, which is also just a sequence length number of tokens. 01:45:11.240 |
As you remember, we are increasing the size of the encoder input sentence by adding padding 01:45:20.180 |
But we don't want these padding tokens to participate in the self-attention. 01:45:24.820 |
So what we need is to build a mask that says that we don't want these tokens to be seen 01:45:38.400 |
We just say that all the tokens that are not padding are okay. 01:45:44.120 |
All the tokens that are padding are not okay. 01:45:50.680 |
We also unsqueeze to add the sequence dimension and also to add the batch dimension later. 01:46:03.160 |
So this is 1, 1 sequence length, because this will be used in the self-attention mechanism. 01:46:12.200 |
However, for the decoder, we need a special mask that is a causal mask, which means that 01:46:19.640 |
each word can only look at the previous word and each word can only look at non-padding 01:46:28.240 |
So we don't want, again, we don't want the padding tokens to participate in the self-attention. 01:46:32.560 |
We only want real words to participate in this. 01:46:35.680 |
And we also don't want each word to watch at words that come after it, but only that 01:46:45.160 |
So I will use a method here called causal mask that will build it. 01:46:50.600 |
So now I just call it to show you how it's used, and then we will proceed to build it. 01:47:00.760 |
So in this case, we don't want the padding tokens, and we add the necessary dimensions. 01:47:09.480 |
And also we do a Boolean AND with causal mask, which is a method that we will build right 01:47:22.360 |
And this causal mask needs to build a matrix of size sequence length to sequence length. 01:47:27.240 |
What is sequence length is basically the size of our decoder input. 01:47:39.000 |
So this is (1, sequence length), combined with an AND with (1, sequence length, 01:47:50.640 |
sequence length), and this can be broadcasted. 01:48:02.480 |
Causal mask basically means that we want, let's go back to the slides actually, as you 01:48:07.280 |
remember from the slides, we want each word in the decoder to only watch words that come 01:48:13.800 |
So what we want is to make all these values above this diagonal that represents the multiplication, 01:48:20.360 |
this matrix represents the multiplication of the queries by the keys in the self-attention 01:48:28.640 |
So "your" cannot watch the words "cat is a lovely cat". 01:48:33.360 |
It can only watch itself, but this word here, for example, this word lovely can watch everything 01:48:40.040 |
So from your up to lovely itself, but not the word cat that comes after it. 01:48:45.540 |
So what we do is we want all these values here to be masked out. 01:48:51.440 |
So which also means that we want all the values above this diagonal to be masked out. 01:48:56.880 |
And there is a very practical method in PyTorch to do it. 01:49:04.700 |
So the mask is basically torch.triu, which means give me every value that is above the 01:49:15.440 |
So we want a matrix made of all ones. 01:49:23.180 |
And this method will return every value above the diagonal and everything else will become 01:49:30.880 |
So we use diagonal=1, and for the type, we want it to be integer. 01:49:40.000 |
And what we do is return mask is equal to zero. 01:49:43.460 |
So this will return all the values above the diagonal and everything below the diagonal 01:49:51.760 |
So we say, okay, everything that is zero will become true with this expression and 01:49:56.040 |
everything that is not zero will become false. 01:50:03.540 |
So this mask will be 1 by sequence length by sequence length, which is exactly what we need. 01:50:26.000 |
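A sketch of the causal_mask helper and of the two masks as just described: the encoder mask hides only padding and has shape (1, 1, seq_len), while the decoder mask combines the padding mask with the causal mask through a Boolean AND (pad_token_id is the same assumed attribute as before):

```python
import torch

def causal_mask(size):
    # ones above the diagonal, then invert: True on and below the diagonal (allowed positions)
    mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
    return mask == 0   # (1, size, size)

# inside __getitem__, returned together with the tensors above:
encoder_mask = (encoder_input != self.pad_token_id).unsqueeze(0).unsqueeze(0).int()   # (1, 1, seq_len)
decoder_mask = (decoder_input != self.pad_token_id).unsqueeze(0).int() \
               & causal_mask(decoder_input.size(0))                                   # (1, seq_len, seq_len) after broadcasting
```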
The label is also of size sequence length, and then we have the source text just for visualization; we can send it 01:50:44.380 |
Now let's go back to our training method to continue writing the training loop. 01:50:50.620 |
So now that we have the data set, we can create it. 01:50:55.120 |
We can create two data sets, one for training, one for validation, and then we send it to 01:51:01.120 |
a data loader and finally to our training loop. 01:51:18.280 |
We also import the causal mask, which we will need later. 01:51:43.480 |
What is our source language? it's in the configuration. 01:52:09.480 |
But the only difference is that we use this one now and the rest is same. 01:52:15.960 |
We also, just for choosing the max sequence length, we also want to watch what is the 01:52:21.200 |
maximum length of each sentence in the source and the target for each of the two splits 01:52:28.760 |
So that if we choose a very small sequence length, we will know. 01:52:46.840 |
Basically what we do, I load each sentence from each language, from the source and the 01:52:52.640 |
I convert into IDs using the tokenizer and I check the length. 01:52:56.340 |
If the length is, let's say 180, we can choose 200 as sequence length, because it will cover 01:53:02.640 |
all the possible sentences that we have in this data set. 01:53:06.360 |
If it's, let's say 500, we can use 510 or something like this, because we also need 01:53:11.160 |
to add the start of sentence and the end of sentence tokens to these sentences. 01:53:39.420 |
This is the source IDs, then let's create also the target IDs, and this is the language 01:53:50.240 |
And then we just say the source maximum length is the maximum of itself 01:54:02.640 |
and the length of the current sentence; the target is the same, with the target IDs. 01:54:11.760 |
Then we print these two values, we also do it for the target. 01:54:30.440 |
Now we can proceed to create the data loaders. 01:54:42.760 |
We define the batch size according to our configuration, which we still didn't define, 01:54:47.160 |
but you can already guess what are its values. 01:55:07.500 |
For the validation, I will use a batch size of one, because I want to process each sentence 01:55:17.640 |
And this method returns the data loader of the training, the data loader of the validation, 01:55:24.180 |
the tokenizer of the source language and the tokenizer of the target language. 01:55:34.960 |
So let's define a new method called getModel, which will, according to our configuration, 01:55:40.880 |
our vocabulary size, build the model, the transformer model. 01:55:52.260 |
So the model is, we didn't import the model, so let's import it. 01:56:12.280 |
The source vocabulary size and the target vocabulary size. 01:56:25.200 |
And we have the sequence length of the source language and the sequence length of the target 01:56:37.080 |
And then we have the d_model, which is the size of the embedding. 01:56:43.800 |
We can keep all the rest, the default, as in the paper. 01:56:49.640 |
If the model is too big for your GPU to be trained on, you can try to reduce the number 01:56:56.640 |
Of course, it will impact the performance of the model. 01:57:00.640 |
But I think given the dataset, which is not so big and not so complicated, it should not 01:57:06.260 |
be a big problem because we are not building a huge dataset anyway. 01:57:10.640 |
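A sketch of this helper, assuming the build_transformer function from the model file takes the two vocabulary sizes, the two sequence lengths and d_model, with every other hyperparameter left at the paper's defaults:

```python
def get_model(config, vocab_src_len, vocab_tgt_len):
    model = build_transformer(vocab_src_len, vocab_tgt_len,
                              config['seq_len'], config['seq_len'],
                              d_model=config['d_model'])
    return model
```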
OK, now that we have the model, we can start building the training loop. 01:57:15.800 |
But before we build the training loop, let me also define this configuration, because it 01:58:19.960 |
keeps coming and I think it's better to define the structure now. 01:57:26.400 |
So let's create a new file called config.py in which we define two methods. 01:57:32.660 |
One is called getConfig and one is to get the path where we will save the weights 01:57:55.240 |
You can choose something bigger if your computer allows it. 01:57:58.440 |
The number of epochs for which we will be training, I would say 20 is enough. 01:58:03.840 |
The learning rate, I am using 10 to the power of -4. 01:58:16.880 |
It's possible to change the learning rate during training. 01:58:21.320 |
It's quite common to give a very high learning rate and then reduce it gradually with every 01:58:27.360 |
We will not be using it because it will just complicate the code a little more and this 01:58:33.840 |
The goal of this video is to teach how the transformer works. 01:58:41.080 |
I have already checked the sequence length that we need for this particular dataset from 01:58:46.760 |
English to Italian, which is 350 is more than enough. 01:58:50.760 |
And the D model that we will be using is the default of 512. 01:59:07.640 |
We will save the model into the folder called weights. 01:59:16.760 |
And the base file name of the model will be tmodel, so transformer model. 01:59:25.480 |
I also built the code to preload the model in case we want to restart the training after 01:59:47.560 |
So tokenizer_en and tokenizer_it, according to the language. 01:59:52.720 |
And this is the experiment name for TensorBoard on which we will save the losses while training. 02:00:09.080 |
Now let's define another method that allows us to find the part where we need to save 02:00:19.520 |
Why I'm creating such a complicated structure is because I will provide also notebooks to 02:00:29.720 |
So we just need to change these parameters to make it work on Google Colab and save the 02:00:36.560 |
I have already created actually this code and it will be provided on GitHub and I will 02:01:02.240 |
Okay, the file is built according to model base name, then the epoch.pt. 02:01:47.560 |
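A sketch of config.py with the values mentioned in this section; the exact key names are my own choice:

```python
from pathlib import Path

def get_config():
    return {
        "batch_size": 8,
        "num_epochs": 20,
        "lr": 1e-4,
        "seq_len": 350,
        "d_model": 512,
        "lang_src": "en",
        "lang_tgt": "it",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": None,                          # set to an epoch number (as a string) to resume training
        "tokenizer_file": "tokenizer_{0}.json",   # becomes tokenizer_en.json / tokenizer_it.json
        "experiment_name": "runs/tmodel",         # TensorBoard folder for the losses
    }

def get_weights_file_path(config, epoch: str):
    # e.g. weights/tmodel_09.pt
    return str(Path('.') / config['model_folder'] / f"{config['model_basename']}{epoch}.pt")
```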
Okay, now let's go back to our training loop. 02:01:52.880 |
Okay, we can build the training loop now finally. 02:01:59.960 |
Okay, first we need to define which device on which we will put all the tensors. 02:02:44.960 |
We make sure that the weights folder is created. 02:03:28.040 |
To get the vocabulary size, there is a method called get_vocab_size. 02:03:37.720 |
And I think we don't have any other parameter. 02:03:41.160 |
And finally, we transfer the model to our device. 02:03:47.520 |
We also start TensorBoard. TensorBoard allows us to visualize the loss, the graphs, the charts. 02:04:48.000 |
Okay, since we also have the configuration that allow us to resume the training in case 02:05:01.000 |
the model crashes or something crashes, let's implement that one. 02:05:05.480 |
And that will allow us to restore the state of the model and the state of the optimizer. 02:05:31.160 |
Let's import this method we defined in the data set. 02:06:57.480 |
Okay, the loss function we will be using is the cross entropy loss. 02:07:07.880 |
We need to tell him what is the ignore index. 02:07:10.160 |
So we want him to ignore the padding token basically. 02:07:14.080 |
We don't want the padding token to contribute to the loss. 02:07:38.160 |
Label smoothing basically allows our model to be less confident about its decision. 02:07:45.040 |
So how to say, imagine our model is telling us to choose the word number three and with 02:07:53.600 |
So what we will do with label smoothing is take a little percentage of that probability 02:07:57.320 |
and distribute to the other tokens so that our model becomes less sure of its choices. 02:08:04.280 |
So kind of less over fit and this actually improves the accuracy of the model. 02:08:12.020 |
So we will use a label smoothing of 0.1, which means from every highest probability token 02:08:19.240 |
we take 0.1 of its probability mass and distribute it to the other tokens. 02:08:28.240 |
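In code, the loss just described might look like this:

```python
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(
    ignore_index=tokenizer_src.token_to_id('[PAD]'),   # the padding token must not contribute to the loss
    label_smoothing=0.1,                               # spread a little probability mass over the other tokens
).to(device)
```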
Okay let's build finally the training loop, we tell the model to train. 02:08:47.320 |
I build a batch iterator for the data loader using tqdm, which will show a very nice progress 02:09:34.280 |
Okay finally we get the tensors, the encoder input. 02:09:54.880 |
The decoder input is batch of decoder input and we also move it to our device, batch to 02:10:44.240 |
Because in the one case we are only telling him to hide only the padding tokens, in the 02:10:51.240 |
other case we are also telling him to hide all these subsequent words, for each word 02:10:57.120 |
to hide all the subsequent words to mask them out. 02:11:02.440 |
Okay now we run the tensors through the transformer. 02:11:10.880 |
So first we calculate the output of the encoder and we encode using what the encoder input 02:11:24.360 |
Then we calculate the decoder output using the encoder output, the source, the 02:11:36.920 |
mask of the encoder, then the decoder input and the decoder mask. 02:11:46.800 |
Okay as we know this the result of this so the output of the model.encode will be a batch 02:11:59.800 |
Also the output of the decoder will be batch sequence length d model. 02:12:08.120 |
But we want to map it back to the vocabulary so we need the projection. 02:12:19.960 |
And this will produce a B so batch sequence length and target vocabulary size. 02:12:29.440 |
Okay now that we have the output of the model we want to compare it with our label. 02:12:34.360 |
So first let's extract the label from the batch. 02:12:45.180 |
So what is the label? It's B, so batch, by sequence length, in which each position tells us, so the 02:12:52.700 |
label, for each batch and sequence position, so for each dimension, tells us what is the 02:13:00.140 |
position in the vocabulary of that particular word, and we want these two to be comparable, 02:13:08.660 |
so to compute the loss we first transform the projection output; I show you now: projection output .view 02:13:28.620 |
Okay, what does this do? This basically transforms, I show you here, this size into this size: 02:13:40.600 |
(B multiplied by sequence length, target vocabulary size). 02:13:49.140 |
Okay because we want to compare it with this. 02:13:52.600 |
This is how the cross entropy wants the tensors to be. 02:14:04.660 |
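A sketch of that comparison in code:

```python
# proj_output: (B, seq_len, tgt_vocab_size), label: (B, seq_len)
# flatten both so cross entropy compares (B * seq_len, tgt_vocab_size) against (B * seq_len)
loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
```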
Okay, now that we have calculated the loss, we can update our progress bar, this one, with 02:14:39.560 |
This is this will show the loss on our progress bar. 02:15:07.260 |
Okay now we can back propagate the loss so loss.backward and finally we update the weights 02:15:17.620 |
of the model so that is the job of the optimizer and finally we can zero out the grad and we 02:15:28.040 |
move the global step by one the global step is being used mostly for TensorBoard to keep 02:15:32.820 |
track of the loss we can save the model every epoch okay model file name which we get from 02:15:45.700 |
our special methods this one we tell him the configuration we have and the name of the 02:15:54.140 |
file which is the epoch but with zeros in front and we save our model. 02:16:06.860 |
It is very good idea when we want to be able to resume the training to also save not only 02:16:12.740 |
the state of the model but also the state of the optimizer because the optimizer also 02:16:18.340 |
keep tracks of some statistics one for each weight to understand how to move each weight 02:16:24.820 |
independently and usually actually I saw that the optimizer dictionary is quite big so even 02:16:35.860 |
if it's big if you want your training to be resumable you need to save it otherwise the 02:16:40.800 |
optimizer will always start from zero and we'll have to figure out from zero even if 02:16:45.980 |
you start from a previous epoch, how to move each weight. So every time we save some snapshot, 02:16:58.660 |
we save the state of the model, which is all the weights of the model; we also want to save the optimizer 02:17:09.020 |
state; let's also save the global step; and we want to save all of this into the file 02:17:28.240 |
name, so the model file name, and that's it. Now let's build the code to run this, so if __name__ 02:17:44.280 |
I really find the warnings frustrating so I want to filter them out because I have some 02:17:50.260 |
a lot of libraries especially CUDA I already know what's the content and so I don't want 02:17:57.040 |
to visualize them every time but for sure for you guys I suggest watching them at least 02:18:02.680 |
once to understand if there is any big problem otherwise they're just complaining from CUDA 02:18:22.440 |
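A compact sketch of what was just described: the end-of-epoch checkpoint that also stores the optimizer state and the global step, and the entry point that filters the warnings and starts training (train_model and get_weights_file_path are the names assumed in this walkthrough):

```python
import warnings
import torch

# at the end of every epoch, save a resumable snapshot
model_filename = get_weights_file_path(config, f"{epoch:02d}")
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'global_step': global_step,
}, model_filename)

if __name__ == '__main__':
    warnings.filterwarnings('ignore')   # silence the (mostly CUDA) warnings discussed above
    config = get_config()
    train_model(config)
```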
okay let's try to run this code and see if everything is working fine we should what 02:18:29.760 |
we expect is that the code should download the data set the first time then it should 02:18:35.240 |
create the tokenizer and save it into its file and it should also start training the 02:18:42.960 |
model for 30 epochs of course it will never finish but let's do it let me check again 02:18:48.760 |
the configuration tokenizer okay let's run it 02:19:13.520 |
okay it's building the tokenizer and we have some problem here sequence length okay finally 02:19:20.040 |
the model is training I show you recap you guys what I had mistaken first of all the 02:19:26.920 |
sequence length was written incorrectly there was a capital L here and also in the data 02:19:32.720 |
set I forgot to save it here and here I had it also written capitalized so L was capital 02:19:41.120 |
and now the training is going on and as you can see the training is quite fast or at least 02:19:48.440 |
on my computer actually not so fast but because I chose a batch size of 8 I could try to increase 02:19:55.680 |
it and it's happening on CUDA the loss is decreasing and the weights will be saved here 02:20:03.080 |
so if we reach the end of the epoch it will create the first weight here so let's wait 02:20:07.720 |
until the end of the epoch and see if the weight is actually created before actually 02:20:12.800 |
finishing the training of the model let's do another thing we also would like to visualize 02:20:19.000 |
the output of the model while we are training and this is called validation so we want to 02:20:23.960 |
check how our model is evolving while it is getting trained so what we want to build is 02:20:31.160 |
a validation loop which will allow us to evaluate the model which also means that we want to 02:20:36.940 |
inference from this model and check some sample sentences and see if how they get translated 02:20:43.520 |
so let's start building the validation loop the first thing we do is we build a new method 02:20:48.280 |
called run validation and this method will accept some parameters that we will use for 02:21:01.920 |
now I just write all of them and later I explain how they will be used 02:21:05.500 |
so, we have a new method called run validation and this method will accept some parameters, we will use them later. 02:21:38.500 |
okay the first thing we do to run the validation is we put our model into evaluation mode so 02:21:46.260 |
we do model.eval and this means that this tells PyTorch that we are going to evaluate 02:21:52.100 |
our model and then what we will do we will inference two sentences and see what is the 02:22:30.740 |
so with torch.no_grad() we are disabling the gradient calculation for every tensor 02:22:47.400 |
that we will run inside this with block and this is exactly what we want we just want 02:22:52.100 |
to inference from the model we don't want to train it during this loop so let's get 02:22:58.700 |
a batch from the validation data set because we want to inference only two so we keep a 02:23:05.380 |
count of how many we have already processed and we get the input from this current batch 02:23:13.340 |
I want to remind you that for the validation ds we only have a batch size of 1 [typing] 02:23:28.120 |
this is the encoder input and we can also get the encoder mask 02:23:43.100 |
let's just verify that the size of the batch is actually 1 [typing] 02:24:01.040 |
and now let's go to the interesting part so as you remember when we calculate the when 02:24:09.360 |
we want to inference the model we need to calculate the encoder output only once and 02:24:14.440 |
reuse it for every token that the model will output from the decoder so let's create another 02:24:20.600 |
function that will run the greedy decoding on our model and we will see that it will 02:24:26.520 |
run the encoder only once so let's call this function greedy decode [typing] 02:24:54.080 |
okay let's create some tokens that we will need so the SOS token which is the start of 02:25:01.140 |
sentence; we can get it from either tokenizer, it doesn't matter if it's the source or the 02:25:19.660 |
target; and the EOS token. Okay, and then what we do is we pre-compute the encoder output and reuse 02:25:35.740 |
it for every token we get from the decoder so 02:25:42.180 |
we just give the source and the source mask which is the encoder input and the encoder 02:25:57.340 |
mask we can also call it encoder input and encoder mask then we get the then we okay 02:26:06.800 |
how do we do the inferencing? The first thing we do is give to the decoder the start 02:26:11.820 |
of sentence token, so that the decoder will output the first token of 02:26:18.060 |
the translated sentence. Then, just like we saw in my slides, at every iteration 02:26:24.540 |
we append the previously predicted token to the decoder input, so that the decoder can output the 02:26:31.660 |
next token. Then we take that next token, we append it again to the input of the 02:26:36.620 |
decoder, and we get the successive token. So let's build the decoder input for the first 02:26:42.780 |
iteration, which is only the start of sentence token: 02:27:06.780 |
we fill this one-token tensor with the start of sentence token, 02:27:17.280 |
and it has the same type as the encoder input. Okay, now we will keep asking the decoder 02:27:26.740 |
to output the next token until we reach either the end of sentence token or the max length 02:27:32.420 |
we have defined here. So we can do a while True, and then our first stopping condition 02:27:39.260 |
is if the decoder input, which grows at every step because the output becomes the input of the next step, becomes larger than max_len. 02:27:58.140 |
Why do we have two dimensions here? One is for the batch and one is for the tokens of the decoder input. 02:28:24.820 |
We can use our function causal_mask to say that we don't want the input to watch future tokens, 02:28:39.780 |
and we don't need the other mask, because here we don't have any padding tokens, as you can see. 02:29:07.820 |
We reuse the output of the encoder for every iteration of the loop, we reuse the source 02:29:15.260 |
mask, so the mask of the encoder input, then we give the decoder input along with 02:29:21.220 |
its mask, the decoder mask, and then we get the next token. 02:29:31.860 |
So we get the probabilities of the next token using the projection layer, 02:29:39.860 |
but we only want the projection of the last token, so the next token after the last one we have. 02:29:59.460 |
Then we take the token with the maximum probability: this is the greedy search. 02:30:18.620 |
And then we take this token and we append it back to the decoder input, because it will become part of the input of the next iteration. 02:30:33.700 |
So we take the decoder input and we append the next token, which means we create another tensor. 02:31:08.900 |
Yeah, that should be correct. Okay, if the next token, so if the next word or token, is equal 02:31:25.700 |
to the end of sentence token, then we also stop the loop. 02:31:31.260 |
And this is our greedy search. Now we can just return the output, and the output is basically 02:31:37.660 |
the decoder input, because every time we are appending the next token to it, and we remove the batch dimension. 02:31:47.660 |
And that's our greedy decoding. Now we can use it in the validation 02:31:54.060 |
function, so we can finally get the model output, which is equal to greedy_decode, to which we give the model, the encoder input and mask, the tokenizers, the max length and the device. 02:32:19.620 |
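Putting the pieces together, this is a minimal sketch of the greedy decoding we just described. The methods model.encode / model.decode / model.project and the causal_mask helper are the ones we wrote earlier in the video, but their exact signatures here are assumptions, so adapt them to your own code:

```python
import torch

def causal_mask(size):
    # Same idea as the helper from the dataset file: True where a position is
    # allowed to attend (itself and the past), False for future positions.
    mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
    return mask == 0

def greedy_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device):
    sos_idx = tokenizer_tgt.token_to_id('[SOS]')
    eos_idx = tokenizer_tgt.token_to_id('[EOS]')

    # Pre-compute the encoder output once and reuse it at every decoding step
    encoder_output = model.encode(source, source_mask)
    # The decoder input for the first iteration is only the SOS token
    decoder_input = torch.empty(1, 1).fill_(sos_idx).type_as(source).to(device)

    while True:
        # First stopping condition: the decoder input has reached max_len
        if decoder_input.size(1) == max_len:
            break

        # Causal mask so the decoder cannot look at future tokens (no padding here)
        decoder_mask = causal_mask(decoder_input.size(1)).type_as(source_mask).to(device)

        # Reuse the encoder output and decode one more step
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)

        # Project only the last position and take the most probable token (greedy search)
        prob = model.project(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        decoder_input = torch.cat(
            [decoder_input,
             torch.empty(1, 1).type_as(source).fill_(next_word.item()).to(device)],
            dim=1,
        )

        # Second stopping condition: the model produced the EOS token
        if next_word == eos_idx:
            break

    return decoder_input.squeeze(0)  # remove the batch dimension
```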
And then we want to compare this model output with what we expected, so with the label. 02:32:26.900 |
Let's collect all of these: the input we gave to the model (the source text), what the 02:32:32.820 |
model output (the predicted translation), and what we expected as output (the target text). We 02:32:38.500 |
save all of this in lists, and then at the end of the loop we will print them on the console. 02:32:57.780 |
To get the text of the output of the model we need to use the tokenizer again to convert 02:33:18.500 |
the tokens back into text, and we use of course the target tokenizer, because this is the target language. 02:33:40.460 |
Okay, and now we save all of this into the respective lists. 02:34:13.820 |
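This part of the loop could look roughly like the fragment below; the batch keys 'src_text' and 'tgt_text' and the list names are assumptions based on how we built the dataset:

```python
# Inside the validation loop, after greedy_decode has produced model_out:
source_text = batch['src_text'][0]  # raw source sentence (batch size is 1)
target_text = batch['tgt_text'][0]  # expected translation (the label)
# Convert the predicted token ids back into text with the target tokenizer
model_out_text = tokenizer_tgt.decode(model_out.detach().cpu().numpy())

source_texts.append(source_text)
expected.append(target_text)
predicted.append(model_out_text)

# Print through the function provided by tqdm so we don't break the progress bar
print_msg('-' * 80)
print_msg(f'SOURCE:    {source_text}')
print_msg(f'TARGET:    {target_text}')
print_msg(f'PREDICTED: {model_out_text}')
```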
Why are we using this function called print_msg and not 02:34:18.540 |
just the print of Python? Because in the main training 02:34:23.940 |
loop we are using tqdm, which is our really nice looking progress bar, but it is not suggested 02:34:32.260 |
to print directly on the console when this progress bar is running. To print on the 02:34:38.460 |
console there is a print method provided by tqdm (tqdm.write), and we will give this method 02:34:45.740 |
to this function so that the output does not interfere with the progress bar printing. 02:35:05.860 |
Okay, so we print one message for the source, one for the target and one for the prediction. 02:35:34.300 |
And if we have already processed num_examples sentences, then we just break. So why have we 02:35:48.260 |
created these lists? Actually, we can also send all of this to TensorBoard: 02:35:58.700 |
for example, if we have TensorBoard enabled we can send all of this to the TensorBoard writer, 02:36:04.280 |
and to do that we actually need another library that allows us to calculate some metrics. I 02:36:10.900 |
think we can skip this part, but if you are really interested, in the code I published 02:36:17.620 |
on GitHub you will find that I use a library called torchmetrics, which allows us to 02:36:24.020 |
calculate the character error rate, the BLEU metric, which is really useful for translation 02:36:31.820 |
tasks, and the word error rate. So if you are really interested you can find the code on GitHub. 02:36:39.420 |
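For reference, a rough sketch of how those metrics could be logged with torchmetrics and the TensorBoard writer; the `predicted`/`expected` lists and the `writer`/`global_step` variables are the ones from our loop, and the exact metric setup in my GitHub code may differ slightly:

```python
import torchmetrics

if writer:
    # Character error rate between the predicted and the expected sentences
    cer = torchmetrics.CharErrorRate()(predicted, expected)
    writer.add_scalar('validation cer', cer, global_step)

    # Word error rate
    wer = torchmetrics.WordErrorRate()(predicted, expected)
    writer.add_scalar('validation wer', wer, global_step)

    # BLEU score (expects a list of reference translations per prediction)
    bleu = torchmetrics.BLEUScore()(predicted, [[e] for e in expected])
    writer.add_scalar('validation BLEU', bleu, global_step)

    writer.flush()
```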
But for our demonstration I think it's not necessary, and actually we can also 02:36:47.220 |
remove this, given that we are not doing this part. Okay, so now that we have our run_validation 02:36:55.140 |
method we can just call it. What I usually do is run the validation every few steps, 02:37:05.700 |
but because we want to see it as soon as possible, what we will do is first run it at 02:37:13.880 |
every iteration. We also put the model.train() call inside this loop, so that every time after 02:37:22.100 |
we run the validation the model goes back into its training mode. So now we can just call run_validation 02:37:29.500 |
and give it all the parameters that it needs: we give it the model, the validation data, the tokenizers, the max length and the device. 02:37:57.140 |
For printing messages, are we printing any message? We are, so let's create a lambda that just 02:38:09.020 |
writes the message with tqdm, then we need to give the global step 02:38:25.060 |
and the writer, which we will not use. Okay, now I think we can run the training again. 02:38:58.660 |
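The call we just described looks roughly like this (a sketch: `batch_iterator` is assumed to be the tqdm object wrapping the training dataloader, and the other names are the ones assumed for our training loop):

```python
# For now, called at every iteration inside the training loop:
run_validation(
    model, val_dataloader, tokenizer_src, tokenizer_tgt, config['seq_len'], device,
    lambda msg: batch_iterator.write(msg),  # print without breaking the tqdm bar
    global_step, writer,
)
model.train()  # put the model back into training mode after validating
```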
All right, it looks like it is working. The model is running the validation 02:39:05.780 |
at every step, which is not desirable at all, but at least we know that the greedy search 02:39:10.820 |
is working, or at least it looks like it is working. The model is not predicting 02:39:17.060 |
anything useful: actually it's just predicting a bunch of commas, because it hasn't been trained 02:39:23.980 |
at all. But if we train the model, after a few epochs the 02:39:29.940 |
model should become better and better. So let's stop this training and let's 02:39:36.980 |
put the validation call back where it belongs, so at the end of every epoch, and the model.train() we 02:39:43.880 |
can keep here, no problem. Okay, I will now fast forward to a model that has 02:39:52.860 |
been pre-trained. I pre-trained it for a few hours so that we can inference it and 02:39:58.980 |
visualize the attention. I have copied the pre-trained weights that I pre-calculated, 02:40:05.660 |
and I also created this notebook reusing the functions that we have defined before in the 02:40:10.900 |
train file. The code is very simple: I just copied and pasted the code from the train 02:40:15.980 |
file, I just load the model and run the validation, the same method that we just wrote, and then 02:40:22.460 |
I ran the validation on the pre-trained model. Let's run it again, for example, and as you can see 02:40:28.860 |
the model is inferencing 10 example sentences and the result is not bad. I mean, we can see 02:40:35.020 |
that "Levin smiled" becomes "Levin sorrise", and the prediction "Levin sorrise" is matching; most of them are matching, actually. 02:40:40.420 |
We could also say that it's nearly overfit on this particular data, but this is the power 02:40:47.740 |
of the transformer: I didn't train it for many days, I just trained it for a few hours, if 02:40:53.020 |
I remember correctly, and the results are really, really good. And now let's make 02:40:58.760 |
the notebook that we will use to visualize the attention of this pre-trained model. Given 02:41:06.080 |
the file that we built before, so train.py, you can also train your own model choosing 02:41:10.820 |
the language of your choice. I highly recommend that you change the language and 02:41:15.060 |
try to see how the model is performing, and try to diagnose why the model is performing 02:41:21.240 |
badly, if it's performing badly, or, if it's performing well, try to understand how you can improve 02:41:26.620 |
it further. So let's try to visualize the attention: let's create a new notebook, let's call 02:41:33.660 |
it, let's say, attention_visualization. Okay, so the first thing we do is import all the libraries we need. 02:42:14.060 |
I will also be using a library called Altair. It's a visualization library for charts, 02:42:25.180 |
nothing related to deep learning, it's just for visualization, and in 02:42:30.420 |
particular the visualization function I actually found online, it's not written by me, just 02:42:34.660 |
like most visualization functions you can easily find on the internet if you want 02:42:38.240 |
to build a chart or a histogram, etc. So I am using this library mostly 02:42:43.920 |
because I copied the code from the internet to visualize the attention, but all the rest is my own 02:42:48.700 |
code. So let's import all of this, and of course you will have to 02:43:16.780 |
install this particular library when you run the code on your computer. Let's also define 02:43:22.280 |
the device, for which we can just copy the code from the train file, 02:43:38.760 |
and then we load the model, which we can also copy from there, like this. Okay, let's paste it here, 02:43:51.640 |
and this one becomes the vocabulary of the source and the vocabulary of the target. 02:44:18.320 |
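As a rough sketch, the setup cell of the notebook could look like this; the helpers get_model, get_ds, get_config and the checkpoint path/layout are assumptions based on the files we wrote, so adapt the names to your own code:

```python
import torch
from config import get_config          # assumed config helper from our project
from train import get_model, get_ds    # assumed helpers from our train.py

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
config = get_config()

# Rebuild the dataloaders and tokenizers, then the model with the two vocabulary sizes
train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)

# Load the pre-trained weights (path and checkpoint layout are assumptions)
state = torch.load('weights/tmodel_latest.pt', map_location=device)
model.load_state_dict(state['model_state_dict'])
model.eval()
```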
Okay, now let's make a function to load a batch. 02:44:22.560 |
So we have a function called load_batch, which is a function that loads a batch, and 02:45:27.960 |
then I convert the token ids back into tokens using the tokenizer. 02:47:10.520 |
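Here is a minimal sketch of what such a function could look like, reusing greedy_decode and the names from the setup cell above; load_batch itself, the dataloader variable and the batch keys are assumptions:

```python
from train import greedy_decode  # assumed to be importable from our train.py

def load_batch():
    # Take one batch (batch size 1) from the validation dataloader
    batch = next(iter(val_dataloader))
    encoder_input = batch['encoder_input'].to(device)
    encoder_mask = batch['encoder_mask'].to(device)
    decoder_input = batch['decoder_input'].to(device)

    with torch.no_grad():
        model_out = greedy_decode(model, encoder_input, encoder_mask,
                                  tokenizer_src, tokenizer_tgt, config['seq_len'], device)

    # Convert the ids back into human-readable tokens for the chart labels
    encoder_input_tokens = [tokenizer_src.id_to_token(int(i)) for i in encoder_input[0].cpu().numpy()]
    decoder_input_tokens = [tokenizer_tgt.id_to_token(int(i)) for i in decoder_input[0].cpu().numpy()]
    return batch, encoder_input_tokens, decoder_input_tokens
```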
Okay, now I will build the necessary functions to visualize the attention. This code 02:47:27.520 |
is nothing interesting from a learning point of view, 02:47:36.240 |
so I will copy it, because it's quite long to write, 02:47:39.040 |
and the salient parts I will explain, of course. 02:47:53.120 |
For example, we have the attention in three positions: the encoder self-attention, the decoder self-attention and the cross-attention. 02:48:09.000 |
How do we get the information about the attention? 02:48:14.800 |
We choose which layer and which head we want to get the attention from. 02:48:36.960 |
In the multi-head attention block we not only return the output to the next layer, we also saved the attention scores, 02:48:51.200 |
so now we can just retrieve them and visualize them. 02:50:10.680 |
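A sketch of how the stored scores can be retrieved, assuming the attribute names we used in the model file (layers, self_attention_block, cross_attention_block, attention_scores); these names are assumptions here, so check them against your own model.py:

```python
def get_attn_map(attn_type: str, layer: int, head: int):
    # Pick the multi-head attention block of the requested layer...
    if attn_type == 'encoder':
        block = model.encoder.layers[layer].self_attention_block
    elif attn_type == 'decoder':
        block = model.decoder.layers[layer].self_attention_block
    elif attn_type == 'encoder-decoder':
        block = model.decoder.layers[layer].cross_attention_block
    else:
        raise ValueError(f'Unknown attention type: {attn_type}')
    # ...and read the scores it stored during the forward pass.
    # Shape: (batch, heads, seq_len, seq_len); the batch size is 1 here.
    return block.attention_scores[0, head].data
```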
And then we visualize what the source and the target are. 02:51:12.440 |
We also look for the first occurrence of the padding token: 02:51:15.000 |
because this is the batch taken from the dataset, 02:51:17.320 |
which is already the tensor built for training, it is padded up to the sequence length, 02:51:23.000 |
so the position of the first padding token gives us the number of actual tokens in our sentence. 02:51:48.680 |
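A tiny sketch of that idea, assuming the '[PAD]' token name we used when building the tokenizer:

```python
# Index of the first [PAD] token = number of real tokens in the padded sentence
pad_id = tokenizer_src.token_to_id('[PAD]')
sentence_len = int((batch['encoder_input'][0] == pad_id).nonzero()[0].item())
```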
this function was wrong, so now it should work. 02:51:54.120 |
Okay, this sentence is too small, let's get a longer one. 02:52:00.920 |
You cannot remain as you are, especially you. 02:52:08.920 |
Okay, let's print the attention for some of the layers; 02:52:15.960 |
if you remember, the parameter N is equal to six, 02:52:28.320 |
and the heads are zero, one, two, three, four, five, six, and seven. 02:52:31.860 |
Okay, let's first visualize the encoder self-attention. 02:53:07.040 |
So it's the same sentence that is attending to itself. 02:53:11.160 |
So we need to provide the input sentence of the encoder for both the rows and the columns. 02:53:21.120 |
Okay, let's say we want to visualize no more than 20 tokens. 02:53:38.260 |
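The chart function I copied is roughly of this shape: a heatmap of the attention scores with the tokens on the two axes. This is a hedged sketch using Altair; the exact encoding and styling in my notebook differ:

```python
import altair as alt
import pandas as pd

def attention_heatmap(attn, row_tokens, col_tokens, max_len=20):
    # Flatten the (seq_len, seq_len) score matrix into a long dataframe for Altair
    records = []
    for r in range(min(max_len, len(row_tokens))):
        for c in range(min(max_len, len(col_tokens))):
            records.append({
                'row': f'{r}: {row_tokens[r]}',
                'col': f'{c}: {col_tokens[c]}',
                'score': float(attn[r, c]),
            })
    df = pd.DataFrame(records)
    return (alt.Chart(df)
              .mark_rect()
              .encode(x='col:O', y='row:O', color='score:Q',
                      tooltip=['row', 'col', 'score'])
              .properties(width=400, height=400))
```

For the encoder self-attention we would call it with the encoder tokens on both axes, for example attention_heatmap(get_attn_map('encoder', 0, 0), encoder_input_tokens, encoder_input_tokens).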
We expect the values along the diagonal to be high, 02:53:42.280 |
because it's the dot product of each token with itself. 02:53:50.160 |
For example, we can look at how the start of sentence token behaves, 02:53:54.840 |
at least for head zero and layer zero; 02:54:02.880 |
other heads, though, do learn some very small mappings. 02:54:10.840 |
By hovering over the cells we can see the actual values of the self-attention. 02:54:16.360 |
For example, we can see the attention is very strong here, 02:54:19.480 |
so the words "especially" and "specially" are related. 02:54:32.520 |
Because each head will watch different aspects of each token, 02:54:47.360 |
we also hope that they learn different kinds of mappings; 02:54:57.360 |
for each head we also have different WQ, WK, and WV matrices, 02:55:01.880 |
so they should also learn different relationships. 02:55:06.560 |
Now we may also want to visualize the decoder self-attention. 02:55:13.960 |
Let me just copy the code and change the parameters; 02:55:36.100 |
the tokens that will be on the rows and the columns change: 02:55:43.160 |
they become decoder input tokens and decoder input tokens, 02:55:50.600 |
because we are using the decoder self-attention. 02:56:07.720 |
The one I find most interesting is the cross attention. 02:56:13.320 |
Okay, let me just copy the code and run it again. 02:56:32.520 |
So here on the rows we will show the encoder input tokens 02:56:36.960 |
and on the columns we will show the decoder input tokens. 02:56:43.280 |
Okay, this is more or less how the interaction between the two sentences looks. 02:56:56.960 |
So this is where we find the cross-attention, calculated 02:57:01.960 |
using the keys and the values coming from the encoder, while the queries come from the decoder. 02:57:09.600 |
So this is actually where the translation task happens. 02:57:24.880 |
So I invite you guys to run the code by yourself. 02:57:31.160 |
The best way to learn is to write the code along with me, following the video. 02:57:35.040 |
You can pause the video, you can write the code by yourself. 02:57:39.800 |
Okay, let me give you some practical examples. 02:57:42.400 |
For example, when I'm writing the model code, you can watch how I write the code 02:57:48.440 |
for one particular layer, then stop the video and try to write it by yourself. 02:57:59.920 |
And if, after one or two minutes, 02:58:02.440 |
you really cannot figure out what the problem is, resume the video and check the solution. 02:58:08.200 |
Some things, of course, you cannot come up with by yourself; 02:58:14.200 |
the model code is basically just an application of formulas, 02:58:22.320 |
but you need to see how all the layers are interacting with each other. 02:58:30.440 |
The training part actually is quite standard; 02:58:38.840 |
the interesting part is how we calculate the loss. 02:58:59.760 |
And I hope in the next videos to make more examples 02:59:04.680 |
of transformers and other models that I am familiar with. 02:59:13.320 |
Please let me know in the comments if there is something that you don't understand or you want me to explain better. 02:59:16.960 |
I will also, for sure, follow the comments section.