
Attention is all you need (Transformer) - Model explanation (including math), Inference and Training


Chapters

0:00 Intro
1:10 RNN and their problems
8:04 Transformer Model
9:02 Maths background and notations
12:20 Encoder (overview)
12:31 Input Embeddings
15:04 Positional Encoding
20:08 Single Head Self-Attention
28:30 Multi-Head Attention
35:39 Query, Key, Value
37:55 Layer Normalization
40:13 Decoder (overview)
42:24 Masked Multi-Head Attention
44:59 Training
52:09 Inference

Whisper Transcript

00:00:00.000 | Hello guys, welcome to my video about the transformer and this is actually the version 2.0 of my series on the transformer.
00:00:08.500 | I had a previous video in which I talked about the transformer, but the audio quality was not good, and,
00:00:15.520 | as the video had a huge success, my viewers suggested that I improve the audio quality, so this is why I'm making this video.
00:00:24.120 | You don't have to watch the previous series because I would be doing basically the same things but with some improvements,
00:00:30.080 | so I'm actually compensating for some mistakes I made and adding some improvements.
00:00:35.580 | After watching this video, I suggest watching my other video about how to code a transformer model from scratch,
00:00:43.500 | so how to code the model itself, how to train it on data and how to run inference with it.
00:00:49.220 | Stick with me because it's gonna be a little long journey but for sure worth it.
00:00:54.160 | Now, before we talk about the transformer, I want to first talk about recurrent neural networks,
00:01:00.320 | so the networks that were used for most of the sequence-to-sequence tasks before the transformer was introduced.
00:01:08.500 | So let's review them.
00:01:11.160 | Recurrent neural networks existed a long time before the transformer, and they allowed mapping one input sequence to another output sequence.
00:01:21.400 | In this case, our input is X and we want an output sequence Y.
00:01:26.680 | What we did before is that we split the sequence into single items, so we gave the recurrent neural network the first item as input, so X1,
00:01:36.760 | along with an initial state, usually made up of only zeros, and the recurrent neural network produced an output, let's call it Y1.
00:01:46.540 | And this happened at the first time step.
00:01:49.740 | Then we took the hidden state, this is called the hidden state of the network of the previous time step,
00:01:56.760 | along with the next input token, so X2, and the network had to produce the second output token, Y2.
00:02:06.360 | And then we did the same procedure at the third time step, in which we took the hidden state of the previous time step,
00:02:13.500 | along with the input token at the time step 3, and the network had to produce the next output token, which is Y3.
00:02:23.520 | If you have n tokens, you need n time steps to map an n-sequence input into an n-sequence output.
00:02:33.320 | This worked fine for a lot of tasks, but had some problems. Let's review them.
00:02:40.760 | The problems with recurrent neural networks, first of all, are that they are slow for long sequences,
00:02:46.800 | because think of the process we did before, we have kind of like a for loop in which we do the same operation for every token in the input.
00:02:57.080 | So the longer the sequence, the longer this computation, and this made the network hard to train for long sequences.
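As a side note, the "for loop" mentioned here can be sketched in a few lines of PyTorch. This is only a toy illustration with made-up sizes, not the exact network from the slides:

```python
import torch
import torch.nn as nn

seq_len, input_size, hidden_size = 6, 512, 512   # toy sizes for illustration
rnn_cell = nn.RNNCell(input_size, hidden_size)

x = torch.randn(seq_len, input_size)   # the input tokens X1..Xn as vectors
h = torch.zeros(1, hidden_size)        # initial hidden state made of zeros

outputs = []
for t in range(seq_len):               # one time step per token: this sequential loop
    h = rnn_cell(x[t].unsqueeze(0), h) # is exactly what makes RNNs slow on long sequences
    outputs.append(h)                  # the hidden state at step t is used to produce Y_t
```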
00:03:07.120 | The second problem was the vanishing or the exploding gradients.
00:03:11.280 | Now, you may have heard these terms or expressions on the Internet or from other videos,
00:03:16.160 | but I will try to give you a brief insight on what they mean on a practical level.
00:03:23.080 | So as you know, frameworks like PyTorch, they convert our networks into a computation graph.
00:03:31.200 | So basically, suppose we have a computation graph. This is not a neural network.
00:03:36.320 | I will be making a computational graph that is very simple, has nothing to do with neural networks, but will show you the problems that we have.
00:03:45.120 | So imagine we have two inputs X and another input, let's call it Y.
00:03:51.880 | Our computational graph first, let's say, multiplies these two numbers.
00:03:55.880 | So we have a first function, let's call it f of X and Y.
00:04:02.560 | That is X multiplied by Y.
00:04:08.000 | And the result, let's call it Z, is given to another function.
00:04:14.440 | Let's call this function g of Z is equal to, let's say, Z squared.
00:04:22.400 | What does PyTorch do, for example? Usually we have a loss function, and
00:04:30.160 | PyTorch calculates the derivative of the loss function with respect to each weight.
00:04:35.840 | In this case, we just calculate the derivative of the g function, so the output function with respect to all of its inputs.
00:04:42.520 | So the derivative of g with respect to X, let's say, is equal to the derivative of g with respect to f,
00:04:55.680 | multiplied by the derivative of f with respect to X.
00:05:02.640 | These two terms kind of cancel out; this is called the chain rule.
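Written out for this tiny graph, the chain rule being described looks like this (a worked example, not something shown on the slides):

```latex
f(x, y) = x \cdot y = z, \qquad g(z) = z^2
\frac{\partial g}{\partial x}
  = \frac{\partial g}{\partial z} \cdot \frac{\partial z}{\partial x}
  = 2z \cdot y = 2xy \cdot y = 2xy^2
```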
00:05:07.040 | Now, as you can see, the longer the chain of computation, so if we have many nodes, one after another,
00:05:14.440 | the longer this multiplication chain, so here we have two because the distance from this node and this is two,
00:05:21.720 | but imagine you have 100 or 1000.
00:05:25.960 | Now imagine this number is 0.5 and this number is 0.5 also.
00:05:33.000 | The resulting number, when we multiply them together, is a number that is smaller than the two initial numbers.
00:05:38.920 | It's going to be 0.25, because 1/2 multiplied by 1/2 is 1/4.
00:05:46.480 | So if we have two numbers that are smaller than one and we multiply them together, they will produce an even smaller number.
00:05:53.240 | And if we have two numbers that are bigger than one and we multiply them together,
00:05:57.400 | they will produce a number that is bigger than both of them.
00:06:00.480 | So if we have a very long chain of computation, it eventually will either become a very big number or a very small number.
00:06:08.520 | And this is not desirable, first of all, because our CPU or our GPU can only represent numbers up to a certain precision,
00:06:17.680 | let's say 32 bit or 64 bit.
00:06:20.200 | And if the number becomes too small, the contribution of this number to the output will become very small.
00:06:26.760 | So when PyTorch, or whatever framework we use, calculates how to adjust the weights,
00:06:35.560 | the weights will move very, very, very slowly, because the contribution of this product will be a very small number.
00:06:44.400 | And this means that the gradient is vanishing, or in the other case, it can explode and become a very big number.
00:06:53.920 | And this is a problem. The next problem is difficulty in accessing information from a long time ago.
00:06:59.720 | What does it mean? It means that, as you remember from the previous slide,
00:07:03.360 | we saw that the first input token is given to the recurrent neural network along with the first state.
00:07:10.440 | Now, we need to think that the recurrent neural network is a long graph of computation.
00:07:14.480 | It will produce a new hidden state.
00:07:16.680 | Then we will use the new hidden state along with the next token to produce the next output.
00:07:22.880 | If we have a very long input sequence, the last token will have a hidden state whose contribution from the first token has nearly gone because of this long chain of multiplication.
00:07:37.200 | So actually, the last token will not depend much on the first token.
00:07:43.000 | And this is also not good because, for example, we know as humans that in a text, in a quite long text,
00:07:49.600 | the context that we saw, let's say 200 words before, is still relevant to the context of the current words.
00:07:57.480 | And this is something that the RNN could not map. And this is why we have the transformer.
00:08:05.680 | So the transformer solves these problems with the recurrent neural networks, and we will see how.
00:08:11.640 | The structure of the transformer, we can divide into two macro blocks.
00:08:17.440 | The first macro block is called the encoder, and it's this part here.
00:08:22.560 | The second macro block is called the decoder, and it's the second part here.
00:08:28.000 | The third part here you see on the top is just a linear layer, and we will see why it's there and what its function is.
00:08:35.720 | The two blocks, so the encoder and the decoder, are connected by this connection you can see here,
00:08:43.680 | in which some output of the encoder is sent as input to the decoder. And we will also see how.
00:08:50.320 | Let's start, first of all, with some notations that I will be using during my explanation.
00:08:57.640 | And you should be familiar with this notation. Also to review some maths.
00:09:02.200 | So the first thing we should be familiar with is matrix multiplication.
00:09:06.480 | So imagine we have an input matrix, which is a sequence of, let's say, words.
00:09:12.840 | So sequence by d model, and we will see why it's called sequence by d model.
00:09:17.400 | So imagine we have a matrix that is 6 by 512, in which each row is a word.
00:09:27.120 | And this word is not made of characters, but of 512 numbers.
00:09:31.600 | So each word is represented by 512 numbers, OK, like this.
00:09:37.040 | Imagine you have 512 of them along this row, 512 along this other row, etc, etc.
00:09:43.520 | 1, 2, 3, 4, 5, so we need another one here, OK.
00:09:47.720 | The first word we will call it A, the second B, the C, D, E, and F.
00:09:54.920 | If we multiply this matrix by another matrix, let's say the transpose of this matrix,
00:10:01.320 | so it's a matrix where the rows become columns, so 3, 4, 5, and 6.
00:10:15.960 | The first word, A, will be here, then B, C, D, E, and F.
00:10:21.920 | And then we have 512 numbers along each column, because before we had them on the rows, now
00:10:33.480 | they will become on the columns.
00:10:35.160 | So here we have the 512th number, etc, etc.
00:10:40.240 | This is a matrix that is 512 by 6, so let me add some brackets here.
00:10:47.440 | If we multiply them, we will get a new matrix that is, we cancel the inner dimensions and
00:10:54.320 | we get the outer dimensions, so it will become 6 by 6.
00:10:58.300 | So it will be 6 rows by 6 columns.
00:11:00.820 | So let's draw it.
00:11:02.720 | How do we calculate the values of this output matrix?
00:11:05.720 | This is 6 by 6.
00:11:08.840 | This is the dot product of the first row with the first column.
00:11:13.680 | So this is A multiplied by A. The second value is the first row with the second column.
00:11:21.160 | The third value is the first row with the third column until the last column, so A multiplied
00:11:29.360 | by F, etc.
00:11:31.240 | What is the dot product?
00:11:32.400 | It's basically you take the first number of the first row, so here we have 512 numbers,
00:11:39.760 | here we have 512 numbers.
00:11:41.620 | So you take the first number of the first row and the first number of the first column,
00:11:46.480 | you multiply them together.
00:11:48.640 | Second value of the first row, second value of the first column, you multiply them together.
00:11:54.880 | And then you add all these numbers together, so it will be, let's say, this number multiplied
00:12:02.120 | by this plus this number multiplied by this plus this number multiplied by this plus this
00:12:08.260 | number multiplied by this plus you sum all these numbers together and this is the A dot
00:12:13.940 | product A. So we should be familiar with this notation because I will be using it a lot
00:12:19.120 | in the next slides.
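As a quick sanity check of the shapes described above, here is a minimal NumPy sketch (random numbers standing in for the word embeddings):

```python
import numpy as np

seq_len, d_model = 6, 512
X = np.random.randn(seq_len, d_model)   # 6 words, each row made of 512 numbers

result = X @ X.T                         # (6, 512) @ (512, 6) -> (6, 6)
print(result.shape)                      # (6, 6)

# Each entry is a dot product of a row of X with a row of X,
# e.g. result[0, 1] is the dot product of word A with word B.
assert np.isclose(result[0, 1], np.dot(X[0], X[1]))
```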
00:12:20.920 | Let's start our journey of the transformer by looking at the encoder.
00:12:26.820 | So the encoder starts with the input embeddings.
00:12:31.160 | So what is an input embedding?
00:12:33.800 | First of all, let's start with our sentence.
00:12:36.400 | We have a sentence of, in this case, six words.
00:12:40.460 | What we do is we tokenize it.
00:12:42.220 | We transform the sentence into tokens.
00:12:44.880 | What does it mean to tokenize?
00:12:46.080 | We split them into single words.
00:12:49.240 | It is not necessary to always split the sentence using single words.
00:12:54.440 | We can even split the sentence in smaller parts that are even smaller than a single
00:13:00.440 | word.
00:13:01.440 | For example, we can split this sentence into, let's say, 20 tokens by splitting each word
00:13:08.600 | into multiple parts.
00:13:10.480 | This is usually done in most modern transformer models, but we will not be doing it, otherwise
00:13:18.120 | it's really difficult to visualize.
00:13:20.400 | So let's suppose we have this input sentence and we split it into tokens and each token
00:13:26.160 | is a single word.
00:13:27.940 | The next step we do is we map these words into numbers and these numbers represent the
00:13:35.160 | position of these words in our vocabulary.
00:13:38.320 | So imagine we have a vocabulary of all the possible words that appear in our training set.
00:13:44.600 | Each word will occupy a position in this vocabulary.
00:13:47.880 | So for example, the word "your" will occupy the position 105, the word "cat" will occupy
00:13:52.680 | the position 6,500, etc.
00:13:56.800 | And as you can see, this cat here has the same number as this cat here because they
00:14:01.120 | occupy the same position in the vocabulary.
00:14:04.200 | We take these numbers, which are called input IDs, and we map them into a vector of size 512.
00:14:12.840 | This vector is a vector made of 512 numbers and we always map the same word to always
00:14:19.880 | the same embedding.
00:14:22.000 | However, these numbers are not fixed; they are parameters of our model.
00:14:28.320 | So our model will learn to change these numbers in such a way that it represents the meaning
00:14:34.280 | of the word.
00:14:35.360 | So the input IDs never change because our vocabulary is fixed, but the embedding will
00:14:40.240 | change along with the training process of the model.
00:14:43.240 | So the embedding numbers will change according to the needs of the loss function.
00:14:48.680 | So the input embeddings basically map each single word into an embedding of size 512.
00:14:55.120 | And we call this quantity, 512, "d_model", because it's the same name that is also used in the
00:15:00.920 | paper
00:15:01.920 | "Attention Is All You Need".
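A minimal sketch of this input embedding step in PyTorch; the vocabulary size and most of the token IDs below are made up for illustration (only "your" = 105 and "cat" = 6500 come from the example above):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # hypothetical vocabulary size
embedding = nn.Embedding(vocab_size, d_model)  # a learnable lookup table

# Input IDs: positions of the tokens in the vocabulary ("your cat is a lovely cat").
input_ids = torch.tensor([105, 6500, 5021, 38, 7399, 6500])

word_embeddings = embedding(input_ids)         # shape: (6, 512)
# The same ID (6500, "cat") always maps to the same vector,
# but the vector's values change during training.
```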
00:15:05.320 | Let's look at the next layer of the encoder, which is the positional encoding.
00:15:10.680 | So what is positional encoding?
00:15:13.600 | What we want is that each word should carry some information about its position in the
00:15:19.920 | sentence.
00:15:21.260 | Because now we built a matrix of words that are embeddings, but they don't convey any
00:15:27.200 | information about where that particular word is inside the sentence.
00:15:32.920 | And this is the job of the positional encoding.
00:15:35.700 | So what we do, we want the model to treat words that appear close to each other as close
00:15:42.080 | and words that are distant as distant.
00:15:44.900 | So we want the model to see this information about the spatial information that we see
00:15:49.680 | with our eyes.
00:15:50.680 | So for example, when we see this sentence, what is positional encoding, we know that
00:15:54.840 | the word "what" is farther from the word "encoding" than it is from the word "is".
00:16:02.520 | Because we have this spatial information given by our eyes, but the model cannot see this.
00:16:07.340 | So we need to give some information to the model about how the words are spatially distributed
00:16:13.380 | inside of the sentence.
00:16:15.960 | And we want the positional encoding to represent a pattern that the model can learn.
00:16:21.780 | And we will see how.
00:16:24.980 | Imagine we have our original sentence, "Your cat is a lovely cat".
00:16:28.940 | What we do is we first convert into embeddings using the previous layer, so the input embeddings.
00:16:35.540 | And these are embeddings of size 512.
00:16:38.740 | Then we create some special vectors called the positional encoding vectors that we add
00:16:43.900 | to these embeddings.
00:16:45.340 | So this vector we see here in red is a vector of size 512, which is not learned.
00:16:53.220 | It's computed once and not learned along with the training process.
00:16:57.340 | It's fixed.
00:16:58.800 | And this word, this vector represents the position of the word inside of the sentence.
00:17:04.740 | And this should give us an output that is a vector of size, again, 512, because we are
00:17:12.220 | summing this number with this number, this number with this number.
00:17:17.580 | So the first dimension with the first dimension, the second dimension with the second.
00:17:21.020 | So we will get a new vector of the same size of the input vectors.
00:17:26.380 | How are these positional embeddings calculated?
00:17:29.140 | Let's see.
00:17:31.220 | Again we have a smaller sentence, let's say "your cat is".
00:17:35.140 | And you may have seen the following expressions from the paper.
00:17:39.500 | What we do is we create a vector of size D model, so 512.
00:17:47.040 | And for each position in this vector, we calculate the value using these two expressions, using
00:17:54.900 | these arguments.
00:17:56.020 | So the first argument indicates the position of the word inside of the sentence.
00:18:01.020 | So the word "your" occupies the position zero.
00:18:04.740 | And for the even dimensions of the vector, so the zero, the two, the four, up to the 510, etc.,
00:18:13.580 | we use the first expression, so the sine.
00:18:16.740 | And for the odd positions of this vector, we use the second expression.
00:18:22.620 | And we do this for all the words inside of the sentence.
00:18:25.820 | So this particular value is calculated as PE(1,0), because it's the first word, dimension
00:18:32.900 | zero.
00:18:33.940 | So the one represents the argument "pos".
00:18:37.940 | And the zero represents the argument "2i".
00:18:42.180 | And PE(1,1) means the first word, dimension one.
00:18:48.980 | So we will use the cosine, given the position of the word, and 2i + 1 will
00:18:56.140 | be equal to 1.
00:18:59.060 | And we do this for the second word, the third word, etc.
00:19:03.320 | If we have another sentence, we will not have different positional encodings.
00:19:08.780 | We will have the same vectors, even for different sentences, because the positional encoding
00:19:15.140 | are computed once and reused for every sentence that our model will see, during inference
00:19:21.580 | or training.
00:19:23.000 | So we only compute the positional encoding once, when we create the model, we save them,
00:19:28.140 | and then we reuse them.
00:19:29.140 | We don't need to compute it every time we feed a sentence to the model.
00:19:36.520 | So why did the authors choose the cosine and the sine functions to represent positional encodings?
00:19:42.620 | Let's watch the plot of these two functions.
00:19:46.760 | You can see the plot is by position, so the position of the word inside of the sentence,
00:19:51.260 | and this depth is the dimension along the vector, so the 2i that you saw before in the
00:19:57.240 | previous expressions.
00:19:59.460 | And if we plot them, we can see, as humans, a pattern here.
00:20:02.960 | And we hope that the model can also see this pattern.
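A minimal sketch of how these fixed vectors can be precomputed once, following the sine/cosine formulas from the paper:

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))        # 1 / 10000^(2i / d_model)
    pe[:, 0::2] = torch.sin(pos * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div_term)   # odd dimensions: cosine
    return pe                                  # computed once, reused for every sentence

pe = positional_encoding(seq_len=6, d_model=512)
# embeddings_with_position = word_embeddings + pe   # element-wise sum, shape stays (6, 512)
```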
00:20:07.100 | Okay, the next layer of the encoder is the multi-head attention.
00:20:12.420 | We will not go inside of the multi-head attention first, we will first visualize the single-head
00:20:19.260 | attention, so the self-attention with a single head.
00:20:22.920 | And let's do it.
00:20:24.820 | So what is self-attention?
00:20:26.980 | Self-attention is a mechanism that existed before they introduced the transformer.
00:20:32.060 | The authors of the transformer just changed it into a multi-head attention.
00:20:37.660 | So how did the self-attention work?
00:20:41.140 | The self-attention allows the model to relate words to each other.
00:20:45.940 | Okay, so we had the input embeddings that capture the meaning of the word.
00:20:52.060 | Then we had the positional encoding that gives the information about the position of the
00:20:57.620 | word inside of the sentence.
00:20:59.400 | Now we want this self-attention to relate words to each other.
00:21:04.220 | Now imagine we have an input sequence of 6 words with a d-model of size 512, which can
00:21:13.340 | be represented as a matrix that we will call Q, K, and V.
00:21:18.140 | So our Q, K, and V are the same matrix representing the input.
00:21:25.220 | So the input of 6 words with the dimension of 512.
00:21:29.940 | So each word is represented by a vector of size 512.
00:21:34.860 | We basically apply this formula we saw here from the paper to calculate the attention,
00:21:40.180 | the self-attention in this case.
00:21:41.700 | Why self-attention?
00:21:42.780 | Because it's each word in the sentence related to other words in the same sentence.
00:21:48.660 | So it's self-attention.
00:21:51.940 | So we start with our Q matrix, which is the input sentence.
00:21:56.300 | So let's visualize it, for example.
00:21:57.780 | So we have 6 rows, and on the columns we have 512 columns.
00:22:03.300 | Now they are really difficult to draw, but let's say we have 512 columns, and here we
00:22:08.860 | have 6.
00:22:11.700 | Now what we do, according to this formula, we multiply it by the same sentence but transposed.
00:22:17.860 | So the transposed of the K, which is again the same input sequence, we divide it by the
00:22:23.940 | square root of 512, and then we apply the softmax.
00:22:29.100 | The output of this, as we saw before in the initial matrix notations, we saw that when
00:22:36.060 | we multiply 6 by 512 with another matrix that is 512 by 6, we obtain a new matrix that is
00:22:44.820 | 6 by 6.
00:22:46.300 | And each value in this matrix is a dot product: the first value is the dot product of the first row with the first
00:22:52.300 | column.
00:22:53.300 | The second value is the dot product of the first row with the second column, etc.
00:22:58.740 | The values here are actually randomly generated, so don't concentrate on the values.
00:23:03.180 | What you should notice is that the softmax makes all these values in such a way that
00:23:08.380 | they sum up to 1.
00:23:10.160 | So this row, for example, here, sums up to 1.
00:23:14.620 | This other row also sums up to 1, etc., etc.
00:23:18.220 | And this value we see here is the dot product of the embedding of the first word with the embedding of the
00:23:25.660 | word itself.
00:23:27.020 | This value here is the dot product of the embedding of the word "your" with the embedding
00:23:33.420 | of the word "cat".
00:23:35.460 | And this value here is the dot product of the embedding of the word "your" with the
00:23:40.740 | embedding of the word "is".
00:23:44.500 | And this value represents somehow a score, that how intense is the relationship between
00:23:50.860 | one word and another.
00:23:53.060 | Let's go ahead with the formula.
00:23:55.100 | So for now we just multiplied Q by the transpose of K, divided by the square root of dK, and applied the softmax,
00:24:02.020 | but we didn't multiply by V.
00:24:04.760 | So let's go forward.
00:24:06.500 | We multiply this matrix by V, and we obtain a new matrix, which is 6 by 512.
00:24:12.740 | So if we multiply a matrix that is 6 by 6 with another that is 6 by 512, we get a new
00:24:18.860 | matrix that is 6 by 512.
00:24:21.780 | And one thing you should notice is that the dimension of this matrix is exactly the dimension
00:24:26.380 | of the initial matrix from which we started.
00:24:29.820 | This, what does it mean?
00:24:32.420 | That we obtain a new matrix that is 6 rows, so let's say 6 rows, with 512 columns, in
00:24:40.780 | which each, these are our words, so we have 6 words, and each word has an embedding of
00:24:46.420 | dimension 512.
00:24:48.020 | So now this embedding here represents not only the meaning of the word, which was given
00:24:55.020 | by the input embedding, not only the position of the word, which was added by the positional
00:25:00.460 | encoding, but now somehow this special embedding, so these values represent a special embedding
00:25:06.740 | that also captures the relationship of this particular word with all the other words.
00:25:14.300 | And this particular embedding of this word here also captures not only its meaning, not
00:25:20.060 | only its position inside of the sentence, but also the relationship of this word with
00:25:25.360 | all the other words.
00:25:27.420 | I want to remind you that this is not the multi-head attention, we are just watching
00:25:31.240 | the self-attention, so one head.
00:25:34.060 | We will see later how this becomes the multi-head attention.
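A minimal sketch of this single-head self-attention in PyTorch (the optional mask argument anticipates the masking discussed below and in the decoder):

```python
import math
import torch

def self_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)        # (seq, seq) matrix of scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # hide unwanted interactions
    weights = torch.softmax(scores, dim=-1)                    # each row sums to 1
    return weights @ v                                         # (seq, d_model), same shape as the input

x = torch.randn(6, 512)          # 6 words, d_model = 512
out = self_attention(x, x, x)    # Q = K = V = the input sentence
```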
00:25:41.780 | Self-attention has some properties that are very desirable.
00:25:45.180 | First of all, it's permutation invariant.
00:25:47.860 | What does it mean to be permutation invariant?
00:25:49.980 | It means that if we have a matrix, let's say, first we have a matrix of 6 words, in this
00:25:57.420 | case let's say just 4 words, so A, B, C, and D, and suppose by applying the formula before
00:26:04.700 | this produces this particular matrix in which there is new special embedding for the word
00:26:12.220 | A, a new special embedding for the word B, a new special embedding for the word C and
00:26:16.060 | D, so let's call it A', B', C', D'.
00:26:18.900 | If we change the position of these two rows, the values will not change, the position of
00:26:25.180 | the output will change accordingly.
00:26:26.860 | So the values of B' will not change, it will just change position, and
00:26:32.160 | C' will also change position, but the values in each vector will not change, and this is
00:26:36.620 | a desirable property.
00:26:39.220 | Self-attention as of now requires no parameters, I mean, I didn't introduce any parameter that
00:26:44.500 | is learned by the model, I just took the initial sentence of, in this case, 6 words, we multiplied
00:26:52.220 | it by itself, we divide it by a fixed quantity which is the square root of 512 and then we
00:26:58.860 | apply the softmax which is not introducing any parameter, so for now the self-attention
00:27:04.100 | didn't require any parameter except for the embedding of the words.
00:27:10.100 | This will change later when we introduce the multi-head attention.
00:27:14.860 | So we expect, because each value in the self-attention, in the softmax matrix, is a dot product of
00:27:23.020 | the word embedding with itself and the other words, we expect the values along the diagonal
00:27:28.220 | to be the maximum, because it's the dot product of each word with itself.
00:27:35.420 | And there is another property of this matrix, that is, before we apply the softmax, if we
00:27:43.620 | replace the value in this matrix, suppose we don't want the word your and cat to interact
00:27:50.100 | with each other, or we don't want the word, let's say, is and the lovely to interact with
00:27:54.820 | each other, what we can do is, before we apply the softmax, we can replace this value with
00:27:59.980 | -infinity and also this value with -infinity, and when we apply the softmax, the softmax
00:28:08.900 | will replace -infinity with 0, because as you remember the softmax is e to the power
00:28:15.300 | of x, if x is going to -infinity, e to the power of -infinity will become very very close
00:28:21.700 | to 0, so basically 0.
00:28:25.580 | This is a desirable property that we will use in the decoder of the transformer.
00:28:30.660 | Now let's have a look at what is a multi-head attention.
00:28:34.140 | So what we just saw was the self attention, and we want to convert it into a multi-head
00:28:39.420 | attention.
00:28:40.420 | You may have seen these expressions from the paper, but don't worry, I will explain them
00:28:44.500 | one by one.
00:28:45.500 | So let's go.
00:28:47.540 | Imagine we have our encoder, so we are on the encoder side of the transformer, and we
00:28:52.900 | have our input sentence, which is, let's say, 6 by 512: 6 words, where 512 is the size of
00:29:01.300 | the embedding of each word.
00:29:03.280 | In this case I call it sequence by dmodel, so sequence is the sequence length, as you
00:29:07.980 | can see on the legend in the bottom left of the slide, and the dmodel is the size of the
00:29:14.300 | embedding vector, which is 512.
00:29:17.740 | What we do, just like the picture shows, we take this input and we make 4 copies of it.
00:29:25.100 | One will be sent along this connection we can see here, and 3 will be sent to the multi-head
00:29:34.380 | attention with 3 respective names, so it's the same input that becomes 3 matrices that
00:29:41.360 | are equal to input.
00:29:43.160 | One is called query, one is called key, and one is called value.
00:29:47.160 | So basically we are taking this input and making 3 copies of it, which we call q, k, and v.
00:29:52.880 | They have, of course, the same dimension.
00:29:55.240 | What does the multi-head attention do?
00:29:57.080 | First of all it multiplies these 3 matrices by 3 parameter matrices called wq, wk, and wv.
00:30:07.120 | These matrices have dimension dmodel by dmodel.
00:30:10.640 | So if we multiply a matrix that is sequence by dmodel with another one that is dmodel
00:30:15.520 | by dmodel, we get a new matrix as output that is sequence by dmodel.
00:30:21.420 | So basically the same dimension as the starting matrix.
00:30:26.140 | And we will call them q', k' and v'.
00:30:30.560 | Our next step is to split these matrices into smaller matrices.
00:30:35.560 | Let's see how.
00:30:37.160 | We can split this matrix q' by the sequence dimension or by the dmodel dimension.
00:30:44.720 | In the multi-head attention we always split by the dmodel dimension.
00:30:48.780 | So every head will see the full sentence but a smaller part of the embedding of each word.
00:30:57.280 | So if we have an embedding of let's say 512, it will become smaller embeddings of 512 divided
00:31:05.280 | by 4.
00:31:06.860 | And we call this quantity dk.
00:31:09.020 | So dk is dmodel divided by h, where h is the number of heads.
00:31:13.600 | In our case we have h equal to 4.
00:31:17.560 | We can calculate the attention between these smaller matrices, so q1, k1 and v1 using the
00:31:23.860 | expression taken from the paper.
00:31:28.600 | And this will result into a small matrix called head1, head2, head3 and head4.
00:31:36.460 | The dimension of head1 up to head4 is sequence by dv.
00:31:43.200 | What is dv? Basically it's equal to dk; it's just called dv because the last multiplication
00:31:49.740 | is done by v, and in the paper they call it dv, so I'm also sticking to the same names.
00:31:56.140 | Our next step is to combine these matrices, these small heads, by concatenating them along
00:32:05.560 | the dv dimension, just like the paper says.
00:32:09.660 | So we concat all these heads together and we get a new matrix that is sequence by h
00:32:16.380 | multiplied by dv, where h multiplied by dv, as we know dv is equal to dk, so h multiplied
00:32:24.500 | by dv is equal to dmodel.
00:32:26.820 | So we get back the initial shape, so it's sequence by dmodel here.
00:32:35.040 | The next step is to multiply the result of this concatenation by wo and wo is a matrix
00:32:42.160 | that is h multiplied by dv, so dmodel, with the other dimension being dmodel.
00:32:48.960 | And the result of this is a new matrix that is the result of the multi-head attention
00:32:53.700 | which is sequence by dmodel.
00:32:56.920 | So the multi-head attention, instead of calculating the attention between these matrices here,
00:33:04.000 | so q', k' and v', splits them along the dmodel dimension into smaller matrices and calculates
00:33:12.720 | the attention between these smaller matrices.
00:33:15.720 | So each head is watching the full sentence but a different aspect of the embedding of
00:33:22.160 | each word.
00:33:23.640 | Why do we want this?
00:33:25.100 | Because we want each head to watch a different aspect of the same word.
00:33:30.480 | For example, in the Chinese language, but also in other languages, one word may be a
00:33:35.640 | noun in some cases, may be a verb in some other cases, may be an adverb in some other
00:33:40.400 | cases, depending on the context.
00:33:43.380 | So what we want is that one head maybe learns to relate that word as a noun, another head
00:33:50.480 | maybe learns to relate that word as a verb, and another head learns to relate that word
00:33:56.120 | as an adjective or an adverb.
00:33:59.000 | So this is why we want multi-head attention.
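A minimal sketch of the multi-head attention just described, splitting along the embedding dimension (simplified: no dropout, no batch dimension):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=4):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W_Q: (d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)   # W_K
        self.w_v = nn.Linear(d_model, d_model)   # W_V
        self.w_o = nn.Linear(d_model, d_model)   # W_O: (h * d_v, d_model)

    def forward(self, q, k, v, mask=None):
        seq_len = q.size(0)

        def split(x):  # (seq, d_model) -> (h, seq, d_k): every head sees the whole
            return x.view(seq_len, self.h, self.d_k).transpose(0, 1)  # sentence, but a slice of each embedding

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ v                       # (h, seq, d_k)
        concat = heads.transpose(0, 1).contiguous().view(seq_len, -1)   # (seq, h * d_v)
        return self.w_o(concat)                                         # (seq, d_model)

mha = MultiHeadAttention()
x = torch.randn(6, 512)
out = mha(x, x, x)    # self-attention: Q, K and V are the same sentence
```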
00:34:03.160 | Now you may also have seen online that the attention can be visualized.
00:34:09.720 | And I will show you how.
00:34:12.040 | When we calculate the attention between the q and the k matrices, so when we do this operation,
00:34:18.800 | so the softmax of q multiplied by the transpose of k, divided by the square root of dk, we get a new matrix
00:34:26.240 | just like we saw before, which is sequence by sequence.
00:34:29.760 | And this represents a score that represents the intensity of the relationship between
00:34:35.120 | the two words.
00:34:37.160 | We can visualize this, and this will produce a visualization similar to this one, which
00:34:44.240 | I took from the paper, in which we see how all the heads work.
00:34:48.360 | So for example, if we concentrate on this word "making", this word here, we can see
00:34:53.560 | that "making" is related to the word "difficult", so this word here, by different heads.
00:34:59.300 | So the blue head, the red head, and the green head.
00:35:03.560 | But let's say the violet head is not relating these two words together.
00:35:09.500 | So "making" and "difficult" is not related by the violet or the pink head.
00:35:14.920 | The violet head or the pink head, they are relating the word "making" to other words,
00:35:20.300 | for example to this word "2009".
00:35:25.080 | Why this is the case?
00:35:26.480 | Because maybe this pink head could see the part of the embedding that these other heads
00:35:32.080 | could not see, that made this interaction possible between these two words.
00:35:41.120 | You may also be wondering why these three matrices are called "queries", "keys" and "values".
00:35:47.080 | Okay, the terms come from the database terminology, or from the Python-like dictionaries.
00:35:53.520 | But I would also like to give an interpretation of my own, making a very simple example.
00:35:57.960 | I think it's quite easy to understand.
00:36:03.540 | So imagine we have a Python-like dictionary, or a database, in which we have keys and values.
00:36:10.180 | The keys are the category of movies, and the values are the movies belonging to that category.
00:36:16.360 | In my case, I just put one value.
00:36:19.800 | So we have Romantics category, which includes Titanic, we have action movies that include
00:36:25.160 | The Dark Knight, etc.
00:36:27.280 | Imagine we also have a user that makes a query, and the query is "love".
00:36:32.980 | Because we are in the transformer world, all these words actually are represented by embeddings
00:36:37.880 | of size 512.
00:36:40.440 | So what our transformer will do is, it will convert this word "love" into an embedding of 512.
00:36:46.680 | All these keys and values are already embeddings of 512, and it will calculate the dot product
00:36:53.480 | between the query and all the keys, just like the formula.
00:36:57.960 | So as you remember, the formula is a softmax of query multiplied by the transpose of the
00:37:02.780 | keys, divided by the square root of d_k.
00:37:06.320 | So we are doing the dot product of all the queries with all the keys.
00:37:11.140 | In this case, the word "love" with all the keys, one by one.
00:37:15.520 | And this will result in a score that will amplify some values or not amplify other values.
00:37:25.920 | In this case, our embedding may be in such a way that the word "love" and "romantic"
00:37:31.560 | are related to each other, the word "love" and "comedy" are also related to each other,
00:37:37.000 | but not so intensively like the word "love" and "romantic".
00:37:41.280 | So it's more, how to say, less strong relationship.
00:37:46.000 | But maybe the word "horror" and "love" are not related at all, so maybe their softmax
00:37:50.700 | score is very close to zero.
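As a toy numerical version of this "soft lookup", here is a sketch with tiny made-up embeddings of size 4 instead of 512:

```python
import torch

query = torch.tensor([0.9, 0.1, 0.0, 0.2])              # "love" (made-up values)
keys = torch.tensor([[0.8, 0.2, 0.1, 0.1],              # "romantic"
                     [0.1, 0.9, 0.0, 0.3],              # "action"
                     [-0.7, 0.0, 0.8, 0.1]])            # "horror"
values = torch.randn(3, 4)                               # embeddings of the movies

scores = torch.softmax(query @ keys.T / 4 ** 0.5, dim=-1)  # how related the query is to each key
result = scores @ values    # a blend of the values, weighted by how well each key matches
print(scores)               # "romantic" gets the largest weight, "horror" the smallest
```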
00:37:56.600 | Our next layer in the encoder is the add and norm.
00:38:02.520 | And to introduce the add and norm, we need the layer normalization.
00:38:05.820 | So let's see what is the layer normalization.
00:38:08.120 | Layer normalization is a layer that, okay, let's make a practical example.
00:38:15.480 | Imagine we have a batch of n items, in this case n is equal to 3: item 1, item 2, item 3.
00:38:25.240 | Each of these items will have some features, it could be an embedding, so for example it
00:38:30.360 | could be a feature vector of size 512, but it could be a very big matrix of thousands
00:38:36.100 | of features, doesn't matter.
00:38:38.320 | What we do is we calculate the mean and the variance of each of these items independently
00:38:43.320 | from each other, and we replace each value with another value that is given by this expression.
00:38:49.820 | So basically we are normalizing each item so that its new values have zero mean and unit variance.
00:38:57.160 | Actually we also multiply this new value with a parameter called gamma, and then we add
00:39:03.900 | another parameter called beta, and this gamma and beta are learnable parameters.
00:39:10.280 | And the model should learn to multiply and add these parameters so as to amplify the
00:39:17.180 | value that it wants to be amplified and not amplify the value that it doesn't want to
00:39:22.320 | be amplified.
00:39:25.260 | So we don't just normalize, we actually introduce some parameters.
00:39:30.360 | And I found a really nice visualization from paperswithcode.com in which we see the difference
00:39:36.420 | between batch norm and layer norm.
00:39:39.060 | So as we can see, in the layer normalization, if n is the batch dimension,
00:39:45.720 | we are calculating the statistics over all the values belonging to one item in the batch, while in the batch
00:39:51.880 | norm we are calculating the same feature across the whole batch,
00:39:57.440 | so for all the items in the batch.
00:39:59.800 | So we are mixing, let's say, values from different items of the batch, while in the layer normalization
00:40:06.200 | we are treating each item in the batch independently, which will have its own mean and its own variance.
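A minimal sketch of layer normalization with its learnable gamma and beta, computing the statistics per item over the feature dimension:

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, features=512, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(features))   # learnable multiplicative parameter
        self.beta = nn.Parameter(torch.zeros(features))   # learnable additive parameter
        self.eps = eps                                     # avoids division by zero

    def forward(self, x):
        # Mean and standard deviation per item (last dimension), not across the batch.
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

ln = LayerNormalization()
out = ln(torch.randn(3, 512))   # a batch of 3 items, each normalized independently
```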
00:40:15.000 | Let's look at the decoder now.
00:40:17.440 | Now, in the encoder we saw the input embeddings, in this case they are called output embeddings,
00:40:24.600 | but the underlying working is the same.
00:40:27.960 | Here also we have the positional encoding, and they are also the same as the encoder.
00:40:35.880 | The next layer is the masked multihead attention, and we will see it now.
00:40:41.160 | We also have the multihead attention here; here we should see that there is the encoder,
00:40:49.980 | which produces an output that is sent to the decoder in the form of keys and values.
00:41:02.240 | While the query, so this connection here is the query coming from the decoder.
00:41:08.120 | So in this multihead attention, it's not a self-attention anymore, it's a cross-attention
00:41:14.040 | because we are taking two sentences.
00:41:16.420 | One is sent from the encoder side, so let's write encoder, in which we provide the output
00:41:22.480 | of the encoder and we use it as keys and values, while the output of the masked multihead attention
00:41:30.160 | is used as the query in this multihead attention.
00:41:34.840 | The masked multihead attention is a self-attention of the input sentence of the decoder.
00:41:40.760 | So we take the input sentence of the decoder, we transform into embeddings, we add the positional
00:41:46.880 | encoding, we give it to this multihead attention in which the query key and values are the
00:41:51.520 | same input sequence, we do the add and norm, then we send this as the queries of the multihead
00:42:00.080 | attention, while the keys and the values are coming from the encoder, then we do the add
00:42:04.680 | and norm.
00:42:05.680 | I will not be showing the feedforward, which is just a fully connected layer.
00:42:10.840 | We then send the output of the feedforward to the add and norm, and finally to the linear
00:42:16.040 | layer, which we will see later.
00:42:18.640 | So let's have a look at the masked multihead attention and how it differs from a normal
00:42:23.300 | multihead attention.
00:42:26.360 | What we want, our goal, is that we want to make the model causal.
00:42:31.080 | It means that the output at a certain position can only depend on the words in the previous
00:42:36.880 | positions.
00:42:37.880 | So the model must not be able to see future words.
00:42:41.400 | How can we achieve that?
00:42:44.120 | As you saw, the output of the softmax in the attention calculation formula is this matrix,
00:42:51.240 | sequence by sequence.
00:42:52.240 | If we want to hide the interaction of some words with other words, we delete this value
00:42:58.320 | and we replace it with minus infinity before we apply the softmax, so that the softmax
00:43:04.540 | will replace this value with zero.
00:43:08.520 | And we do this for all the interaction that we don't want.
00:43:12.100 | So we don't want "your" to watch future words.
00:43:15.000 | So we don't want "your" to watch "cat is a lovely cat".
00:43:19.080 | And we don't want the word "cat" to watch future words, but only all the words that come before
00:43:23.880 | it or the word itself.
00:43:25.920 | So we don't want this, this, this, this.
00:43:29.160 | Also the same for the other words, etc.
00:43:32.880 | So we can see that we are replacing all these values here that are above
00:43:40.000 | this diagonal here.
00:43:41.840 | So this is the principal diagonal of the matrix.
00:43:44.480 | And we want all the values that are above this diagonal to be replaced with minus infinity
00:43:50.160 | so that the softmax will replace them with zero.
00:43:54.840 | Let's see in which stage of the multi-head attention this mechanism is introduced.
00:44:00.960 | So when we calculate the attention between these smaller matrices, so Q1, K1, and V1, before
00:44:09.720 | we apply the softmax, we replace these values.
00:44:12.960 | So this one, this one, this one, this one, this one, etc. with minus infinity.
00:44:18.720 | Then we apply the softmax and then the softmax will take care of transforming these values
00:44:25.040 | into zeros.
00:44:26.200 | So basically we don't want these words to interact with each other.
00:44:31.440 | And if we don't want this interaction, the model will learn to not make them interact
00:44:35.480 | because the model will not get any information from this interaction.
00:44:39.040 | So it's like this word cannot interact.
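A minimal sketch of how such a causal mask can be built and applied to the score matrix before the softmax:

```python
import torch

seq_len = 6
scores = torch.randn(seq_len, seq_len)           # stand-in for Q K^T / sqrt(d_k)

# Lower-triangular matrix of ones: position i may only look at positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = scores.masked_fill(causal_mask == 0, float("-inf"))  # values above the diagonal
weights = torch.softmax(scores, dim=-1)                       # -inf becomes 0 after the softmax

print(weights[0])   # the first word can only attend to itself: [1, 0, 0, 0, 0, 0]
```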
00:44:41.620 | Now let's look at how the inference and training works for a transformer model.
00:44:46.780 | As I said previously, we will be dealing with a translation
00:44:52.360 | task,
00:44:53.360 | because it's easy to visualize and it's easy to understand all the steps.
00:44:58.200 | Let's start with the training of the model.
00:45:01.040 | We will go from an English sentence "I love you very much" into an Italian sentence "Ti
00:45:05.960 | amo molto".
00:45:06.960 | It's a very simple sentence.
00:45:07.960 | It's easy to describe.
00:45:11.440 | Let's go.
00:45:12.800 | We start with a description of the transformer model and we start with our English sentence
00:45:20.240 | which is sent to the encoder.
00:45:22.460 | So here is our English sentence, to which we prepend and append two special tokens.
00:45:29.200 | One is called start of sentence and one is called end of sentence.
00:45:33.520 | These two tokens are taken from the vocabulary.
00:45:36.960 | So they are special tokens in our vocabulary that tell the model what is the start position
00:45:43.120 | of a sentence and what is the end of a sentence.
00:45:46.480 | We will see later why we need them.
00:45:49.240 | For now just think that we take our sentence, we prepend a special token and we append a
00:45:53.920 | special token.
00:45:56.160 | Then what we do?
00:45:57.160 | As you can see from the picture, we take our inputs, we transform into input embeddings,
00:46:02.000 | we add the positional encoding and then we send it to the encoder.
00:46:06.560 | So this is our encoder input, sequence by D model, we send it to the encoder, it will
00:46:11.040 | produce an output which is sequence by D model and it's called encoder output.
00:46:17.600 | So as we saw previously, the output of the encoder is another matrix that has the same
00:46:23.560 | dimension as the input matrix in which the embedding, we can see it as a sequence of
00:46:31.120 | embeddings in which this embedding is special because it captures not only the meaning of
00:46:36.160 | the word which was given by the input embedding we saw here, so by this, not only the position
00:46:42.480 | which was given by the positional encoding, but also the interaction of every word with
00:46:48.420 | every other word in the same sentence because this is the encoder.
00:46:52.520 | So we are talking about self-attention.
00:46:55.080 | So it's the interaction of each word in the sentence with all the other words in the same
00:47:00.240 | sentence.
00:47:03.040 | We want to convert this sentence into Italian, so we prepare the input of the decoder, which
00:47:08.440 | is start of sentence "ti amo molto".
00:47:12.180 | As you can see from the picture of the transformer, the outputs here you can see shifted right.
00:47:19.060 | What does it mean to shift right?
00:47:20.060 | Basically, it means we prepend a special token called SOS, start of sentence.
00:47:26.880 | You should also notice that these two sequences actually, when we code the transformer, so
00:47:35.300 | if you watch my other video on how to code a transformer, you will see that we make this
00:47:39.920 | sequence of fixed length so that if we have a sentence that is "ti amo molto" or a very
00:47:44.840 | long sequence, actually when we feed them to the transformer, they all become of the
00:47:50.680 | same length.
00:47:53.360 | How to do this?
00:47:54.360 | We add padding words to reach the desired length.
00:47:58.360 | So if our model can support, let's say, a sequence length of 1000, in this case we have
00:48:03.800 | 4 tokens, we will add 996 tokens of padding to make this sentence long enough to reach
00:48:12.280 | the sequence length.
00:48:13.640 | Of course, I'm not doing it here because it's not easy to visualize otherwise.
00:48:17.760 | Okay, we prepare this input for the decoder.
00:46:21.640 | We transform it into embeddings, we add the positional encoding, then we send it first
00:48:28.360 | to the multi-head attention, to the masked multi-head attention, so along with the causal
00:48:32.680 | mask and then we take the output of the encoder and we send it to the decoder as keys and
00:48:41.360 | values, while the queries are coming from the masked multi-head attention, so the queries are coming from this
00:48:47.760 | layer and the keys and the values are the output of the encoder.
00:48:53.200 | The output of all this block here, so all this big block here, will be a matrix that
00:49:00.400 | is sequence by d_model, just like for the encoder.
00:49:04.720 | However, we can see that this is still an embedding, because it has d_model dimensions, so it's a vector
00:49:11.280 | of size 512.
00:49:12.920 | How can we relate this embedding back into our dictionary?
00:49:17.840 | How can we understand what is this word in our vocabulary?
00:49:23.520 | That's why we need a linear layer that will map sequence by D model into another sequence
00:49:29.960 | by vocabulary size.
00:49:31.720 | So it will tell for every embedding that it sees what is the position of that word in
00:49:37.560 | our vocabulary, so that we can understand what is the actual token that is output by
00:49:42.860 | the model.
00:49:46.180 | After that we apply the softmax and then we have our label, what we expect the model to
00:49:54.000 | output given this English sentence.
00:49:59.980 | We expect the model to output this "ti amo molto" end of sentence and this is called
00:50:06.480 | the label or the target.
00:50:08.680 | What we do when we have the output of the model and the corresponding label?
00:50:13.080 | We calculate the loss, in this case the cross entropy loss, and then we backpropagate
00:50:18.640 | the loss to all the weights.
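A minimal sketch of the loss computation for one training step, with random logits and made-up token IDs standing in for the decoder output (after the linear layer) and the target:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 10_000, 4                  # toy numbers for illustration

# Hypothetical decoder output after the linear layer: one row of logits per position.
logits = torch.randn(seq_len, vocab_size, requires_grad=True)
# Hypothetical target token IDs for: ti amo molto <EOS>
label = torch.tensor([7, 42, 256, 2])

loss = nn.CrossEntropyLoss()(logits, label)      # cross entropy between predictions and label
loss.backward()                                   # backpropagate the loss to all the weights
```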
00:50:21.360 | Now let's understand why we have these special tokens called SOS and EOS.
00:50:27.320 | Basically you can see that here the sequence length is 4, actually it's 1000 because we
00:50:32.000 | have the padding, but let's say we don't have any padding, so it's 4 tokens: start
00:50:36.160 | of sentence "ti amo molto", and what we want is "ti amo molto" end of sentence.
00:50:42.100 | So our model, when it will see the start of sentence token, it will output the first token
00:50:49.840 | as output "ti".
00:50:51.920 | When it will see "ti" it will output "amo", when it will see "amo" it will output "molto"
00:50:59.400 | and when it will see "molto" it will output end of sentence, which will indicate that
00:51:05.240 | ok, the translation is done, and we will see this mechanism in the inference.
00:51:11.920 | Ah, this all happens in one time step, just like I promised at the beginning of the video.
00:51:19.400 | I said that with recurrent neural networks we need n time steps to map an n-token input sequence
00:51:27.160 | into an n-token output sequence, but this problem would be solved with the transformer, and yes, it has
00:51:33.880 | been solved, because you can see here we didn't do any for loop, we just did it all in one pass,
00:51:40.160 | we give an input sequence to the encoder, an input sequence to the decoder, we produced
00:51:46.240 | some outputs, we calculated the cross entropy loss with the label and that's it, it all
00:51:52.240 | happens in one time step, and this is the power of the transformer, because it made
00:51:57.200 | it very easy and very fast to train very long sequences and with very very nice performance
00:52:04.480 | that you can see in chatGPT, you can see in GPT, in BERT, etc.
00:52:10.400 | Let's have a look at how inference works.
00:52:14.560 | Again we have our English sentence "I love you very much", we want to map it into an
00:52:18.740 | Italian sentence "ti amo molto".
00:52:22.720 | We have our usual transformer, we prepare the input for the encoder, which is start
00:52:28.320 | of sentence "I love you very much", end of sentence.
00:52:31.920 | We convert into input embeddings, then we add the positional encoding, we prepare the
00:52:35.680 | input for the encoder and we send it to the encoder.
00:52:39.000 | The encoder will produce an output, which is sequence by d_model, and we saw it before
00:52:43.320 | that it's a sequence of special embeddings that capture the meaning, the position, but
00:52:48.040 | also the interaction of all the words with other words.
00:52:52.720 | What we do is, for the decoder, we give it just the start of sentence, and of course
00:52:59.280 | we add enough padding tokens to reach our sequence length.
00:53:04.680 | We just give the model the start of sentence token, and again, for this single token we
00:53:11.520 | convert into embeddings, we add the positional encoding and we send it to the decoder as
00:53:16.880 | decoder input.
00:53:18.440 | The decoder will take its input as the query, and the keys and the values coming from
00:53:25.160 | the encoder, and it will produce an output, which is sequence by d_model.
00:53:31.400 | Again, we want the linear layer to project it back to our vocabulary, and this projection
00:53:36.960 | is called logits.
00:53:39.600 | What we do is, we apply the softmax, which will select, given the logits, the position
00:53:47.440 | of the output word: the one that has the maximum score after the softmax.
00:53:52.780 | This is how we know what words to select from the vocabulary.
00:53:57.480 | And this, hopefully, should produce the first output token, which is "ti", if the model has
00:54:03.540 | been trained correctly.
00:54:06.060 | This, however, happens at time step 1, so when we train the model, the transformer model,
00:54:11.780 | it happens in one pass, so we have one input sequence, one output sequence, we give it
00:54:16.400 | to the model, we do it in one time step, and the model will learn it.
00:54:20.340 | When we run inference, however, we need to do it token by token, and we will also see why
00:54:24.380 | this is the case.
00:54:27.460 | At time step 2, we don't need to recompute the encoder output again, because our English
00:54:36.740 | sentence didn't change, so we hope the encoder should produce the same output for it.
00:54:44.340 | And then, what we do is, we take the output of the previous step, so "ti", we append
00:54:52.740 | it to the input of the decoder, and then we feed it to the decoder, again with the output
00:55:00.420 | of the encoder from the previous step, which will produce an output sequence from the decoder
00:55:05.860 | side, which we again project back into our vocabulary, and we get the next token, which
00:55:13.340 | is AMO.
00:55:14.780 | So, as I said before, we are not recalculating the output of the encoder for every time step,
00:55:23.700 | because our English sentence didn't change at all.
00:55:26.780 | What is changing is the input of the decoder, because at every time step, we are appending
00:55:31.060 | the output of the previous step to the input of the decoder.
00:55:35.220 | We do the same for the time step 3, and we do the same for the time step 4.
00:55:42.140 | And hopefully, we will stop when we see the end of sentence token, because that's how
00:55:48.300 | the model tells us to stop inferencing.
00:55:52.020 | And this is how the inference works.
00:55:54.320 | This is why we needed four time steps.
00:55:57.460 | When we run inference with a model, like in this case the translation model, there are many strategies
00:56:04.060 | for inference.
00:56:05.300 | What we used is called greedy strategy.
00:56:07.740 | So for every step, we get the word with the maximum softmax value.
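A minimal sketch of this greedy decoding loop; decode_step below is a placeholder for the real decoder plus linear layer, and the special-token IDs are made up:

```python
import torch

SOS, EOS, vocab_size, max_len = 0, 1, 10_000, 50   # hypothetical IDs and limits

def decode_step(encoder_output, decoder_input):
    # Placeholder: the real model would return the logits for the last position.
    return torch.randn(vocab_size)

encoder_output = torch.randn(8, 512)   # computed once, reused at every time step
decoder_input = [SOS]                  # start with only the start-of-sentence token

for _ in range(max_len):
    logits = decode_step(encoder_output, torch.tensor(decoder_input))
    next_token = int(logits.argmax())  # greedy: pick the token with the maximum score
    decoder_input.append(next_token)   # append it and feed it back at the next time step
    if next_token == EOS:              # stop when the model outputs end of sentence
        break
```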
00:56:14.500 | And this strategy usually works not badly, but there are better strategies, and
00:56:22.700 | one of them is called beam search.
00:56:25.100 | In beam search, instead of always greedily taking the maximum softmax value (that's why
00:56:30.780 | the other strategy is called greedy), we take the top B values, and then for each
00:56:38.780 | of these choices, we infer what the next possible tokens are for each of the top B
00:56:45.860 | values at every step, and we keep only the B most probable sequences, and
00:56:53.380 | we delete the others.
00:56:55.440 | This is called beam search, and generally it performs better.
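And a simplified sketch of beam search, assuming a hypothetical decode_step function that returns log-probabilities over the vocabulary for the next token:

```python
import torch

def beam_search(decode_step, encoder_output, sos, eos, beam_size=3, max_len=50):
    beams = [([sos], 0.0)]                                  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                              # finished sequences are kept as they are
                candidates.append((seq, score))
                continue
            log_probs = decode_step(encoder_output, torch.tensor(seq))
            top = torch.topk(log_probs, beam_size)          # expand with the top-B next tokens
            for lp, tok in zip(top.values, top.indices):
                candidates.append((seq + [int(tok)], score + float(lp)))
        # Keep only the B most probable sequences and delete the others.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]                                      # the most probable sequence found
```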
00:57:00.880 | So thank you guys for watching.
00:57:03.560 | I know it was a long video, but it was really worth it to go through each aspect of the
00:57:09.220 | transformer.
00:57:10.220 | I hope you enjoyed this journey with me, so please subscribe to the channel, and don't
00:57:14.680 | forget to watch my other video on how to code a transformer model from scratch, in which
00:57:19.860 | I describe not only again the structure of the transformer model while coding it, but
00:57:26.040 | I also show you how to train it on a dataset of your choice, how to run inference with it, and I
00:57:33.680 | also provided the code on GitHub, and a Colab notebook to train the model directly on Colab.
00:57:44.480 | Please subscribe to the channel, and let me know what you didn't understand, so that I
00:57:49.480 | can give more explanation, and please tell me what are the problems in this kind of videos,
00:57:55.520 | or in this particular video, that I can improve for the next videos.
00:58:00.320 | Thank you very much, and have a great rest of the day!