Attention is all you need (Transformer) - Model explanation (including math), Inference and Training
Chapters
0:00 Intro
1:10 RNN and their problems
8:04 Transformer Model
9:02 Maths background and notations
12:20 Encoder (overview)
12:31 Input Embeddings
15:04 Positional Encoding
20:08 Single Head Self-Attention
28:30 Multi-Head Attention
35:39 Query, Key, Value
37:55 Layer Normalization
40:13 Decoder (overview)
42:24 Masked Multi-Head Attention
44:59 Training
52:09 Inference
00:00:00.000 |
Hello guys, welcome to my video about the transformer and this is actually the version 2.0 of my series on the transformer. 00:00:08.500 |
I had a previous video in which I talked about the transformer, but the audio quality was not good, and, 00:00:15.520 |
since the video had a huge success, my viewers suggested that I improve the audio quality, so this is why I'm making this video. 00:00:24.120 |
You don't have to watch the previous series, because I will be covering basically the same things but with some improvements, 00:00:30.080 |
so I'm actually correcting some mistakes I made and adding some improvements. 00:00:35.580 |
After watching this video, I suggest watching my other video about how to code a transformer model from scratch, 00:00:43.500 |
so how to code the model itself, how to train it on data and how to run inference with it. 00:00:49.220 |
Stick with me because it's gonna be a little long journey but for sure worth it. 00:00:54.160 |
Now, before we talk about the transformer, I want to first talk about recurrent neural networks, 00:01:00.320 |
so the networks that were used, before the transformer was introduced, for most sequence-to-sequence tasks. 00:01:11.160 |
Recurrent neural networks existed a long time before the transformer, and they allowed us to map one input sequence to another output sequence. 00:01:21.400 |
In this case, our input is a sequence X and we want an output sequence Y. 00:01:26.680 |
What we did before is that we split the sequence into single items, so we gave the recurrent neural network the first item as input, so X1, 00:01:36.760 |
along with an initial state, usually made up of only zeros, and the recurrent neural network produced an output, let's call it Y1. 00:01:49.740 |
Then we took the hidden state, this is called the hidden state of the network of the previous time step, 00:01:56.760 |
along with the next input token, so X2, and the network had to produce the second output token, Y2. 00:02:06.360 |
And then we did the same procedure at the third time step, in which we took the hidden state of the previous time step, 00:02:13.500 |
along with the input token at the time step 3, and the network had to produce the next output token, which is Y3. 00:02:23.520 |
If you have n tokens, you need n time steps to map an n-sequence input into an n-sequence output. 00:02:33.320 |
This worked fine for a lot of tasks, but had some problems. Let's review them. 00:02:40.760 |
The problems with recurrent neural networks, first of all, are that they are slow for long sequences, 00:02:46.800 |
because think of the process we did before, we have kind of like a for loop in which we do the same operation for every token in the input. 00:02:57.080 |
So the longer the sequence, the longer this computation takes, and this made the network not easy to train on long sequences. 00:03:07.120 |
The second problem was the vanishing or the exploding gradients. 00:03:11.280 |
Now, you may have heard these terms or expressions on the Internet or from other videos, 00:03:16.160 |
but I will try to give you a brief insight into what they mean on a practical level. 00:03:23.080 |
So as you know, frameworks like PyTorch, they convert our networks into a computation graph. 00:03:31.200 |
So basically, suppose we have a computation graph. This is not a neural network. 00:03:36.320 |
I will be making a computational graph that is very simple, has nothing to do with neural networks, but will show you the problems that we have. 00:03:45.120 |
So imagine we have two inputs X and another input, let's call it Y. 00:03:51.880 |
Our computational graph first, let's say, multiplies these two numbers. 00:03:55.880 |
So we have a first function, let's call it f of X and Y. 00:04:02.560 |
That is, f(X, Y) = X multiplied by Y. 00:04:08.000 |
And the result, let's call it Z, is given to another function. 00:04:14.440 |
Let's call this function g, with g(Z) equal to, let's say, Z squared. 00:04:22.400 |
What PyTorch does, for example, is that, since we usually have a loss function, 00:04:30.160 |
PyTorch calculates the derivative of the loss function with respect to each weight. 00:04:35.840 |
In this case, we just calculate the derivative of the g function, so the output function with respect to all of its inputs. 00:04:42.520 |
So the derivative of g with respect to X is equal to the derivative of g with respect to f, 00:04:55.680 |
multiplied by the derivative of f with respect to X. 00:05:02.640 |
The intermediate terms kind of cancel out; this is called the chain rule. 00:05:07.040 |
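To make this concrete, here is a minimal sketch of this toy computation graph in PyTorch (the input values 3.0 and 4.0 are just made-up numbers for illustration):

```python
# f(x, y) = x * y, z = f(x, y), g(z) = z ** 2, so by the chain rule
# dg/dx = (dg/dz) * (dz/dx) = 2*z * y.
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = x * y          # f(x, y)
g = z ** 2         # g(z)
g.backward()       # PyTorch applies the chain rule automatically

print(x.grad)      # 2 * z * y = 2 * 12 * 4 = 96
print(y.grad)      # 2 * z * x = 2 * 12 * 3 = 72
```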
Now, as you can see, the longer the chain of computation, so if we have many nodes, one after another, 00:05:14.440 |
the longer this multiplication chain becomes; here we have two factors, because the distance from this node to this one is two. 00:05:25.960 |
Now imagine this number is 0.5 and this number is 0.5 also. 00:05:33.000 |
The result, when they are multiplied together, is a number that is smaller than the two initial numbers. 00:05:38.920 |
It's going to be 0.25 because it's 1/2 multiplied by 1/2 is 1/4. 00:05:46.480 |
So if we have two numbers that are smaller than one and we multiply them together, they will produce an even smaller number. 00:05:53.240 |
And if we have two numbers that are bigger than one and we multiply them together, 00:05:57.400 |
they will produce a number that is bigger than both of them. 00:06:00.480 |
So if we have a very long chain of computation, it eventually will either become a very big number or a very small number. 00:06:08.520 |
And this is not desirable, first of all, because our CPU or our GPU can only represent numbers up to a certain precision. 00:06:20.200 |
And if the number becomes too small, the contribution of this number to the output will become very small. 00:06:26.760 |
So when PyTorch, or whatever framework we are using, calculates how to adjust the weights, 00:06:35.560 |
the weight will move very, very, very slowly because the contribution of this product will be a very small number. 00:06:44.400 |
And this means that the gradient is vanishing, or in the other case it can explode and become a very big number. 00:06:53.920 |
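As a small illustration of this, here is a hedged sketch, not a real network, just a long chain of multiplications, showing how the gradient at the start of the chain vanishes or explodes depending on the factors involved:

```python
import torch

def chain_gradient(factor: float, length: int) -> float:
    # Gradient of out = x * factor**length with respect to x is factor**length.
    x = torch.tensor(1.0, requires_grad=True)
    out = x
    for _ in range(length):          # the "for loop" over the chain of nodes
        out = out * factor
    out.backward()
    return x.grad.item()

print(chain_gradient(0.5, 50))       # ~8.9e-16 -> vanishing gradient
print(chain_gradient(1.5, 50))       # ~6.4e+08 -> exploding gradient
```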
And this is a problem. The next problem is difficulty in accessing information from a long time ago. 00:06:59.720 |
What does it mean? It means that, as you remember from the previous slide, 00:07:03.360 |
we saw that the first input token is given to the recurrent neural network along with the first state. 00:07:10.440 |
Now, we need to think that the recurrent neural network is a long graph of computation. 00:07:16.680 |
Then we will use the new hidden state along with the next token to produce the next output. 00:07:22.880 |
If we have a very long input sequence, the last token will have a hidden state in which the contribution of the first token has nearly gone, because of this long chain of multiplications. 00:07:37.200 |
So actually, the last token will not depend much on the first token. 00:07:43.000 |
And this is also not good because, for example, we know as humans that in a text, in a quite long text, 00:07:49.600 |
the context that we saw, let's say 200 words before, is still relevant to the context of the current words. 00:07:57.480 |
And this is something that the RNN could not capture. And this is why we have the transformer. 00:08:05.680 |
So the transformer solves these problems with the recurrent neural networks, and we will see how. 00:08:11.640 |
The structure of the transformer, we can divide into two macro blocks. 00:08:17.440 |
The first macro block is called the encoder, and it's this part here. 00:08:22.560 |
The second macro block is called the decoder, and it's the second part here. 00:08:28.000 |
The third part here you see on the top is just a linear layer, and we will see why it's there and what its function is. 00:08:35.720 |
The two blocks, so the encoder and the decoder, are connected by this connection you can see here, 00:08:43.680 |
in which some output of the encoder is sent as input to the decoder. And we will also see how. 00:08:50.320 |
Let's start, first of all, with some notations that I will be using during my explanation. 00:08:57.640 |
And you should be familiar with this notation. Also to review some maths. 00:09:02.200 |
So the first thing we should be familiar with is matrix multiplication. 00:09:06.480 |
So imagine we have an input matrix, which is a sequence of, let's say, words. 00:09:12.840 |
So sequence by d model, and we will see why it's called sequence by d model. 00:09:17.400 |
So imagine we have a matrix that is 6 by 512, in which each row is a word. 00:09:27.120 |
And this word is not made of characters, but by 512 numbers. 00:09:31.600 |
So each word is represented by 512 numbers, OK, like this. 00:09:37.040 |
Imagine you have 512 of them along this row, 512 along this other row, etc, etc. 00:09:43.520 |
1, 2, 3, 4, 5, so we need another one here, OK. 00:09:47.720 |
The first word we will call A, the second B, then C, D, E, and F. 00:09:54.920 |
If we multiply this matrix by another matrix, let's say the transpose of this matrix, 00:10:01.320 |
so it's a matrix where the rows become columns, so 3, 4, 5, and 6. 00:10:21.920 |
And then we have 512 numbers along each column, because before we had them on the rows, and now we have them on the columns. 00:10:40.240 |
This is a matrix that is 512 by 6, so let me add some brackets here. 00:10:47.440 |
If we multiply them, we will get a new matrix that is, we cancel the inner dimensions and 00:10:54.320 |
we get the outer dimensions, so it will become 6 by 6. 00:11:02.720 |
How do we calculate the values of this output matrix? 00:11:08.840 |
This is the dot product of the first row with the first column. 00:11:13.680 |
So this is A multiplied by A. The second value is the first row with the second column. 00:11:21.160 |
The third value is the first row with the third column, and so on until the last column, so A multiplied by F. 00:11:32.400 |
How is the dot product computed? Here each row has 512 numbers. 00:11:41.620 |
You take the first number of the first row and the first number of the first column, and you multiply them together. 00:11:48.640 |
Then the second value of the first row and the second value of the first column, and you multiply them together. 00:11:54.880 |
And then you add all these products together, so it will be, let's say, this number multiplied 00:12:02.120 |
by this, plus this number multiplied by this, plus this number multiplied by this, plus this 00:12:08.260 |
number multiplied by this, and so on; you sum all these products together and this is the A dot 00:12:13.940 |
product A. So we should be familiar with this notation, because I will be using it a lot during the video. 00:12:20.920 |
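If you want to check these shapes yourself, here is a quick sketch in PyTorch (the values are random, only the shapes matter):

```python
import torch

seq_len, d_model = 6, 512
A = torch.randn(seq_len, d_model)        # 6 words, each a vector of 512 numbers

scores = A @ A.T                         # (6, 512) @ (512, 6) -> (6, 6)
print(scores.shape)                      # torch.Size([6, 6])

# Entry [0, 0] is the dot product of the first row with itself:
print(torch.allclose(scores[0, 0], A[0] @ A[0]))   # True
```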
Let's start our journey of the transformer by looking at the encoder. 00:12:26.820 |
So the encoder starts with the input embeddings. 00:12:36.400 |
We have a sentence of, in this case, six words. 00:12:49.240 |
It is not necessary to always split the sentence using single words. 00:12:54.440 |
We can even split the sentence into parts that are smaller than a single word. 00:13:01.440 |
For example, we can split this sentence into, let's say, 20 tokens by splitting each word into multiple parts. 00:13:10.480 |
This is usually done in most modern transformer models, but we will not be doing it here, otherwise it would be harder to visualize. 00:13:20.400 |
So let's suppose we have this input sentence and we split it into tokens, where each token is a single word. 00:13:27.940 |
The next step we do is we map these words into numbers, and these numbers represent the position of each word in our vocabulary. 00:13:38.320 |
So imagine we have a vocabulary of all the possible words that appear in our training set. 00:13:44.600 |
Each word will occupy a position in this vocabulary. 00:13:47.880 |
So for example, the word "your" will occupy the position 105, and the word "cat" will occupy its own position. 00:13:56.800 |
And as you can see, this "cat" here has the same number as this "cat" here, because they are the same word. 00:14:04.200 |
We take these numbers, which are called input IDs, and we map them into a vector of size 512. 00:14:12.840 |
This vector is made of 512 numbers, and we always map the same word to the same embedding. 00:14:22.000 |
However, these numbers are not fixed, they are parameters of our model. 00:14:28.320 |
So our model will learn to change these numbers in such a way that they represent the meaning of the word. 00:14:35.360 |
So the input IDs never change because our vocabulary is fixed, but the embedding will 00:14:40.240 |
change along with the training process of the model. 00:14:43.240 |
So the embedding numbers will change according to the needs of the loss function. 00:14:48.680 |
So the input embeddings are basically mapping each single word into an embedding of size 512. 00:14:55.120 |
And we call this quantity, 512, d_model, because it's the same name that is also used in the paper. 00:15:05.320 |
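As a rough sketch of this step in PyTorch (the vocabulary size and the input IDs below are made-up values, only the ID 105 for "your" comes from the example above; the multiplication by the square root of d_model follows the paper):

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512
embedding = nn.Embedding(vocab_size, d_model)   # learnable lookup table

# "your cat is a lovely cat" with hypothetical IDs; "cat" gets the same ID twice
input_ids = torch.tensor([[105, 6587, 2310, 14, 4521, 6587]])
x = embedding(input_ids) * math.sqrt(d_model)   # (1, 6, 512)
print(x.shape)
```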
Let's look at the next layer of the encoder, which is the positional encoding. 00:15:13.600 |
What we want is that each word should carry some information about its position in the sentence. 00:15:21.260 |
Because now we built a matrix of words that are embeddings, but they don't convey any 00:15:27.200 |
information about where that particular word is inside the sentence. 00:15:32.920 |
And this is the job of the positional encoding. 00:15:35.700 |
So what we do is, we want the model to treat words that appear close to each other as close, and words that are distant as distant. 00:15:44.900 |
So we want the model to see the spatial information that we see with our eyes. 00:15:50.680 |
So for example, when we see this sentence, "what is positional encoding", we know that 00:15:54.840 |
the word "what" is closer to the word "is" than to the word "encoding". 00:16:02.520 |
Because we have this spatial information given by our eyes, but the model cannot see this. 00:16:07.340 |
So we need to give the model some information about how the words are spatially distributed inside the sentence. 00:16:15.960 |
And we want the positional encoding to represent a pattern that the model can learn. 00:16:24.980 |
Imagine we have our original sentence, "Your cat is a lovely cat". 00:16:28.940 |
What we do is we first convert into embeddings using the previous layer, so the input embeddings. 00:16:38.740 |
Then we create some special vectors, called the positional encoding vectors, that we add to these embeddings. 00:16:45.340 |
So this vector we see here in red is a vector of size 512, which is not learned. 00:16:53.220 |
It's computed once and not learned along with the training process. 00:16:58.800 |
And this vector represents the position of the word inside of the sentence. 00:17:04.740 |
And this should give us an output that is a vector of size, again, 512, because we are 00:17:12.220 |
summing this number with this number, this number with this number. 00:17:17.580 |
So the first dimension with the first dimension, the second dimension with the second. 00:17:21.020 |
So we will get a new vector of the same size of the input vectors. 00:17:26.380 |
How are these positional embeddings calculated? 00:17:31.220 |
Again we have a smaller sentence, let's say "your cat is". 00:17:35.140 |
And you may have seen the following expressions from the paper. 00:17:39.500 |
What we do is we create a vector of size D model, so 512. 00:17:47.040 |
And for each position in this vector, we calculate the value using these two expressions from the paper. 00:17:56.020 |
So the first argument indicates the position of the word inside of the sentence. 00:18:01.020 |
So the word "your" occupies the position zero. 00:18:04.740 |
And for the even dimensions of this vector, so zero, two, four, up to 510, we use the first expression, the sine. 00:18:16.740 |
And for the odd positions of this vector, we use the second expression. 00:18:22.620 |
And we do this for all the words inside of the sentence. 00:18:25.820 |
So this particular value is calculated as PE(1,0), because it is for the word at position one and dimension zero of the vector. 00:18:42.180 |
And PE(1,1) means the word at position one, dimension one. 00:18:48.980 |
So there we will use the cosine, with the position equal to one, and 2i + 1 equal to one. 00:19:03.320 |
If we have another sentence, we will not have different positional encodings. 00:19:08.780 |
We will have the same vectors, even for different sentences, because the positional encoding 00:19:15.140 |
are computed once and reused for every sentence that our model will see, during inference or training. 00:19:23.000 |
So we only compute the positional encodings once, when we create the model, and we save them. 00:19:29.140 |
We don't need to compute it every time we feed a sentence to the model. 00:19:36.520 |
So why did the authors choose the sine and cosine functions to represent positional encodings? 00:19:42.620 |
Because let's watch the plot of these two functions. 00:19:46.760 |
You can see the plot is by position, so the position of the word inside of the sentence, 00:19:51.260 |
and this depth is the dimension along the vector, so the 2i that you saw before in the formula. 00:19:59.460 |
And if we plot them, we can see, as humans, a pattern here. 00:20:02.960 |
And we hope that the model can also see this pattern. 00:20:07.100 |
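Here is a sketch of how these positional encodings can be computed in PyTorch, following the sine/cosine expressions from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the slicing details are one possible implementation, not necessarily the exact code from the video:

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    # 10000^(-2i/d_model), one value per pair of dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions -> sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions  -> cosine
    return pe

pe = positional_encoding(seq_len=6, d_model=512)
print(pe.shape)              # torch.Size([6, 512]), computed once and never learned
# x = embeddings + pe        # added to the input embeddings, same shape
```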
Okay, the next layer of the encoder is the multi-head attention. 00:20:12.420 |
We will not go directly into the multi-head attention; we will first visualize the single-head 00:20:19.260 |
attention, so the self-attention with a single head. 00:20:26.980 |
Self-attention is a mechanism that existed before they introduced the transformer. 00:20:32.060 |
The authors of the transformer just changed it into a multi-head attention. 00:20:41.140 |
The self-attention allows the model to relate words to each other. 00:20:45.940 |
Okay, so we had the input embeddings that capture the meaning of the word. 00:20:52.060 |
Then we had the positional encoding that gives the information about the position of the 00:20:59.400 |
Now we want this self-attention to relate words to each other. 00:21:04.220 |
Now imagine we have an input sequence of 6 words with a d-model of size 512, which can 00:21:13.340 |
be represented as a matrix that we will call Q, K, and V. 00:21:18.140 |
So our Q, K, and V are the same matrix representing the input. 00:21:25.220 |
So the input of 6 words with the dimension of 512. 00:21:29.940 |
So each word is represented by a vector of size 512. 00:21:34.860 |
We basically apply this formula we saw here from the paper to calculate the attention, and it's called self-attention 00:21:42.780 |
because each word in the sentence is related to the other words in the same sentence. 00:21:51.940 |
So we start with our Q matrix, which is the input sentence. 00:21:57.780 |
So we have 6 rows, and on the columns we have 512 columns. 00:22:03.300 |
Now they are really difficult to draw, but let's say we have 512 columns, and here we have 6 rows. 00:22:11.700 |
Now what we do, according to this formula, we multiply it by the same sentence but transposed. 00:22:17.860 |
So the transposed of the K, which is again the same input sequence, we divide it by the 00:22:23.940 |
square root of 512, and then we apply the softmax. 00:22:29.100 |
The output of this, as we saw before in the initial matrix notation: when 00:22:36.060 |
we multiply a 6 by 512 matrix with another matrix that is 512 by 6, we obtain a new matrix that is 6 by 6. 00:22:46.300 |
And the first value in this matrix is the dot product of the first row with the first column, 00:22:53.300 |
the next value is the dot product of the first row with the second column, etc. 00:22:58.740 |
The values here are actually randomly generated, so don't concentrate on the values. 00:23:03.180 |
What you should notice is that the softmax rescales all these values in such a way that each row sums up to 1. 00:23:10.160 |
So this row, for example, here, sums up to 1. 00:23:18.220 |
And this value we see here is the dot product of the embedding of the first word, "your", with itself. 00:23:27.020 |
This value here is the dot product of the embedding of the word "your" with the embedding of the word "cat". 00:23:35.460 |
And this value here is the dot product of the embedding of the word "your" with the embedding of the next word, and so on. 00:23:44.500 |
And each value somehow represents a score of how intense the relationship is between one word and another. 00:23:55.100 |
So for now we just multiplied Q by K transposed, divided by the square root of dk, and applied the softmax. 00:24:06.500 |
We multiply this matrix by V, and we obtain a new matrix, which is 6 by 512. 00:24:12.740 |
So if we multiply a matrix that is 6 by 6 with another that is 6 by 512, we get a new matrix that is 6 by 512. 00:24:21.780 |
And one thing you should notice is that the dimension of this matrix is exactly the dimension of the initial input matrix. 00:24:32.420 |
We obtain a new matrix that has 6 rows, with 512 columns, 00:24:40.780 |
in which the rows are our words, so we have 6 words, and each word has an embedding of size 512. 00:24:48.020 |
So now this embedding here represents not only the meaning of the word, which was given 00:24:55.020 |
by the input embedding, not only the position of the word, which was added by the positional 00:25:00.460 |
encoding, but now somehow this special embedding, so these values represent a special embedding 00:25:06.740 |
that also captures the relationship of this particular word with all the other words. 00:25:14.300 |
And this particular embedding of this word here also captures not only its meaning, not 00:25:20.060 |
only its position inside of the sentence, but also the relationship of this word with all the other words. 00:25:27.420 |
I want to remind you that this is not the multi-head attention, we are just watching the single-head self-attention. 00:25:34.060 |
We will see later how this becomes the multi-head attention. 00:25:41.780 |
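A minimal sketch of this single-head self-attention in PyTorch, assuming a random (6, 512) input that already contains embeddings plus positional encoding: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with Q, K and V all equal to the input.

```python
import math
import torch

seq_len, d_model = 6, 512
x = torch.randn(seq_len, d_model)        # embeddings + positional encoding

Q = K = V = x                            # self-attention: all three are the input
scores = Q @ K.T / math.sqrt(d_model)    # (6, 6), one score per pair of words
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
out = weights @ V                        # (6, 512), same shape as the input

print(weights.shape, out.shape)
print(weights.sum(dim=-1))               # every row sums to 1
```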
Self-attention has some properties that are very desirable. The first is that it is permutation invariant. 00:25:47.860 |
What does it mean to be permutation invariant? 00:25:49.980 |
It means that if we have a matrix, let's say not of 6 words but, in this 00:25:57.420 |
case, just 4 words, so A, B, C, and D, and suppose that by applying the formula before 00:26:04.700 |
this produces this particular matrix, in which there is a new special embedding for the word 00:26:12.220 |
A, a new special embedding for the word B, a new special embedding for the word C, and one for D. 00:26:18.900 |
If we change the position of these two input rows, the values will not change, only their position will. 00:26:26.860 |
So the values of B' will not change, it will just change position, and 00:26:32.160 |
C' will also change position, but the values in each vector will not change, and this is permutation invariance. 00:26:39.220 |
Self-attention as of now requires no parameters, I mean, I didn't introduce any parameter that 00:26:44.500 |
is learned by the model, I just took the initial sentence of, in this case, 6 words, we multiplied 00:26:52.220 |
it by itself, we divide it by a fixed quantity which is the square root of 512 and then we 00:26:58.860 |
apply the softmax which is not introducing any parameter, so for now the self-attention 00:27:04.100 |
didn't require any parameter except for the embedding of the words. 00:27:10.100 |
This will change later when we introduce the multi-head attention. 00:27:14.860 |
So we expect, because each value in the self-attention, in the softmax matrix, is a dot product of 00:27:23.020 |
the word embedding with itself and the other words, we expect the values along the diagonal 00:27:28.220 |
to be the maximum, because it's the dot product of each word with itself. 00:27:35.420 |
And there is another property of this matrix, that is, before we apply the softmax, if we 00:27:43.620 |
replace a value in this matrix. Suppose we don't want the words "your" and "cat" to interact 00:27:50.100 |
with each other, or we don't want the words, let's say, "is" and "lovely" to interact with 00:27:54.820 |
each other. What we can do is, before we apply the softmax, we can replace this value with 00:27:59.980 |
-infinity and also this value with -infinity, and when we apply the softmax, the softmax 00:28:08.900 |
will replace -infinity with 0, because as you remember the softmax numerator is e to the power 00:28:15.300 |
of x, and if x goes to -infinity, e to the power of -infinity will become very very close to zero. 00:28:25.580 |
This is a desirable property that we will use in the decoder of the transformer. 00:28:30.660 |
Now let's have a look at what is a multi-head attention. 00:28:34.140 |
So what we just saw was the self-attention, and now we want to convert it into a multi-head attention. 00:28:40.420 |
You may have seen these expressions from the paper, but don't worry, I will explain them one by one. 00:28:47.540 |
Imagine we have our encoder, so we are on the encoder side of the transformer, and we 00:28:52.900 |
have our input sentence, which is, let's say, 6 by 512, so 6 words where 512 is the size of the embedding vector. 00:29:03.280 |
In this case I call it sequence by dmodel, so sequence is the sequence length, as you 00:29:07.980 |
can see on the legend in the bottom left of the slide, and dmodel is the size of the embedding vector, so 512. 00:29:17.740 |
What we do, just like the picture shows, we take this input and we make 4 copies of it. 00:29:25.100 |
One will be sent along this connection we can see here, and 3 will be sent to the multi-head 00:29:34.380 |
attention with 3 respective names, so it's the same input that becomes 3 matrices. 00:29:43.160 |
One is called query, one is called key, and one is called value. 00:29:47.160 |
So basically we are taking this input and making 3 copies of it, which we call Q, K, and V. What does the multi-head attention do with them? 00:29:57.080 |
First of all, it multiplies these 3 matrices by 3 parameter matrices called wq, wk, and wv. 00:30:07.120 |
These matrices have dimension dmodel by dmodel. 00:30:10.640 |
So if we multiply a matrix that is sequence by dmodel with another one that is dmodel 00:30:15.520 |
by dmodel, we get a new matrix as output that is sequence by dmodel. 00:30:21.420 |
So basically the same dimension as the starting matrix. 00:30:30.560 |
Our next step is to split these matrices into smaller matrices. 00:30:37.160 |
We can split this matrix q' by the sequence dimension or by the dmodel dimension. 00:30:44.720 |
In the multi-head attention we always split by the dmodel dimension. 00:30:48.780 |
So every head will see the full sentence but a smaller part of the embedding of each word. 00:30:57.280 |
So if we have an embedding of, let's say, 512, it will become smaller embeddings of size 512 divided by h. 00:31:09.020 |
So dk is dmodel divided by h, where h is the number of heads. 00:31:17.560 |
We can then calculate the attention between these smaller matrices, so q1, k1 and v1, using the formula we saw before. 00:31:28.600 |
And this will result in small matrices called head1, head2, head3 and head4. 00:31:36.460 |
The dimension of head1 up to head4 is sequence by dv. 00:31:43.200 |
What is dv? It's basically equal to dk; it's just called dv because the last multiplication 00:31:49.740 |
is done with V, and in the paper they call it dv, so I'm also sticking to the same names. 00:31:56.140 |
Our next step is to combine these matrices, these small heads, by concatenating them along the dv dimension. 00:32:09.660 |
So we concat all these heads together and we get a new matrix that is sequence by h 00:32:16.380 |
multiplied by dv, and since dv is equal to dk, h multiplied by dv is equal to dmodel. 00:32:26.820 |
So we get back the initial shape, so it's sequence by dmodel here. 00:32:35.040 |
The next step is to multiply the result of this concatenation by wo and wo is a matrix 00:32:42.160 |
that is h multiplied by dv, so dmodel, with the other dimension being dmodel. 00:32:48.960 |
And the result of this is a new matrix that is the result of the multi-head attention 00:32:56.920 |
So the multi-head attention, instead of calculating the attention between these matrices here, 00:33:04.000 |
so q', k' and v', splits them along the dmodel dimension into smaller matrices and calculates 00:33:12.720 |
the attention between these smaller matrices. 00:33:15.720 |
So each head is watching the full sentence but a different aspect of the embedding of each word. 00:33:25.100 |
This is because we want each head to watch a different aspect of the same word. 00:33:30.480 |
For example, in the Chinese language, but also in other languages, one word may be a 00:33:35.640 |
noun in some cases, may be a verb in some other cases, may be an adverb in some other cases, depending on the context. 00:33:43.380 |
So what we want is that one head maybe learns to relate that word as a noun, another head 00:33:50.480 |
maybe learns to relate that word as a verb, and another head learns to relate that word as an adverb. 00:34:03.160 |
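Here is a sketch of the multi-head attention computation just described, assuming PyTorch and h = 4 heads; the parameter matrices wq, wk, wv and wo are modeled as linear layers without bias:

```python
import math
import torch
import torch.nn as nn

seq_len, d_model, h = 6, 512, 4
d_k = d_model // h                                       # 128

w_q = nn.Linear(d_model, d_model, bias=False)            # wq
w_k = nn.Linear(d_model, d_model, bias=False)            # wk
w_v = nn.Linear(d_model, d_model, bias=False)            # wv
w_o = nn.Linear(d_model, d_model, bias=False)            # wo (h * dv = dmodel)

x = torch.randn(seq_len, d_model)
q, k, v = w_q(x), w_k(x), w_v(x)                          # (6, 512) each

# Split along the embedding dimension into h heads: (h, seq_len, d_k)
q = q.view(seq_len, h, d_k).transpose(0, 1)
k = k.view(seq_len, h, d_k).transpose(0, 1)
v = v.view(seq_len, h, d_k).transpose(0, 1)

scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)  # (h, 6, 6)
heads = scores @ v                                        # (h, 6, d_k)

# Concatenate the heads back along the embedding dimension and apply wo
concat = heads.transpose(0, 1).contiguous().view(seq_len, d_model)        # (6, 512)
out = w_o(concat)                                         # (6, 512)
print(out.shape)
```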
Now you may also have seen online that the attention can be visualized. 00:34:12.040 |
When we calculate the attention between the q and the k matrices, so when we do this operation, 00:34:18.800 |
so the softmax of q multiplied by the k divided by the square root of dk, we get a new matrix 00:34:26.240 |
just like we saw before, which is sequence by sequence. 00:34:29.760 |
And this represents a score that represents the intensity of the relationship between 00:34:37.160 |
We can visualize this, and this will produce a visualization similar to this one, which 00:34:44.240 |
I took from the paper, in which we see how all the heads work. 00:34:48.360 |
So for example, if we concentrate on this word "making", this word here, we can see 00:34:53.560 |
that "making" is related to the word "difficult", so this word here, by different heads. 00:34:59.300 |
So the blue head, the red head, and the green head. 00:35:03.560 |
But let's say the violet head is not relating these two words together. 00:35:09.500 |
So "making" and "difficult" is not related by the violet or the pink head. 00:35:14.920 |
The violet head or the pink head, they are relating the word "making" to other words, 00:35:26.480 |
Because maybe this pink head could see the part of the embedding that these other heads 00:35:32.080 |
could not see, that made this interaction possible between these two words. 00:35:41.120 |
You may also be wondering why these three matrices are called "query", "keys" and "values". 00:35:47.080 |
Okay, the terms come from the database terminology, or from the Python-like dictionaries. 00:35:53.520 |
But I would also like to give an interpretation of my own, making a very simple example. 00:36:03.540 |
So imagine we have a Python-like dictionary, or a database, in which we have keys and values. 00:36:10.180 |
The keys are the category of movies, and the values are the movies belonging to that category. 00:36:19.800 |
So we have a Romantic category, which includes Titanic, and we have an Action category that includes its own movies, and so on. 00:36:27.280 |
Imagine we also have a user that makes a query, and the query is "love". 00:36:32.980 |
Because we are in the transformer world, all these words are actually represented by embeddings of size 512. 00:36:40.440 |
So what our transformer will do is, it will convert this word "love" into an embedding of size 512. 00:36:46.680 |
All these queries and values are already embeddings of 512, and it will calculate the dot product 00:36:53.480 |
between the query and all the keys, just like the formula. 00:36:57.960 |
So as you remember, the formula is a softmax of query multiplied by the transpose of the 00:37:02.780 |
keys, divided by the square root of dmodel. 00:37:06.320 |
So we are doing the dot product of all the queries with all the keys. 00:37:11.140 |
In this case, the word "love" with all the keys, one by one. 00:37:15.520 |
And this will result in a score that will amplify some values or not amplify other values. 00:37:25.920 |
In this case, our embedding may be in such a way that the word "love" and "romantic" 00:37:31.560 |
are related to each other, the word "love" and "comedy" are also related to each other, 00:37:37.000 |
but not as intensely as the words "love" and "romantic". 00:37:41.280 |
So it's a, how to say, less strong relationship. 00:37:46.000 |
But maybe the words "horror" and "love" are not related at all, so maybe their softmax score will be very close to zero. 00:37:56.600 |
Our next layer in the encoder is the add and norm. 00:38:02.520 |
And to introduce the add and norm, we need the layer normalization. 00:38:05.820 |
So let's see what is the layer normalization. 00:38:08.120 |
Layer normalization is a layer that, okay, let's make a practical example. 00:38:15.480 |
Imagine we have a batch of n items, in this case n is equal to 3: item 1, item 2, item 3. 00:38:25.240 |
Each of these items will have some features, it could be an embedding, so for example it 00:38:30.360 |
could be a feature vector of size 512, but it could also be a very big matrix of thousands of features. 00:38:38.320 |
What we do is we calculate the mean and the variance of each of these items independently 00:38:43.320 |
from each other, and we replace each value with another value that is given by this expression. 00:38:49.820 |
So basically we are normalizing each item so that the new values have zero mean and unit variance. 00:38:57.160 |
Actually we also multiply this new value with a parameter called gamma, and then we add 00:39:03.900 |
another parameter called beta, and this gamma and beta are learnable parameters. 00:39:10.280 |
And the model should learn to multiply and add these parameters so as to amplify the 00:39:17.180 |
values that it wants to be amplified and not amplify the values that it doesn't want to be amplified. 00:39:25.260 |
So we don't just normalize, we actually introduce some parameters. 00:39:30.360 |
And I found a really nice visualization from paperswithcode.com in which we see the difference between batch norm and layer norm. 00:39:39.060 |
So as we can see, in the layer normalization, if N is the batch dimension, 00:39:45.720 |
we are calculating the statistics over all the values belonging to one item in the batch, while in the batch 00:39:51.880 |
norm we are calculating them over the same feature across all the items of the batch. 00:39:59.800 |
So we are mixing, let's say, values from different items of the batch, while in the layer normalization 00:40:06.200 |
we are treating each item in the batch independently, which will have its own mean and its own variance. 00:40:17.440 |
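A small sketch comparing a manual layer normalization with PyTorch's built-in nn.LayerNorm, assuming a batch of 3 items with 512 features each:

```python
import torch
import torch.nn as nn

batch, d_model = 3, 512
x = torch.randn(batch, d_model)

eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)                 # one mean per item
var = x.var(dim=-1, keepdim=True, unbiased=False)   # one variance per item
gamma = torch.ones(d_model)                         # learnable in a real model
beta = torch.zeros(d_model)                         # learnable in a real model
manual = gamma * (x - mean) / torch.sqrt(var + eps) + beta

layer_norm = nn.LayerNorm(d_model)                  # PyTorch's built-in version
print(torch.allclose(manual, layer_norm(x), atol=1e-5))   # True
```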
Now let's move on to the decoder. In the encoder we saw the input embeddings; in the decoder they are called output embeddings, but they work in the same way. 00:40:27.960 |
Here also we have the positional encoding, and they are also the same as the encoder. 00:40:35.880 |
The next layer is the masked multihead attention, and we will see it now. 00:40:41.160 |
We also have the multihead attention here, and here we should see that the encoder 00:40:49.980 |
produces an output that is sent to the decoder in the form of keys and values. 00:41:02.240 |
While the query, so this connection here, is the query coming from the decoder. 00:41:08.120 |
So this multihead attention is not a self-attention anymore, it's a cross-attention, because it combines two different sequences. 00:41:16.420 |
One is sent from the encoder side, so let's write encoder, in which we provide the output 00:41:22.480 |
of the encoder and we use it as keys and values, while the output of the masked multihead attention 00:41:30.160 |
is used as the query in this multihead attention. 00:41:34.840 |
The masked multihead attention is a self-attention of the input sentence of the decoder. 00:41:40.760 |
So we take the input sentence of the decoder, we transform into embeddings, we add the positional 00:41:46.880 |
encoding, we give it to this multihead attention in which the query key and values are the 00:41:51.520 |
same input sequence, we do the add and norm, then we send this as the queries of the multihead 00:42:00.080 |
attention, while the keys and the values are coming from the encoder, and then we do the add and norm again. 00:42:05.680 |
I will not be showing the feedforward, which is just a fully connected layer. 00:42:10.840 |
We then send the output of the feedforward to the add and norm, and finally to the linear layer. 00:42:18.640 |
So let's have a look at the masked multihead attention and how it differs from a normal multihead attention. 00:42:26.360 |
What we want, our goal, is that we want to make the model causal. 00:42:31.080 |
It means that the output at a certain position can only depend on the words at the previous positions. 00:42:37.880 |
So the model must not be able to see future words. 00:42:44.120 |
As you saw, the output of the softmax in the attention calculation formula is this matrix, which is sequence by sequence. 00:42:52.240 |
If we want to hide the interaction of some words with other words, we delete this value 00:42:58.320 |
and we replace it with minus infinity before we apply the softmax, so that the softmax will replace it with zero. 00:43:08.520 |
And we do this for all the interaction that we don't want. 00:43:15.000 |
So we don't want the word "your" to watch the words "cat is a lovely cat". 00:43:19.080 |
And we don't want the word "cat" to watch future words, but only all the words that come before it. 00:43:32.880 |
So we can see that we are replacing all these values here that are above this diagonal. 00:43:41.840 |
So this is the principal diagonal of the matrix. 00:43:44.480 |
And we want all the values that are above this diagonal to be replaced with minus infinity 00:43:50.160 |
so that the softmax will replace them with zero. 00:43:54.840 |
Let's see in which stage of the multi-head attention this mechanism is introduced. 00:44:00.960 |
So when we calculate the attention between these smaller matrices, so Q1, K1, and V1, before 00:44:09.720 |
we apply the softmax, we replace these values. 00:44:12.960 |
So this one, this one, this one, this one, this one, etc. with minus infinity. 00:44:18.720 |
Then we apply the softmax and then the softmax will take care of transforming these values 00:44:26.200 |
So basically we don't want these words to interact with each other. 00:44:31.440 |
And if we don't want this interaction, the model will learn to not make them interact 00:44:35.480 |
because the model will not get any information from this interaction. 00:44:41.620 |
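Here is a sketch of how such a causal mask can be applied before the softmax, assuming PyTorch; torch.triu builds the upper-triangular part that has to be hidden:

```python
import math
import torch

seq_len, d_k = 6, 512
q = k = torch.randn(seq_len, d_k)

scores = q @ k.T / math.sqrt(d_k)                       # (6, 6)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf")) # hide future words
weights = torch.softmax(scores, dim=-1)

print(weights)   # upper-triangular part is 0: each word only attends to itself
                 # and to the words that come before it
```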
Now let's look at how the inference and training works for a transformer model. 00:44:46.780 |
As I said previously, we will be dealing with a translation task, 00:44:53.360 |
because it's easy to visualize and it's easy to understand all the steps. 00:45:01.040 |
We will go from an English sentence, "I love you very much", to an Italian sentence, "Ti amo molto". 00:45:12.800 |
We start with a description of the transformer model and with our English sentence, 00:45:22.460 |
so our English sentence here, to which we prepend and append two special tokens. 00:45:29.200 |
One is called start of sentence and one is called end of sentence. 00:45:33.520 |
These two tokens are taken from the vocabulary. 00:45:36.960 |
So they are special tokens in our vocabulary that tell the model what is the start position 00:45:43.120 |
of a sentence and what is the end of a sentence. 00:45:49.240 |
For now just think that we take our sentence, we prepend a special token and we append another special token. 00:45:57.160 |
As you can see from the picture, we take our inputs, we transform into input embeddings, 00:46:02.000 |
we add the positional encoding and then we send it to the encoder. 00:46:06.560 |
So this is our encoder input, sequence by D model, we send it to the encoder, it will 00:46:11.040 |
produce an output which is sequence by D model and it's called encoder output. 00:46:17.600 |
So as we saw previously, the output of the encoder is another matrix that has the same 00:46:23.560 |
dimension as the input matrix in which the embedding, we can see it as a sequence of 00:46:31.120 |
embeddings in which this embedding is special because it captures not only the meaning of 00:46:36.160 |
the word which was given by the input embedding we saw here, so by this, not only the position 00:46:42.480 |
which was given by the positional encoding, but also the interaction of every word with 00:46:48.420 |
every other word in the same sentence because this is the encoder. 00:46:55.080 |
So it's the interaction of each word in the sentence with all the other words in the same 00:47:03.040 |
We want to convert this sentence into Italian, so we prepare the input of the decoder, which is "ti amo molto". 00:47:12.180 |
As you can see from the picture of the transformer, the outputs here are shifted right. 00:47:20.060 |
Basically, it means we prepend a special token called SOS, start of sentence. 00:47:26.880 |
You should also notice that these two sequences actually, when we code the transformer, so 00:47:35.300 |
if you watch my other video on how to code a transformer, you will see that we make this 00:47:39.920 |
sequence of fixed length so that if we have a sentence that is "ti amo molto" or a very 00:47:44.840 |
long sequence, actually when we feed them to the transformer, they all become of the same length. 00:47:54.360 |
We add padding words to reach the desired length. 00:47:58.360 |
So if our model can support, let's say, a sequence length of 1000, in this case we have 00:48:03.800 |
4 tokens, we will add 996 tokens of padding to make this sentence long enough to reach the sequence length. 00:48:13.640 |
Of course, I'm not doing it here because it's not easy to visualize otherwise. 00:48:21.640 |
We then transform it into embeddings, we add the positional encoding, and then we send it first 00:48:28.360 |
to the multi-head attention, to the masked multi-head attention, so along with the causal 00:48:32.680 |
mask and then we take the output of the encoder and we send it to the decoder as keys and 00:48:41.360 |
values, while the queries are coming from the masked multi-head attention, so the queries are coming from this 00:48:47.760 |
layer and the keys and the values are the output of the encoder. 00:48:53.200 |
The output of all this block here, so all this big block here, will be a matrix that 00:49:00.400 |
is sequence by dmodel, just like for the encoder. 00:49:04.720 |
However, we can see that this is still an embedding, because it's a dmodel, it's a vector of size 512. 00:49:12.920 |
How can we relate this embedding back into our dictionary? 00:49:17.840 |
How can we understand what is this word in our vocabulary? 00:49:23.520 |
That's why we need a linear layer that will map sequence by dmodel into sequence by vocabulary size. 00:49:31.720 |
So it will tell, for every embedding that it sees, what is the position of that word in 00:49:37.560 |
our vocabulary, so that we can understand what is the actual token that is output by the model. 00:49:46.180 |
After that we apply the softmax, and then we have our label, so what we expect the model to output. 00:49:59.980 |
We expect the model to output "ti amo molto" end of sentence, and this is called the label, or the target. 00:50:08.680 |
What we do when we have the output of the model and the corresponding label? 00:50:13.080 |
We calculate the loss, in this case the cross entropy loss, and then we backpropagate the loss. 00:50:21.360 |
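As a sketch of this training step, assuming a hypothetical `model` that takes the encoder input and the decoder input and returns the logits (the interface and names are illustrative, not the exact code from my other video):

```python
import torch
import torch.nn as nn

# encoder_input:  <SOS> I love you very much <EOS>   -> (1, src_len) token IDs
# decoder_input:  <SOS> ti amo molto                 -> (1, tgt_len) token IDs
# label:          ti amo molto <EOS>                 -> (1, tgt_len) token IDs

def training_step(model, optimizer, encoder_input, decoder_input, label, pad_id):
    logits = model(encoder_input, decoder_input)     # (1, tgt_len, vocab_size), one pass
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)),            # (batch * tgt_len, vocab_size)
        label.view(-1),                              # (batch * tgt_len,)
        ignore_index=pad_id,                         # padding does not contribute
    )
    loss.backward()                                  # backpropagate the loss
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```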
Now let's understand why we have these special tokens called SOS and EOS. 00:50:27.320 |
Basically you can see that here the sequence length is 4, actually it's 1000 because we 00:50:32.000 |
have the padding, but let's say we don't have any padding, so it's 4 tokens: start 00:50:36.160 |
of sentence "ti amo molto" and what we want is "ti amo molto" end of sentence. 00:50:42.100 |
So our model, when it sees the start of sentence token, it will output the first token, "ti". 00:50:51.920 |
When it will see "ti" it will output "amo", when it will see "amo" it will output "molto" 00:50:59.400 |
and when it will see "molto" it will output end of sentence, which will indicate that 00:51:05.240 |
ok, the translation is done, and we will see this mechanism in the inference. 00:51:11.920 |
Ah, this all happens in one time step, just like I promised at the beginning of the video. 00:51:19.400 |
I said that with recurrent neural networks we have n time steps to map n input sequence 00:51:27.160 |
into n output sequence, but this problem would be solved with the transformer, yes, it has 00:51:33.880 |
been solved, because you can see here we didn't do any for loop, we just did all in one pass, 00:51:40.160 |
we give an input sequence to the encoder, an input sequence to the decoder, we produced 00:51:46.240 |
some outputs, we calculated the cross entropy loss with the label and that's it, it all 00:51:52.240 |
happens in one time step, and this is the power of the transformer, because it made 00:51:57.200 |
it very easy and very fast to train very long sequences and with very very nice performance 00:52:04.480 |
that you can see in chatGPT, you can see in GPT, in BERT, etc. 00:52:14.560 |
Again we have our English sentence "I love you very much", and we want to map it into an Italian sentence. 00:52:22.720 |
We have our usual transformer, we prepare the input for the encoder, which is start 00:52:28.320 |
of sentence "I love you very much", end of sentence. 00:52:31.920 |
We convert into input embeddings, then we add the positional encoding, we prepare the 00:52:35.680 |
input for the encoder and we send it to the encoder. 00:52:39.000 |
The encoder will produce an output, which is sequence by dmodel, and we saw before 00:52:43.320 |
that it's a sequence of special embeddings that capture the meaning, the position, but 00:52:48.040 |
also the interaction of all the words with other words. 00:52:52.720 |
What we do is, for the decoder, we give it just the start of sentence token, and of course 00:52:59.280 |
we add enough padding tokens to reach our sequence length. 00:53:04.680 |
We just give the model the start of sentence token, and again, for this single token we 00:53:11.520 |
convert it into embeddings, we add the positional encoding and we send it to the decoder as its input. 00:53:18.440 |
The decoder will take its input as the query, and the keys and the values coming from 00:53:25.160 |
the encoder, and it will produce an output, which is sequence by dmodel. 00:53:31.400 |
Again, we want the linear layer to project it back to our vocabulary, and this projection is called the logits. 00:53:39.600 |
What we do is, we apply the softmax, which will select, given the logits, the position 00:53:47.440 |
of the word in the vocabulary that has the maximum softmax score. 00:53:52.780 |
This is how we know what words to select from the vocabulary. 00:53:57.480 |
And this, hopefully, should produce the first output token, which is "ti", if the model has been trained well. 00:54:06.060 |
This, however, happens at time step 1, so when we train the model, the transformer model, 00:54:11.780 |
it happens in one pass, so we have one input sequence, one output sequence, we give it 00:54:16.400 |
to the model, we do it one time step, and the model will learn it. 00:54:20.340 |
When we inference, however, we need to do it token by token, and we will also see why 00:54:27.460 |
At time step 2, we don't need to recompute the encoder output again, because our English 00:54:36.740 |
sentence didn't change, so we hope the encoder should produce the same output for it. 00:54:44.340 |
And then, what we do is, we take the output of the previous step, so "ti", we append 00:54:52.740 |
it to the input of the decoder, and then we feed it to the decoder, again with the output 00:55:00.420 |
of the encoder from the previous step, which will produce an output sequence from the decoder 00:55:05.860 |
side, which we again project back into our vocabulary, and we get the next token, which is "amo". 00:55:14.780 |
So, as I said before, we are not recalculating the output of the encoder for every time step, 00:55:23.700 |
because our English sentence didn't change at all. 00:55:26.780 |
What is changing is the input of the decoder, because at every time step, we are appending 00:55:31.060 |
the output of the previous step to the input of the decoder. 00:55:35.220 |
We do the same for the time step 3, and we do the same for the time step 4. 00:55:42.140 |
And hopefully, we will stop when we see the end of sentence token, because that's how the model tells us the translation is done. 00:55:57.460 |
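Here is a sketch of this token-by-token inference loop, assuming a hypothetical `model` with separate `encode` and `decode` methods (the names are illustrative):

```python
import torch

def greedy_decode(model, encoder_input, sos_id, eos_id, max_len=100):
    encoder_output = model.encode(encoder_input)               # computed only once
    decoder_input = torch.tensor([[sos_id]])                   # start with <SOS>
    for _ in range(max_len):
        logits = model.decode(decoder_input, encoder_output)   # (1, cur_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy: max score
        decoder_input = torch.cat([decoder_input, next_token], dim=1)  # append and feed back
        if next_token.item() == eos_id:                        # translation is done
            break
    return decoder_input
```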
When we inference a model, like in this case the translation model, there are many strategies. The one we used here is called greedy. 00:56:07.740 |
So for every step, we take the word with the maximum softmax value. 00:56:14.500 |
And this strategy works, usually not bad, but there are better strategies, and one of them is called beam search. 00:56:25.100 |
In beam search, instead of always greedily, that's why the previous one is called greedy, instead 00:56:30.780 |
of greedily taking the maximum softmax value, we take the top B values, and then for each 00:56:38.780 |
of these choices, we inference what are the next possible tokens for each of the top B 00:56:45.860 |
values at every step, and we keep only the B most probable sequences, and at the end we choose the most probable one. 00:56:55.440 |
This is called beam search, and generally it performs better. 00:57:03.560 |
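And here is a compact sketch of beam search under the same hypothetical `model.decode` interface as above, keeping the B most probable partial sequences at every step:

```python
import torch

def beam_search(model, encoder_output, sos_id, eos_id, beam_size=3, max_len=100):
    beams = [(torch.tensor([[sos_id]]), 0.0)]                  # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[0, -1].item() == eos_id:                 # finished beam: keep as-is
                candidates.append((tokens, score))
                continue
            log_probs = torch.log_softmax(
                model.decode(tokens, encoder_output)[:, -1], dim=-1)   # (1, vocab_size)
            top_logp, top_ids = log_probs.topk(beam_size, dim=-1)
            for logp, idx in zip(top_logp[0], top_ids[0]):
                new_tokens = torch.cat([tokens, idx.view(1, 1)], dim=1)
                candidates.append((new_tokens, score + logp.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[0, -1].item() == eos_id for t, _ in beams):   # all beams finished
            break
    return beams[0][0]                                         # most probable sequence
```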
I know it was a long video, but it was really worth it to go through each aspect of the transformer. 00:57:10.220 |
I hope you enjoyed this journey with me, so please subscribe to the channel, and don't 00:57:14.680 |
forget to watch my other video on how to code a transformer model from scratch, in which 00:57:19.860 |
I describe not only again the structure of the transformer model while coding it, but 00:57:26.040 |
I also show you how to train it on a dataset of your choice, how to run inference with it, and I 00:57:33.680 |
also provided the code on GitHub, and a Colab notebook to train the model directly on Colab. 00:57:44.480 |
Please subscribe to the channel, and let me know what you didn't understand, so that I 00:57:49.480 |
can give more explanation, and please tell me what are the problems in this kind of videos, 00:57:55.520 |
or in this particular video, that I can improve for the next videos. 00:58:00.320 |
Thank you very much, and have a great rest of the day!