
Attention is all you need (Transformer) - Model explanation (including math), Inference and Training


Chapters

0:00 Intro
1:10 RNN and their problems
8:04 Transformer Model
9:02 Maths background and notations
12:20 Encoder (overview)
12:31 Input Embeddings
15:04 Positional Encoding
20:08 Single Head Self-Attention
28:30 Multi-Head Attention
35:39 Query, Key, Value
37:55 Layer Normalization
40:13 Decoder (overview)
42:24 Masked Multi-Head Attention
44:59 Training
52:09 Inference

Transcript

Hello guys, welcome to my video about the transformer. This is actually version 2.0 of my series on the transformer. I had a previous video in which I talked about the transformer, but the audio quality was not good, and since that video had a huge success, my viewers suggested that I improve the audio quality, so this is why I'm doing this video.

You don't have to watch the previous series, because I will be doing basically the same things but with some improvements, so I'm actually compensating for some mistakes I made and adding some improvements. After watching this video, I suggest watching my other video about how to code a transformer model from scratch: how to code the model itself, how to train it on data, and how to run inference with it.

Stick with me, because it's going to be a bit of a long journey, but for sure worth it. Now, before we talk about the transformer, I want to first talk about recurrent neural networks, the networks that were used for most sequence-to-sequence tasks before the transformer was introduced. So let's review them.

Recurrent neural networks existed a long time before the transformer, and they allowed us to map one input sequence to another output sequence. In this case, our input is X and we want an output sequence Y. What we did before is that we split the sequence into single items: we gave the recurrent neural network the first item, X1, as input, along with an initial state, usually made up of only zeros, and the recurrent neural network produced an output, let's call it Y1.

And this happened at the first time step. Then we took the hidden state, this is called the hidden state of the network of the previous time step, along with the next input token, so X2, and the network had to produce the second output token, Y2. And then we did the same procedure at the third time step, in which we took the hidden state of the previous time step, along with the input token at the time step 3, and the network had to produce the next output token, which is Y3.

If you have n tokens, you need n time steps to map an n-sequence input into an n-sequence output. This worked fine for a lot of tasks, but had some problems. Let's review them. The problems with recurrent neural networks, first of all, are that they are slow for long sequences, because think of the process we did before, we have kind of like a for loop in which we do the same operation for every token in the input.

So the longer the sequence, the longer this computation, and this made the network not easy to train for long sequences. The second problem was the vanishing or exploding gradients. Now, you may have heard these terms or expressions on the Internet or from other videos, but I will try to give you a brief insight into what they mean on a practical level.

So as you know, frameworks like PyTorch convert our networks into a computation graph. So basically, suppose we have a computation graph. This is not a neural network; I will be making a computational graph that is very simple and has nothing to do with neural networks, but it will show you the problems that we have.

So imagine we have two inputs, X and another input, let's call it Y. Our computational graph first, let's say, multiplies these two numbers. So we have a first function, let's call it f(x, y) = x · y. And the result, let's call it Z, is given to another function.

Let's call this function g, with g(z) = z squared, let's say. What PyTorch does, for example, is this: usually we have a loss function, and PyTorch calculates the derivative of the loss function with respect to each weight. In this case, we just calculate the derivative of the g function, so the output function, with respect to all of its inputs.

So the derivative of g with respect to X is equal to the derivative of g with respect to f, multiplied by the derivative of f with respect to X; the two df terms kind of cancel out, and this is called the chain rule. Now, as you can see, the longer the chain of computation, so if we have many nodes one after another, the longer this multiplication chain. Here we have two factors, because the distance between this node and this one is two, but imagine you have 100 or 1000.

Now imagine this number is 0.5 and this number is also 0.5. The resulting number, when we multiply them together, is smaller than both of the initial numbers: it's going to be 0.25, because 1/2 multiplied by 1/2 is 1/4. So if we have two numbers that are smaller than one and we multiply them together, they will produce an even smaller number.

And if we have two numbers that are bigger than one and we multiply them together, they will produce a number that is bigger than both of them. So if we have a very long chain of computation, it eventually will either become a very big number or a very small number.
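
To make this concrete, here is a minimal PyTorch sketch of the toy graph above, f(x, y) = x · y followed by g(z) = z²; the numbers are just the 0.5 example from before, and the loop at the end only illustrates how a long chain of factors smaller than one shrinks towards zero.

```python
import torch

# The toy computation graph from above: f(x, y) = x * y, then g(z) = z^2.
x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(0.5, requires_grad=True)

z = x * y              # f(x, y)
out = z ** 2           # g(z)
out.backward()         # autograd applies the chain rule: dg/dx = dg/dz * dz/dx

print(x.grad)          # 2*z*y = 2 * 0.25 * 0.5 = 0.25

# A long chain of factors smaller than 1 vanishes; factors larger than 1 explode.
grad = 1.0
for _ in range(100):
    grad *= 0.5
print(grad)            # about 7.9e-31, effectively zero
```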

And this is not desirable, first of all because our CPU or our GPU can only represent numbers up to a certain precision, let's say 32 bit or 64 bit. And if the number becomes too small, the contribution of this number to the output will become very small. So when PyTorch, or whatever framework does our automatic differentiation, calculates how to adjust the weights, the weights will move very, very slowly, because the contribution of this product will be a very small number.

And this means that the gradient is vanishing, or, in the other case, it can explode and become a very big number. And this is a problem. The next problem is the difficulty in accessing information from a long time ago. What does it mean? It means that, as you remember from the previous slide, we saw that the first input token is given to the recurrent neural network along with the first state.

Now, we need to remember that the recurrent neural network is a long graph of computation. It will produce a new hidden state. Then we will use the new hidden state along with the next token to produce the next output. If we have a very long input sequence, the last token will have a hidden state whose contribution from the first token has nearly vanished, because of this long chain of multiplications.

So actually, the last token will not depend much on the first token. And this is also not good because, for example, we know as humans that in a text, in a quite long text, the context that we saw, let's say 200 words before, is still relevant to the context of the current words.

And this is something that the RNN could not capture. And this is why we have the transformer. So the transformer solves these problems of recurrent neural networks, and we will see how. We can divide the structure of the transformer into two macro blocks. The first macro block is called the encoder, and it's this part here.

The second macro block is called the decoder, and it's the second part here. The third part you see here on top is just a linear layer, and we will see why it's there and what its function is. The two blocks, the encoder and the decoder, are connected by this connection you can see here, in which some output of the encoder is sent as input to the decoder.

And we will also see how. Let's start, first of all, with some notation that I will be using during my explanation. You should be familiar with this notation, and it will also let us review some maths. So the first thing we should be familiar with is matrix multiplication. So imagine we have an input matrix, which is a sequence of, let's say, words.

So it is sequence by d_model, and we will see why it's called sequence by d_model. So imagine we have a matrix that is 6 by 512, in which each row is a word. And this word is not made of characters, but of 512 numbers. So each word is represented by 512 numbers, OK, like this.

Imagine you have 512 of them along this row, 512 along this other row, etc., etc. 1, 2, 3, 4, 5, so we need another one here, OK. The first word we will call A, the second B, then C, D, E, and F. If we multiply this matrix by another matrix, let's say the transpose of this matrix, we get a matrix where the rows become columns, so 3, 4, 5, and 6.

This word will be here, B, C, D, E, and F. And then we have 512 numbers along each column, because before we had them on the rows, now they will become on the columns. So here we have the 512th number, etc, etc. This is a matrix that is 512 by 6, so let me add some brackets here.

If we multiply them, we will get a new matrix: we cancel the inner dimensions and keep the outer dimensions, so it will become 6 by 6. So it will be 6 rows by 6 columns. So let's draw it. How do we calculate the values of this output matrix?

This is 6 by 6. This is the dot product of the first row with the first column. So this is A multiplied by A. The second value is the first row with the second column. The third value is the first row with the third column until the last column, so A multiplied by F, etc.

What is the dot product? It's basically you take the first number of the first row, so here we have 512 numbers, here we have 512 numbers. So you take the first number of the first row and the first number of the first column, you multiply them together. Second value of the first row, second value of the first column, you multiply them together.

And then you add all these numbers together. So it will be this number multiplied by this one, plus this number multiplied by this one, plus this number multiplied by this one, and so on; you sum all these products together, and this is the dot product A · A.
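
As a quick sanity check, this is how that multiplication looks in code; the values are random, and it's only the shapes and the dot-product interpretation that matter here.

```python
import torch

seq_len, d_model = 6, 512
X = torch.randn(seq_len, d_model)   # one row per word: A, B, C, D, E, F

scores = X @ X.T                    # (6, 512) x (512, 6) -> (6, 6)
print(scores.shape)                 # torch.Size([6, 6])

# Entry [i, j] is the dot product of word i's row with word j's row:
# multiply the 512 pairs of numbers element-wise and sum them up.
assert torch.allclose(scores[0, 0], (X[0] * X[0]).sum())
```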

So we should be familiar with this notation because I will be using it a lot in the next slides. Let's start our journey of the transformer by looking at the encoder. So the encoder starts with the input embeddings. So what is an input embedding? First of all, let's start with our sentence.

We have a sentence of, in this case, six words. What we do is we tokenize it. We transform the sentence into tokens. What does it mean to tokenize? We split them into single words. It is not necessary to always split the sentence using single words. We can even split the sentence in smaller parts that are even smaller than a single word.

For example, we can split this sentence into, let's say, 20 tokens by splitting each word into multiple sub-word tokens. This is usually done in most modern transformer models, but we will not be doing it here, otherwise it's really difficult to visualize. So let's suppose we have this input sentence, we split it into tokens, and each token is a single word.

The next step we do is we map these words into numbers and these numbers represent the position of these words in our vocabulary. So imagine we have a vocabulary of all the possible words that appear in our training set. Each word will occupy a position in this vocabulary. So for example, the word "your" will occupy the position 105, the word "cat" will occupy the position 6,500, etc.

And as you can see, this "cat" here has the same number as this "cat" here, because they occupy the same position in the vocabulary. We take these numbers, which are called input IDs, and we map them into a vector of size 512. This vector is made of 512 numbers, and we always map the same word to the same embedding.

However, these numbers are not fixed; they are parameters of our model. So our model will learn to change these numbers in such a way that they represent the meaning of the word. So the input IDs never change, because our vocabulary is fixed, but the embedding will change along with the training process of the model.

So the embedding numbers will change according to the needs of the loss function. So the input embeddings basically map each single word into an embedding of size 512. We call this quantity 512 d_model, because it's the same name that is used in the paper "Attention Is All You Need".
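
In PyTorch this layer is just a learned lookup table. Here is a minimal sketch, where the vocabulary size and the input IDs (105 for "your", 6500 for "cat", and so on) are only illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # vocabulary size is made up here
embedding = nn.Embedding(vocab_size, d_model)  # learned along with the model

# "Your cat is a lovely cat" -> input IDs (positions in the vocabulary).
# Note that "cat" maps to the same ID, and therefore the same embedding, both times.
input_ids = torch.tensor([[105, 6500, 5021, 9, 7912, 6500]])

x = embedding(input_ids)                       # (1, 6, 512)
print(x.shape)
```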

Let's look at the next layer of the encoder, which is the positional encoding. So what is positional encoding? What we want is that each word should carry some information about its position in the sentence. Because now we built a matrix of words that are embeddings, but they don't convey any information about where that particular word is inside the sentence.

And this is the job of the positional encoding. We want the model to treat words that appear close to each other as close and words that are distant as distant. So we want the model to see the information about spatial position that we see with our eyes.

So for example, when we see the sentence "what is positional encoding", we know that the word "what" is closer to the word "is" than to the word "encoding". We have this spatial information given by our eyes, but the model cannot see it. So we need to give the model some information about how the words are spatially distributed inside the sentence.

And we want the positional encoding to represent a pattern that the model can learn. And we will see how. Imagine we have our original sentence, "Your cat is a lovely cat". What we do is we first convert into embeddings using the previous layer, so the input embeddings. And these are embeddings of size 512.

Then we create some special vectors, called the positional encoding vectors, that we add to these embeddings. So this vector we see here in red is a vector of size 512, which is not learned: it's computed once and kept fixed, not learned along with the training process. And this vector represents the position of the word inside the sentence.

And this should give us an output that is a vector of size, again, 512, because we are summing this number with this number, this number with this number. So the first dimension with the first dimension, the second dimension with the second. So we will get a new vector of the same size of the input vectors.

How are these positional embeddings calculated? Let's see. Again we have a smaller sentence, let's say "your cat is". And you may have seen the following expressions from the paper. What we do is we create a vector of size D model, so 512. And for each position in this vector, we calculate the value using these two expressions, using these arguments.

So the first argument indicates the position of the word inside the sentence; for example, the word "your" occupies position zero. For the even dimensions, so 0, 2, 4, up to 510, etc., we use the first expression, the sine. And for the odd positions of this vector, we use the second expression, the cosine.

And we do this for all the words inside the sentence. So this particular value is calculated as PE(1, 0), because it's the first word, dimension zero. The first number represents the argument pos, and the zero represents the argument 2i. And PE(1, 1) means the first word, dimension one.

So there we will use the cosine, with the position equal to one and 2i + 1 equal to 1. And we do the same for the second word, the third word, etc. If we have another sentence, we will not have different positional encodings: we will have the same vectors, even for different sentences, because the positional encodings are computed once and reused for every sentence that our model will see, during inference or training.
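
Here is a minimal sketch of how these vectors can be computed with the two expressions from the paper, sine for the even dimensions and cosine for the odd ones; the exponential form in the code is just a numerically convenient way of writing 1 / 10000^(2i / d_model).

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # 1 / 10000^(2i/d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = positional_encoding(seq_len=6, d_model=512)   # computed once, then reused
embeddings = torch.randn(1, 6, 512)                # stand-in for the input embeddings
x = embeddings + pe.unsqueeze(0)                   # same (1, 6, 512) shape as before
```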

So we only compute the positional encodings once, when we create the model; we save them and then reuse them. We don't need to compute them every time we feed a sentence to the model. So why did the authors choose the sine and cosine functions to represent positional encodings?

Let's look at the plot of these two functions. One axis of the plot is the position, so the position of the word inside the sentence, and the depth axis is the dimension along the vector, so the 2i that you saw in the previous expressions. And if we plot them, we can see, as humans, a pattern here.

And we hope that the model can also see this pattern. Okay, the next layer of the encoder is the multi-head attention. We will not go inside the multi-head attention right away; we will first visualize single-head attention, so self-attention with a single head. And let's do it. So what is self-attention?

Self-attention is a mechanism that existed before they introduced the transformer. The authors of the transformer just changed it into a multi-head attention. So how did the self-attention work? The self-attention allows the model to relate words to each other. Okay, so we had the input embeddings that capture the meaning of the word.

Then we had the positional encoding that gives the information about the position of the word inside of the sentence. Now we want this self-attention to relate words to each other. Now imagine we have an input sequence of 6 words with a d-model of size 512, which can be represented as a matrix that we will call Q, K, and V.

So our Q, K, and V are the same matrix representing the input. So the input of 6 words with the dimension of 512. So each word is represented by a vector of size 512. We basically apply this formula we saw here from the paper to calculate the attention, the self-attention in this case.

Why self-attention? Because each word in the sentence is related to other words in the same sentence. So it's self-attention. So we start with our Q matrix, which is the input sentence. Let's visualize it, for example. So we have 6 rows, and on the columns we have 512 columns.

Now they are really difficult to draw, but let's say we have 512 columns, and here we have 6. Now what we do, according to this formula, we multiply it by the same sentence but transposed. So the transposed of the K, which is again the same input sequence, we divide it by the square root of 512, and then we apply the softmax.

The output of this, as we saw before in the matrix notation at the beginning: when we multiply a matrix that is 6 by 512 with another matrix that is 512 by 6, we obtain a new matrix that is 6 by 6. The first value in this matrix is the dot product of the first row with the first column, the next value is the dot product of the first row with the second column, and so on.

The values here are actually randomly generated, so don't concentrate on the values. What you should notice is that the softmax transforms all these values in such a way that each row sums up to 1. So this row, for example, here, sums up to 1.

This other row also sums up to 1, etc., etc. And this value we see here is the dot product of the first word with the embedding of the word itself. This value here is the dot product of the embedding of the word "your" with the embedding of the word "cat".

And this value here is the dot product of the embedding of the word "your" with the embedding of the word "is". And this value somehow represents a score of how intense the relationship between one word and another is. Let's go ahead with the formula. So for now we just multiplied Q by K transposed, divided by the square root of d_k, and applied the softmax, but we didn't multiply by V.

So let's go forward. We multiply this matrix by V, and we obtain a new matrix, which is 6 by 512. So if we multiply a matrix that is 6 by 6 with another that is 6 by 512, we get a new matrix that is 6 by 512. And one thing you should notice is that the dimension of this matrix is exactly the dimension of the initial matrix from which we started.

What does this mean? It means we obtain a new matrix with 6 rows, so let's say 6 rows, and 512 columns; these are our words, so we have 6 words, and each word has an embedding of dimension 512. But now this embedding here represents not only the meaning of the word, which was given by the input embedding, and not only the position of the word, which was added by the positional encoding: these values form a special embedding that also captures the relationship of this particular word with all the other words.

And this particular embedding of this word here also captures not only its meaning, not only its position inside of the sentence, but also the relationship of this word with all the other words. I want to remind you that this is not the multi-head attention, we are just watching the self-attention, so one head.
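
Here is a minimal sketch of this single-head self-attention, exactly the formula softmax(QKᵀ / √d_k)·V with Q, K and V all equal to the input; the numbers are random, the point is the shapes and the row-wise softmax.

```python
import math
import torch

def self_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (6, 6) word-to-word scores
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v                                   # (6, 512)

x = torch.randn(6, 512)        # 6 words, d_model = 512
out = self_attention(x, x, x)  # in self-attention, Q, K and V are all the input
print(out.shape)               # torch.Size([6, 512]) -- same shape as the input
```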

We will see later how this becomes the multi-head attention. Self-attention has some properties that are very desirable. First of all, it's permutation invariant (strictly speaking, permutation equivariant). What does that mean? It means that if we have a matrix of 6 words, or in this case let's say just 4 words, A, B, C, and D, and suppose that applying the formula from before produces this particular matrix, with a new special embedding for the word A, a new special embedding for the word B, and new special embeddings for the words C and D, we can call them A', B', C', and D'.

If we change the position of these two rows, the values will not change; only the position of the corresponding output rows will change accordingly. So the values of B' will not change, B' will just change position, and C' will also change position, but the values inside each vector will not change, and this is a desirable property.

Self-attention, as presented so far, requires no parameters. I mean, I didn't introduce any parameter that is learned by the model: I just took the initial sentence of, in this case, 6 words, we multiplied it by itself, we divided it by a fixed quantity, which is the square root of 512, and then we applied the softmax, which does not introduce any parameter. So for now the self-attention didn't require any parameter, except for the embeddings of the words.

This will change later, when we introduce the multi-head attention. Because each value in the softmax matrix is a dot product of a word's embedding with itself or with the other words, we expect the values along the diagonal to be the maximum, because each of them is the dot product of a word with itself.

And there is another property of this matrix: we can change values in it before we apply the softmax. Suppose we don't want the words "your" and "cat" to interact with each other, or we don't want the words "is" and "lovely" to interact with each other. What we can do is, before we apply the softmax, replace this value with minus infinity, and also this value with minus infinity. When we then apply the softmax, it will turn minus infinity into 0, because, as you remember, the softmax uses e to the power of x, and if x goes to minus infinity, e to the power of minus infinity becomes very, very close to 0, so basically 0.

This is a desirable property that we will use in the decoder of the transformer. Now let's have a look at what multi-head attention is. What we just saw was self-attention, and we want to convert it into multi-head attention. You may have seen these expressions from the paper, but don't worry, I will explain them one by one.

So let's go. Imagine we have our encoder, so we are on the encoder side of the transformer, and we have our input sentence, which is, let's say, 6 by 512: 6 words, where 512 is the size of the embedding of each word. In this case I call it sequence by d_model, where sequence is the sequence length, as you can see in the legend at the bottom left of the slide, and d_model is the size of the embedding vector, which is 512.

What we do, just like the picture shows, is take this input and make 4 copies of it. One will be sent along this connection we can see here, and 3 will be sent to the multi-head attention with 3 respective names, so it's the same input that becomes 3 matrices that are equal to the input.

One is called query, one is called key, and one is called value. So basically we are taking this input and making 3 copies of it, which we call Q, K, and V. They have, of course, the same dimensions. What does the multi-head attention do? First of all, it multiplies these 3 matrices by 3 parameter matrices called W_Q, W_K, and W_V.

These matrices have dimension d_model by d_model. So if we multiply a matrix that is sequence by d_model with another one that is d_model by d_model, we get a new matrix as output that is sequence by d_model, so basically the same dimension as the starting matrix. And we will call them Q', K', and V'.

Our next step is to split these matrices into smaller matrices. Let's see how. We can split this matrix Q' along the sequence dimension or along the d_model dimension. In the multi-head attention we always split along the d_model dimension, so every head will see the full sentence but a smaller part of the embedding of each word.

So if we have an embedding of, let's say, 512, it will become smaller embeddings of size 512 divided by 4. And we call this quantity d_k. So d_k is d_model divided by h, where h is the number of heads; in our case we have h equal to 4. We can then calculate the attention between these smaller matrices, so Q1, K1, and V1, using the expression taken from the paper.

And this will result in smaller matrices called head1, head2, head3, and head4. The dimension of head1 up to head4 is sequence by d_v. What is d_v? It's basically equal to d_k; it's just called d_v because the last multiplication is done with V, and in the paper they call it d_v, so I'm sticking to the same names.

Our next step is to combine these matrices, these small heads, by concatenating them along the d_v dimension, just like the paper says. So we concatenate all these heads together and we get a new matrix that is sequence by h · d_v, and since d_v is equal to d_k, h · d_v is equal to d_model.

So we get back the initial shape, sequence by d_model, here. The next step is to multiply the result of this concatenation by W_O, and W_O is a matrix whose first dimension is h · d_v, so d_model, and whose other dimension is d_model. And the result of this is a new matrix that is the result of the multi-head attention, which is sequence by d_model.

So the multi-head attention, instead of calculating the attention between these matrices here, so Q', K', and V', splits them along the d_model dimension into smaller matrices and calculates the attention between these smaller matrices. So each head is watching the full sentence, but a different aspect of the embedding of each word.
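
Putting these steps together, here is a minimal sketch of the whole multi-head attention block, with h = 4 heads as in the example above (the paper itself uses h = 8); the parameter matrices W_Q, W_K, W_V and W_O are implemented as linear layers.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 4):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W_Q, shape (d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)   # W_K
        self.w_v = nn.Linear(d_model, d_model)   # W_V
        self.w_o = nn.Linear(d_model, d_model)   # W_O, shape (h * d_v, d_model)

    def forward(self, q, k, v):
        batch, seq_len, d_model = q.shape
        # Q' = Q W_Q, K' = K W_K, V' = V W_V, then split along d_model into h heads.
        q = self.w_q(q).view(batch, seq_len, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch, seq_len, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch, seq_len, self.h, self.d_k).transpose(1, 2)
        # Each head sees the full sentence but only d_k dimensions of each word.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = torch.softmax(scores, dim=-1) @ v             # (batch, h, seq, d_k)
        # Concatenate the heads back into (batch, seq, d_model) and apply W_O.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(concat)

x = torch.randn(1, 6, 512)
mha = MultiHeadAttention(d_model=512, h=4)
print(mha(x, x, x).shape)   # torch.Size([1, 6, 512]) -- sequence by d_model again
```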

Why do we want this? Because we want each head to watch a different aspect of the same word. For example, in the Chinese language, but also in other languages, one word may be a noun in some cases, may be a verb in some other cases, and may be an adverb in some other cases, depending on the context.

So what we want is that one head maybe learns to relate that word as a noun, another head maybe learns to relate that word as a verb, and another head learns to relate that word as an adjective or an adverb. This is why we want multi-head attention. Now, you may also have seen online that the attention can be visualized.

And I will show you how. When we calculate the attention between the Q and the K matrices, so when we do this operation, the softmax of Q multiplied by K transposed divided by the square root of d_k, we get a new matrix just like we saw before, which is sequence by sequence.

And this represents a score that represents the intensity of the relationship between the two words. We can visualize this, and this will produce a visualization similar to this one, which I took from the paper, in which we see how all the heads work. So for example, if we concentrate on this word "making", this word here, we can see that "making" is related to the word "difficult", so this word here, by different heads.

So the blue head, the red head, and the green head. But, let's say, the violet head is not relating these two words together. So "making" and "difficult" are not related by the violet or the pink head; the violet head and the pink head are relating the word "making" to other words, for example to this word "2009".

Why is this the case? Maybe because this pink head could see a part of the embedding that the other heads could not see, and that made this interaction possible between these two words. You may also be wondering why these three matrices are called query, keys, and values. Okay, the terms come from database terminology, or from Python-like dictionaries.

But I would also like to give an interpretation of my own, making a very simple example. I think it's quite easy to understand. So imagine we have a Python-like dictionary, or a database, in which we have keys and values. The keys are the category of movies, and the values are the movies belonging to that category.

In my case, I just put one value. So we have Romantics category, which includes Titanic, we have action movies that include The Dark Knight, etc. Imagine we also have a user that makes a query, and the query is "love". Because we are in the transformer world, all these words actually are represented by embeddings of size 512.

So what our transformer will do is convert this word "love" into an embedding of size 512. All these keys and values are already embeddings of size 512, and it will calculate the dot product between the query and all the keys, just like in the formula. As you remember, the formula is the softmax of the query multiplied by the transpose of the keys, divided by the square root of d_k.

So we are doing the dot product of all the queries with all the keys. In this case, the word "love" with all the keys, one by one. And this will result in a score that will amplify some values or not amplify other values. In this case, our embedding may be in such a way that the word "love" and "romantic" are related to each other, the word "love" and "comedy" are also related to each other, but not so intensively like the word "love" and "romantic".

So it's, how to say, a less strong relationship. But maybe the words "horror" and "love" are not related at all, so maybe their softmax score is very close to zero. Our next layer in the encoder is the add and norm, and to introduce the add and norm, we need layer normalization.

So let's see what layer normalization is. Okay, let's make a practical example. Imagine we have a batch of n items, in this case n is equal to 3: item 1, item 2, item 3. Each of these items will have some features; it could be an embedding, so for example a vector of 512 features, or it could be a very big vector of thousands of features, it doesn't matter.

What we do is we calculate the mean and the variance of each of these items independently from the others, and we replace each value with another value given by this expression. So basically we are normalizing, so that the new values of each item have zero mean and unit variance.

Actually we also multiply this new value with a parameter called gamma, and then we add another parameter called beta, and this gamma and beta are learnable parameters. And the model should learn to multiply and add these parameters so as to amplify the value that it wants to be amplified and not amplify the value that it doesn't want to be amplified.
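
A minimal sketch of what this layer computes follows; in practice you would just use torch.nn.LayerNorm, which does the same thing with gamma and beta registered as trainable parameters.

```python
import torch

def layer_norm(x, gamma, beta, eps: float = 1e-5):
    # Mean and variance are computed per item, over its own features only.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    normalized = (x - mean) / torch.sqrt(var + eps)   # zero mean, unit variance
    return gamma * normalized + beta                  # learnable scale and shift

x = torch.randn(3, 512)        # a batch of 3 items, each with 512 features
gamma = torch.ones(512)        # learnable parameters (shown here at their usual init)
beta = torch.zeros(512)
print(layer_norm(x, gamma, beta).shape)   # torch.Size([3, 512])
```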

So we don't just normalize, we also introduce some parameters. And I found a really nice visualization from paperswithcode.com in which we see the difference between batch norm and layer norm. If N is the batch dimension, in layer normalization we compute the statistics over all the values belonging to one item of the batch, while in batch norm we compute them for the same feature across all the items of the batch.

So in batch norm we are mixing, let's say, values from different items of the batch, while in layer normalization we treat each item of the batch independently, and each item will have its own mean and its own variance. Let's look at the decoder now.

Now, in the encoder we saw the input embeddings; in the decoder they are called output embeddings, but they work the same way. Here we also have the positional encoding, and it is also the same as in the encoder. The next layer is the masked multi-head attention, and we will see it now.

We also have a multi-head attention here, and here we should see that the encoder produces an output that is sent to the decoder in the form of keys and values, while the query, so this connection here, is the query coming from the decoder. So this multi-head attention is not a self-attention anymore, it's a cross-attention, because we are taking two sentences.

One is sent from the encoder side, so let's write encoder, in which we provide the output of the encoder and we use it as keys and values, while the output of the masked multihead attention is used as the query in this multihead attention. The masked multihead attention is a self-attention of the input sentence of the decoder.

So we take the input sentence of the decoder, we transform into embeddings, we add the positional encoding, we give it to this multihead attention in which the query key and values are the same input sequence, we do the add and norm, then we send this as the queries of the multihead attention, while the keys and the values are coming from the encoder, then we do the add and norm.

I will not be showing the feedforward, which is just a fully connected layer. We then send the output of the feedforward to the add and norm, and finally to the linear layer, which we will see later. So let's have a look at the masked multihead attention and how it differs from a normal multihead attention.

What we want, our goal, is to make the model causal: it means that the output at a certain position can only depend on the words in the previous positions. So the model must not be able to see future words. How can we achieve that? As you saw, the output of the softmax in the attention calculation formula is this matrix, sequence by sequence.

If we want to hide the interaction of some words with other words, we delete this value and we replace it with minus infinity before we apply the softmax, so that the softmax will replace this value with zero. And we do this for all the interaction that we don't want.

So we don't want "your" to watch future words; we don't want "your" to watch "cat is a lovely cat". And we don't want the word "cat" to watch future words, but only the words that come before it, or the word itself. So we don't want this, this, this, this.

The same goes for the other words, etc. So we can see that we are replacing all these values here that are above this diagonal, the principal diagonal of the matrix. And we want all the values above this diagonal to be replaced with minus infinity, so that the softmax will replace them with zero.

Let's see in which stage of the multi-head attention this mechanism is introduced. When we calculate the attention between these smaller matrices, so Q1, K1, and V1, before we apply the softmax we replace these values, so this one, this one, this one, this one, etc., with minus infinity.

Then we apply the softmax, and the softmax will take care of transforming these values into zeros. So basically we don't want these words to interact with each other, and if we don't allow this interaction, the model will learn not to make them interact, because it will not get any information from this interaction. So it's like these words cannot interact.
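
Here is a minimal sketch of that masking step: we build the matrix of word-to-word scores, fill everything above the principal diagonal with minus infinity, and let the softmax turn those positions into zeros.

```python
import math
import torch

seq_len, d_k = 6, 512
q = k = v = torch.randn(seq_len, d_k)      # "your cat is a lovely cat"

scores = q @ k.T / math.sqrt(d_k)          # (6, 6) word-to-word scores
# True above the principal diagonal = the future words we want to hide.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)    # the -inf positions become 0
out = weights @ v
print(weights[0])   # the first word can only attend to itself: one non-zero value
```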

Now let's look at how inference and training work for a transformer model. As I said previously, we will be dealing with a translation task, because it's easy to visualize and it's easy to understand all the steps.

Let's start with the training of the model. We will go from an English sentence "I love you very much" into an Italian sentence "Ti amo molto". It's a very simple sentence. It's easy to describe. Let's go. We start with a description of the transformer model and we start with our English sentence which is sent to the encoder.

So this is our English sentence here, on which we prepend and append two special tokens: one is called start of sentence and one is called end of sentence. These two tokens are taken from the vocabulary, so they are special tokens in our vocabulary that tell the model where a sentence starts and where it ends.

We will see later why we need them. For now, just think that we take our sentence, we prepend a special token and we append a special token. Then what do we do? As you can see from the picture, we take our inputs, we transform them into input embeddings, we add the positional encoding, and then we send it to the encoder.

So this is our encoder input, sequence by d_model. We send it to the encoder, and it will produce an output which is sequence by d_model, called the encoder output. As we saw previously, the output of the encoder is another matrix that has the same dimensions as the input matrix. We can see it as a sequence of embeddings in which each embedding is special, because it captures not only the meaning of the word, which was given by the input embedding we saw here, and not only the position, which was given by the positional encoding, but also the interaction of every word with every other word in the same sentence, because this is the encoder.

So we are talking about self-attention: the interaction of each word in the sentence with all the other words in the same sentence. We want to convert this sentence into Italian, so we prepare the input of the decoder, which is start of sentence followed by "ti amo molto". As you can see from the picture of the transformer, the outputs are "shifted right".

What does it mean to shift right? Basically, it means we prepend a special token called SOS, start of sentence. You should also notice that, when we actually code the transformer (if you watch my other video on how to code a transformer, you will see this), we make these sequences of fixed length, so that whether we have a short sentence like "ti amo molto" or a very long one, when we feed them to the transformer they all become the same length.

How to do this? We add padding words to reach the desired length. So if our model can support, let's say, a sequence length of 1000, in this case we have 4 tokens, we will add 996 tokens of padding to make this sentence long enough to reach the sequence length.

Of course, I'm not doing it here, because otherwise it's not easy to visualize. Okay, we prepare this input for the decoder. We transform it into embeddings, we add the positional encoding, and then we send it first to the masked multi-head attention, along with the causal mask. Then we take the output of the encoder and send it to the decoder as keys and values, while the queries come from the masked multi-head attention layer: so the queries come from this layer, and the keys and the values are the output of the encoder.

The output of all this block here, so all this big block here, will be a matrix that is sequence by d_model, just like for the encoder. However, we can see that each position is still an embedding: it's a vector of size d_model, so 512 numbers. How can we relate this embedding back to our vocabulary?

How can we understand what this word is in our vocabulary? That's why we need a linear layer that maps sequence by d_model into sequence by vocabulary size. For every embedding it sees, it will tell us the position of that word in our vocabulary, so that we can understand which actual token is output by the model.

After that we apply the softmax, and then we have our label: what we expect the model to output given this English sentence. We expect the model to output "ti amo molto" followed by end of sentence, and this is called the label, or the target. What do we do when we have the output of the model and the corresponding label?

We calculate the loss, in this case the cross-entropy loss, and then we backpropagate the loss to all the weights. Now let's understand why we have these special tokens called SOS and EOS. Basically, you can see that here the sequence length is 4 (actually it's 1000, because we have the padding, but let's say we don't have any padding, so it's 4 tokens): start of sentence "ti amo molto", and what we want is "ti amo molto" end of sentence.

So when our model sees the start-of-sentence token, it will output the first token, "ti". When it sees "ti" it will output "amo", when it sees "amo" it will output "molto", and when it sees "molto" it will output end of sentence, which indicates that, okay, the translation is done. And we will see this mechanism in the inference.

Ah, and this all happens in one time step, just as I promised at the beginning of the video. I said that with recurrent neural networks we need n time steps to map an n-token input sequence into an n-token output sequence, and that this problem would be solved with the transformer. Yes, it has been solved: as you can see, we didn't do any for loop, we did everything in one pass. We give an input sequence to the encoder and an input sequence to the decoder, we produce some outputs, we calculate the cross-entropy loss with the label, and that's it; it all happens in one time step. And this is the power of the transformer, because it made it very easy and very fast to train on very long sequences, with the very nice performance that you can see in ChatGPT, in GPT, in BERT, etc.
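
To show that the whole training step really is one forward and one backward pass, here is a minimal sketch built on PyTorch's ready-made nn.Transformer; the token IDs, the tiny vocabulary and the shared embedding table are all made up for illustration, and the positional encoding is left out for brevity.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 512                     # toy numbers, not the real setup
SOS, EOS = 0, 1                                    # hypothetical special-token IDs
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
proj = nn.Linear(d_model, vocab_size)              # the final linear layer

src    = torch.tensor([[SOS, 11, 12, 13, 14, 15, EOS]])  # "<SOS> I love you very much <EOS>"
tgt_in = torch.tensor([[SOS, 21, 22, 23]])               # "<SOS> ti amo molto" (shifted right)
label  = torch.tensor([[21, 22, 23, EOS]])               # "ti amo molto <EOS>"

causal_mask = transformer.generate_square_subsequent_mask(tgt_in.size(1))
out = transformer(embed(src), embed(tgt_in), tgt_mask=causal_mask)   # (1, 4, d_model)
logits = proj(out)                                                    # (1, 4, vocab_size)

loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), label.view(-1))
loss.backward()    # one forward pass, one backward pass -- no loop over time steps
```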

Let's have a look at how inference works. Again we have our English sentence "I love you very much", we want to map it into an Italian sentence "ti amo molto". We have our usual transformer, we prepare the input for the encoder, which is start of sentence "I love you very much", end of sentence.

We convert it into input embeddings, then we add the positional encoding, and so we prepare the input for the encoder and we send it to the encoder. The encoder will produce an output, which is sequence by d_model, and we saw before that it's a sequence of special embeddings that capture the meaning, the position, and also the interaction of each word with all the other words.

What we do is, for the decoder, we give it just the start of sentence, and of course we add enough padding tokens to reach our sequence length. So we just give the model the start of sentence token, and again, for this single token, we convert it into embeddings, we add the positional encoding, and we send it to the decoder as the decoder input.

The decoder will take this, its input, as the query, and the keys and the values coming from the encoder, and it will produce an output, which is sequence by d_model. Again, we use the linear layer to project it back to our vocabulary, and this projection is called the logits.

What we do is apply the softmax, and, given the logits, we select the position with the maximum softmax score. This is how we know which word to select from the vocabulary. And this, hopefully, should produce the first output token, which is "ti", if the model has been trained correctly.

This, however, happens at time step 1. When we train the transformer model, everything happens in one pass: we have one input sequence and one output sequence, we give them to the model, we do one time step, and the model learns from it. When we run inference, however, we need to do it token by token, and we will also see why this is the case.

At time step 2, we don't need to recompute the encoder output again, because our English sentence didn't change, so the encoder will produce the same output for it. Then, what we do is take the output of the previous step, so "ti", append it to the input of the decoder, and feed it to the decoder, again with the output of the encoder from the previous step. This will produce an output sequence from the decoder side, which we again project back into our vocabulary, and we get the next token, which is "amo".

So, as I said before, we are not recalculating the output of the encoder for every time step, because our English sentence didn't change at all. What is changing is the input of the decoder, because at every time step, we are appending the output of the previous step to the input of the decoder.

We do the same for time step 3, and we do the same for time step 4. And hopefully we will stop when we see the end of sentence token, because that's how the model tells us to stop inferencing. And this is how inference works; this is why we needed four time steps.
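
Continuing the little nn.Transformer sketch from the training part (so embed, transformer, proj, SOS and EOS are the toy pieces defined there), the token-by-token inference loop we just walked through, taking the most probable word at each step, could look roughly like this.

```python
import torch

@torch.no_grad()
def greedy_decode(src: torch.Tensor, max_len: int = 50) -> torch.Tensor:
    # Encode the source sentence once: it does not change between time steps.
    memory = transformer.encoder(embed(src))
    tgt = torch.tensor([[SOS]])                        # start with only <SOS>
    for _ in range(max_len):
        mask = transformer.generate_square_subsequent_mask(tgt.size(1))
        out = transformer.decoder(embed(tgt), memory, tgt_mask=mask)
        logits = proj(out[:, -1])                      # logits for the last position
        next_token = logits.argmax(dim=-1, keepdim=True)   # greedy: take the maximum
        tgt = torch.cat([tgt, next_token], dim=1)      # append it to the decoder input
        if next_token.item() == EOS:                   # the model tells us to stop
            break
    return tgt
```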

When we run inference with a model, like this translation model, there are many strategies for inferencing. What we used is called the greedy strategy: for every step, we take the word with the maximum softmax value. This strategy usually works, and not badly, but there are better strategies, and one of them is called beam search.

In beam search, instead of greedily taking the maximum softmax value at every step (that's why the other strategy is called greedy), we keep the top B values; then, for each of these choices, we inference what the next possible tokens are, and at every step we keep only the B most probable sequences and delete the others. This is called beam search, and generally it performs better.
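
And here is a simplified beam search, again on top of the same toy model, keeping the B most probable partial sequences (ranked by cumulative log-probability) at every step; a real implementation would normally batch the beams and normalize by length, which is skipped here.

```python
import torch

@torch.no_grad()
def beam_search(src: torch.Tensor, beam_size: int = 3, max_len: int = 50) -> torch.Tensor:
    memory = transformer.encoder(embed(src))
    beams = [(torch.tensor([[SOS]]), 0.0)]            # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[0, -1].item() == EOS:            # finished beams are kept as they are
                candidates.append((tokens, score))
                continue
            mask = transformer.generate_square_subsequent_mask(tokens.size(1))
            out = transformer.decoder(embed(tokens), memory, tgt_mask=mask)
            log_probs = proj(out[:, -1]).log_softmax(dim=-1)        # (1, vocab_size)
            top_probs, top_ids = log_probs.topk(beam_size, dim=-1)
            for p, idx in zip(top_probs[0], top_ids[0]):
                candidates.append((torch.cat([tokens, idx.view(1, 1)], dim=1),
                                   score + p.item()))
        # Keep only the beam_size most probable sequences, delete the others.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[0, -1].item() == EOS for t, _ in beams):
            break
    return beams[0][0]                                 # the most probable sequence
```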

So thank you guys for watching. I know it was a long video, but it was really worth it to go through each aspect of the transformer. I hope you enjoyed this journey with me, so please subscribe to the channel, and don't forget to watch my other video on how to code a transformer model from scratch, in which I not only describe the structure of the transformer model again while coding it, but I also show you how to train it on a dataset of your choice and how to run inference with it, and I also provide the code on GitHub, along with a Colab notebook to train the model directly on Colab.

Please subscribe to the channel, and let me know what you didn't understand, so that I can give more explanation, and please tell me what are the problems in this kind of videos, or in this particular video, that I can improve for the next videos. Thank you very much, and have a great rest of the day!