
BERT explained: Training, Inference, BERT vs GPT/LLaMA, Fine-tuning, [CLS] token


Chapters

0:00 Introduction
2:00 Language Models
3:10 Training (Language Models)
7:23 Inference (Language Models)
9:15 Transformer architecture (Encoder)
10:28 Input Embeddings
14:17 Positional Encoding
17:14 Self-Attention and causal mask
29:14 BERT (overview)
32:08 BERT vs GPT/LLaMA
34:25 Left context and right context
36:36 BERT pre-training
37:05 Masked Language Model
45:01 [CLS] token
48:26 BERT fine-tuning
49:00 Text classification
50:50 Question answering

Transcript

Hello guys, welcome to my new video on BERT. In this video I will be explaining BERT from scratch. What do I mean by this? I will explain actually all the building blocks that make up BERT along with all the background knowledge that you need to understand this. So we will start with a little review of what are language models, how they are trained, how we inference language models, then we will see the transformer architecture, at least the one used by language models, so the encoder part.

We will review embedding vectors, positional encoding, self-attention and the causal mask. And then I will introduce BERT. So I will first give you enough background knowledge before you start learning BERT, so that you understand each building block of BERT. And then we will see concepts like what is the left context, what is the right context and how they are used in BERT.

We will see the two tasks on which BERT has been pre-trained, so the masked language model and the next sentence prediction task. And finally, we will also see what is fine tuning and how do we fine tune BERT with the text classification task and the question answering task. What do I expect you to know before watching this video?

Well, for sure, I hope that you are familiar with the transformer model. For example, if you have watched my previous video on the transformer model, that would be great because even if I will review some of the concepts of the transformer, I will not review all of them. So for example, I will not touch concepts like the cross attention or the multi-head attention or the normalization, the feed-forward layer, et cetera, et cetera, because I will also only give you enough background to understand BERT, so it's not a video on the transformer model.

So please, if you are not familiar with the transformer, go watch my previous video on the transformer model and then you can watch this video if you want to fully understand BERT. So let's start our journey with language models. What is a language model? Well, a language model is a probabilistic model that assigns probabilities to sequences of words.

In practice, a language model allows us to compute the following probability, which is the probability of the word, for example, China, following the sentence, "Shanghai is a city in". So what is the probability that the word China comes next in the sentence, "Shanghai is a city in"? This is the kind of probability that we model using language models, which is a neural network trained on a very large corpora of text.

When this corpus of text is very, very large, we also call them large language models. There are many examples of large language models. For example, we have LLaMA. For example, we have GPT. They are also called foundation models because they have been pre-trained on a very big corpus of text, for example, the entire Wikipedia or billions of pages from the internet.

And then we can use them with prompting or with fine-tuning. Later we will see how we do this. Okay, let's review how we train a large language model, or a language model in general, actually. So the training of a large language model requires that we have some corpus, so a piece of text, which could be the entire Wikipedia.

It could be web pages. It could be just one book or it could be anything. In my example, imagine we want to train a language model, or a large language model, on Chinese poems. And suppose we only have one poem, this one. It is a very famous poem by Li Bai.

It's one of the first poems that you learn if you want to study Chinese literature or the Chinese language. So we will concentrate only on the following line. So before my bed lies a pool of moon bright. Let's see how to train the large language model. Well, the first thing we do is we create a sequence of the line that we want to teach to the model.

So let's start with the first line. We create a sentence to which we prepend one token called start of sentence. This sentence or input, which is made up of tokens, in this case, in our very simple case, we can consider that each word is a token, but this is not always the case because depending on the tokenizer, each word may be split into multiple tokens.

But suppose for simplicity that our tokenizer always takes one word as one token. So we feed this input sequence to our neural network, which is modeling the language model. Usually it's the encoder part of the transformer, so only this part here, along with the linear layer and the softmax, for example, like LLaMA.

This transformer encoder will then output a sequence of tokens. The transformer is a sequence-to-sequence model, which means that if you give it a sequence of 10 tokens, it will output a sequence of 10 tokens. When we train a model, we have an input, and we also have a target.

The target is what we want the model to output. So we want the model to output the same sentence, but without prepending anything and instead appending one last token called end of sentence. So the total length is still 10 tokens. It's the same sentence as before, but instead of having a special token at the beginning, it has one at the end.

This token is called the end of sentence token. Now let's review why we need this start of sentence token and end of sentence token. Because, as I said before, the neural network that we are using, which is a transformer, is a sequence-to-sequence model. It means that if we give it N tokens as input, it will produce N tokens as output.

So if, for example, we give this neural network only the first token, so the start of sentence, it will only output one token as output. So, in case it has already been trained, it should output the first token of the target, so "before". Let me switch to the pen.

Okay. If we input the first two tokens, for example start of sentence "before", it should output the first two tokens of the target, "before my". If we input the first three tokens, start of sentence "before my", it should output the first three tokens of the target, so "before my bed". As you can see, every time we give it an input, the last token of the output, in case it has already been trained and it's matching the target, is the next token that we need to complete the sentence.

So for example, if we give only the first two tokens, start of sentence "before", the model outputs "before my", so it will give us the next token after "before". If we give it start of sentence "before my", it will output "before my bed", so the next token after the word "my", so "bed", et cetera, et cetera.

So every time we give the model some tokens, the model will return the next token as the last token in the output sequence. And this is how we train a large language model. So once we have our target tokens and the output, we compute the loss, which is the cross-entropy loss, and we run backpropagation.
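
To make this concrete, here is a minimal sketch in PyTorch, using a toy vocabulary and a trivial embedding-plus-linear stand-in for the actual transformer, of how the shifted input/target pair and the cross-entropy loss fit together:

```python
import torch
import torch.nn as nn

# Toy vocabulary for the first line of the poem (hypothetical IDs).
vocab = {"[SOS]": 0, "[EOS]": 1, "before": 2, "my": 3, "bed": 4,
         "lies": 5, "a": 6, "pool": 7, "of": 8, "moon": 9, "bright": 10}
line = ["before", "my", "bed", "lies", "a", "pool", "of", "moon", "bright"]

# Input: [SOS] + sentence.  Target: sentence + [EOS].  Same length (10 tokens).
input_ids  = torch.tensor([[vocab["[SOS]"]] + [vocab[w] for w in line]])
target_ids = torch.tensor([[vocab[w] for w in line] + [vocab["[EOS]"]]])

# Stand-in model (no attention at all): any language model that maps
# token IDs to next-token logits could be plugged in here instead.
model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Linear(32, len(vocab)))

logits = model(input_ids)                    # (1, seq_len, vocab_size)
loss = nn.functional.cross_entropy(          # cross-entropy between the
    logits.view(-1, len(vocab)),             # predicted distributions
    target_ids.view(-1))                     # and the shifted targets
loss.backward()                              # backpropagation
```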

Now let's review how we inference from a language model. So imagine you are a student who had to memorize Li Bai's poem, but you only remember the first two words. How do you survive an exam? So you only remember the first two words of the first line of the poem.

What you could do, imagine you already have a language model that has been trained on Chinese poems. You could ask the language model to write the rest of the poem for you, but of course you need to give the language model some input on what you want. That input is called the prompt.

So you tell the language model the first two tokens and the model will come up with the following tokens that make up the poem. Let's review the poem again, which is "before my bed lies a pool of moon bright". So let's start our inferencing. We give the model our first two tokens, prepending the start of sentence token, and we feed it to the neural network, which has already been pre-trained.

Then the model should output "before my bed". We take this last token, "bed", we append it to the input and we give it back to the model. So now we give it "before my bed" and we feed it again to the model, and the model will output the next token, which is "lies".

We take this last token lies, we append it again to the input and we feed it again to the transformer model and the model will output the next token of the line. And then we keep doing like this until we arrive to the end of the line or the end of the poem, which is indicated by the end of sentence token, depending on how we train the model.
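
As a rough sketch, the greedy decoding loop described here might look like this, assuming `model` is any trained language model that maps token IDs to logits (as in the earlier toy example):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, eos_id, max_new_tokens=20):
    # prompt_ids: (1, prompt_len) tensor, e.g. [SOS] "before" "my"
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                    # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()       # next token = argmax of the
        ids = torch.cat(                       # LAST position's distribution
            [ids, next_id.view(1, 1)], dim=1)  # append it to the input
        if next_id.item() == eos_id:           # stop at end of sentence
            break
    return ids
```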

In our case, suppose we only trained it on the first line of this poem. And this is how we inference from a language model, for example a language model like LLaMA. Now let's understand the architecture of the transformer model, at least the part that we are interested in, which is the encoder, because this is the architecture that is also used in BERT.

So we need to understand the building blocks. Even if you have already watched my video on the transformer model, let's review all these concepts again. As I said before, this is the vanilla transformer, the transformer model as presented in the original paper,

"Attention Is All You Need". Most language models are actually modeled using just the encoder side or the decoder side, depending on the application. And they have a last linear layer to project the tokens back into the vocabulary. Let's review the building blocks of this model. Actually, I will not review all of them.

So I will not review the normalization, the linear layer, or the feed-forward layer, because I hope you're already familiar with these. What I am actually interested in is the input embeddings, the positional encodings, and the multi-head attention, for which I will actually only do the case of a single head, because it's easier to visualize.

So if you want to have more information, please watch my previous video on the transformer model. Now, let's review the embedding vectors, what they are and how we use them. Okay. Usually when we train a language model or we inference a language model, we use a prompt or some input text for training.

This input is a text. The first thing we do is split this text into tokens. In our simple case, we will do a very simple tokenization in which we split on each word. Each word becomes a token. This is actually not always the case with language models. For example, in LLaMA or in other language models, we use the BPE tokenizer.

In BERT, we will see, we use the WordPiece tokenizer, so each word can become multiple tokens. In our simple case, we just pretend that each word is actually a token, with some tokens not mapping to words at all. For example, the start of sentence token is a special token that exists only virtually.

It's not actually part of the training text. Okay. The first thing we do is this tokenization, so each word becomes a token. We map each token into its position in the vocabulary. So imagine you have a very big corpus of text. This text is made up of words.

Each word will occupy a position in the vocabulary. So we map each word to its position in the vocabulary, or in this case, each token into its position in the vocabulary. Then we map each of these numbers, which are the positions of the tokens in the vocabulary, to an embedding vector of size 512.
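
A minimal sketch of this pipeline, with a toy whitespace tokenizer standing in for a real one:

```python
import torch
import torch.nn as nn

sentence = "before my bed lies a pool of moon bright"
tokens = sentence.split()                       # toy tokenizer: one word = one token

vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
input_ids = torch.tensor([vocab[t] for t in tokens])   # positions in the vocabulary

d_model = 512                                   # 768 in BERT-Base, 1024 in BERT-Large
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
token_embeddings = embedding(input_ids)         # (9, 512): one vector per token
print(token_embeddings.shape)
```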

Now, this size of 512 is the one used in the vanilla transformer. We will see that in BERT the size is 768, if I'm not mistaken. But for now, I will always refer to the configuration of the vanilla transformer, so the transformer as presented in "Attention Is All You Need".

So each of these input IDs, which is the position of the token in the vocabulary, is projected into an embedding. This embedding is a vector of size 512 that captures the meaning of each token, in this case of each word. But why do we use embedding vectors to capture the meaning of each token?

Let's review. For example, given the words "cherry", "digital" and "information", the idea is this. Imagine we live in a very simple world in which the embedding vector is not made up of 512 dimensions, but only two dimensions, so we can project these vectors on the XY plane. If we project them, and if the embedding vectors have been trained correctly, we will see that words with similar meaning point in similar directions in space, while words with different meaning point in different directions in space.

For example, the word "digital" and "information", because they capture the same kind of semantic meaning "information", they will point to similar directions. And we can measure this similarity by measuring the angle between them. So for example, the angle between "digital" and "information" is very small, you can see here, while the angle between "cherry" and "digital" is quite big, because they represent different semantic groups, so they have different meaning.

Imagine there is also another word called "tomato". We expect the word "tomato" to point in a direction very similar to "cherry", for example it may be here, so that the angle between "cherry" and "tomato" is very small. And we measure this angle between vectors using the cosine similarity, which is based on the dot product between two vectors.
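
For example, with some made-up two-dimensional embeddings, the cosine similarity could be computed like this:

```python
import torch
import torch.nn.functional as F

# Hypothetical 2-dimensional embeddings, just to illustrate the idea.
cherry      = torch.tensor([0.9, 4.5])
tomato      = torch.tensor([1.1, 4.2])
digital     = torch.tensor([4.7, 0.8])
information = torch.tensor([4.4, 1.2])

# cosine similarity = dot(a, b) / (|a| * |b|): close to 1 for small angles.
print(F.cosine_similarity(digital, information, dim=0))  # high: similar meaning
print(F.cosine_similarity(cherry, tomato, dim=0))        # high: similar meaning
print(F.cosine_similarity(cherry, digital, dim=0))       # lower: different meaning
```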

And we will see that this dot product is very important, because we will use it in the attention mechanism that we will see later. Ok, now let's review the positional encodings as presented in the original paper of the transformer model. We need to give some positional information to our model, because now we only gave some vectors that represent the meaning of the word.

But we also need to tell the model that this particular word is in the position 1 in the sentence, and this particular word is in position 2 in the sentence, etc. etc. And this is the job of the positional encodings. Let's see how they work. So we start with our original sentence, which is the first line of the Chinese poem we saw before.

We convert it into embedding vectors of size 512. Then to each of these embedding vectors, we add another vector. This is called the positional encoding or positional embedding; I have seen both names used. This position embedding actually indicates the position of this particular token inside the sentence. And this vector here indicates position 1 of this token.

And this one indicates position 2, and then position 3 and position 4. Now, actually, these position embeddings, at least in the vanilla transformer, are computed once and reused for every sentence during training and inference. So they are not specific to a particular token, but only to a particular position, which means that every token in position 1, in any sentence, will receive this same vector added to it, the one that represents position number 1.

So the result of this addition is these vectors here, which will become the input of the encoder, as we will see later. How do we compute these positional encodings, at least as presented in the original transformer paper? Well, suppose we have a sentence made up of three words, or three tokens.

We have seen these formulas before, from the paper "Attention Is All You Need". We create a vector of size 512, and for the even dimensions of this vector we use the first formula, and for the odd dimensions we use the second formula. The first argument of these two formulas is pos, which indicates the position of the word inside the sentence.

So for the first token, it's 0. And 2i indicates the dimension of the vector to which we are applying the formula. And we can compute it also for the second vector, for the third position, etc. If we have another, different sentence, for example "I love you", which is also made up of three tokens, we will reuse the same vectors as for the other sentence.
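
For reference, a small sketch of these sinusoidal positional encodings:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len).unsqueeze(1)                 # (seq_len, 1)
    i = torch.arange(0, d_model, 2)                          # even dimensions 2i
    angle = pos / (10000 ** (i / d_model))                   # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                           # even dims: sine
    pe[:, 1::2] = torch.cos(angle)                           # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=3, d_model=512)  # one vector per position
# The same vectors are reused for every sentence: position 0 always gets pe[0], etc.
```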

So this particular vector here is associated with position 0, not with the token "before". So if we have another sentence, we will reuse the same vector for position 0. Ok, now let's review the self-attention that we use in language models, because language models need to find a way to relate tokens with each other, so that they can compute some kind of interaction between tokens.

So for example, tokens have a meaning not by themselves, but by the way they are present inside of the sentence and their relationship with other tokens inside of the sentence. And this is the job of the self-attention, which is done here in the multi-head attention. We will not see the multi-head attention, we will see the single head attention.

So let's start. Now let's build the input for this self-attention because for now we have worked with independent tokens. So we took this token and we converted it into a vector. Then we added the positional encodings. We first converted it into embedding, then we added the positional encoding to capture the position information.

But actually I lied to you by telling you that we work independently. Actually when we code the transformer, we always work with the matrix form. So all these tokens, all these vectors are never alone. They are always in a big matrix that contains all of them. So now we create this big matrix.

But before, I did not work with the big matrix directly because it's not easy to visualize. So now we create this big matrix: we combine all these vectors into one big matrix here, in which each row is one of these vectors. So for example, the first vector here becomes the first row.

The second vector here becomes the second row. The third vector here becomes the third row, et cetera, et cetera, et cetera. The shape of this matrix is 10 by 512 because we have 10 tokens and each token is represented by a vector of 512 dimensions. We take this matrix and we make three copies of it.

So three identical copies of this matrix. The first one, we will call it query. The second one, we will call it key. And the third one, we will call it value. As you can see, the values in this matrix are all the same because there are three identical copies.

Why do we use three identical copies? Because this is the self-attention mechanism, which means that we relate tokens that belong to the same sentence to each other. If we relate tokens of two different sentences or from two different languages, for example when we are doing language translation, in that case we talk about cross-attention.

In this case, we talk about self-attention. And this is the kind of attention that is used in language models. OK, the self-attention mechanism, as you have probably seen in the paper, works with this formula here: the attention is calculated as softmax(Q · K^T / sqrt(d_k)) · V, that is, the softmax of the query multiplied by the transpose of the keys, divided by the square root of dk, and then multiplied by V.

What is this dk here? The dk here actually represents the dimension of the vector of each head in case of the multi-head attention, because we are actually simplifying our scenario and working with only one head. In our case, dk corresponds to d model. So that is the size of the embedding vector that we created before, which is 512.

So we take our matrix, the one we built before, so the query, so 10 by 512. We multiply it by the transpose of the keys, which becomes a 512 by 10. So they are basically the identical matrix, but one is transposed and one is not. We divide it by the square root of 512.

We apply the softmax and this will produce the following matrix we can see here, in which each value is the softmax of the dot product of the embedding of one token with the embedding of another token. For example, let's visualize it with some vectors. So we built the matrix before, which is made up of 10 rows, because each row is a token, and each row contains 512 numbers, because it's a vector of 512 dimensions.

So dimension one up to 512, then dimension one up to 512, and we have 10 of them. This is the transpose of this matrix, so we will not have 10 rows, but 10 columns of vectors, each with dimensions one up to 512. Then we have another column vector here with dimensions one up to 512, etc., etc.

We have another one, etc. So this value here is the dot product of the first row of the first matrix with the first column of the second matrix, that is, the embedding of the first token (which, if you remember, is the start of sentence) with the embedding of the token start of sentence, and it's this value here.

Then this value here is the dot product of the embedding of the start of sentence with the embedding of the second token, which is this one here, the token "before". So this is "before", etc. Then we apply the softmax. The softmax basically changes the values in this matrix in such a way that they sum up to one.

So each row in this matrix, this row for example here, sums up to one, and also this row here sums up to one, etc., etc. As you can see here, the word start of sentence is able to relate to the word "before". And this is not what we want, because as I said before, our goal is to build a language model.

That is, a language model is a probabilistic model that assigns probabilities to sequences of words. We want to calculate the probability of the word "China" being the next word in the sentence "Shanghai is a city in". That is, we want to condition the word "China" only on the words that come before it.

That is, "Shanghai is a city in". So our model should only be able to watch this part of the sentence to predict the next token. This is also called the left context. But this is not what is happening here, because we are able to relate tokens that come in the future with tokens that come in the past.

So for example, the word SOS is being related with the token "before". And the token SOS is also being related with the token "my", even if the token "my" comes after it. So what we need to do is introduce the causal mask. Let's see how it works. The causal mask works like this.

We take the matrix that we saw before with all the attention scores, including all the interactions that are not causal. All the interactions of a word with the words that come to its right are replaced with minus infinity. For example, start of sentence should not be able to relate to the word "before".

So we replace the interaction with minus infinity before we apply the softmax. And then also, for example, the word "before" should not be able to watch the words "my", "bed", "lies", "pool". So all these interactions are also replaced with minus infinity. So basically, all the values above the principal diagonal that you can see here are replaced with minus infinity.

Then we apply the softmax. And if you remember the formula for the softmax, you can see here that the numerator is e to the power of z_i, where z_i is the item to which you are applying the softmax. And e to the power of minus infinity will become zero.

So basically we replace them with minus infinity so that when we apply the softmax, they will be replaced with zero by the softmax. This way the model will not have access to any information of interactions between the word start of sentence and all the tokens that come after it because we replace them with zero.
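
Putting the pieces together, here is a minimal single-head sketch of this causally masked self-attention, omitting the learned projection matrices and the multi-head splitting that a real transformer uses:

```python
import math
import torch

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, d_model) -- token embeddings plus positional encodings.
    # In this simplified self-attention, query, key and value are three
    # identical copies of the same matrix.
    q, k, v = x, x, x
    d_k = x.size(-1)

    scores = q @ k.transpose(0, 1) / math.sqrt(d_k)   # (seq_len, seq_len)

    # Causal mask: every entry above the principal diagonal becomes -inf,
    # so the softmax turns it into 0 and a token cannot attend to its future.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    weights = torch.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ v                                # (seq_len, d_model)

out = causal_self_attention(torch.randn(10, 512))     # 10 tokens in, 10 tokens out
```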

So even if there is some kind of connection between these tokens, the model will not be able to learn it, because we never give this information to the model. And this is how the model becomes causal. The only token that is able to watch all the previous tokens is this "bright" token here.

So the last token, because this token can see all the tokens that come before it. That's why this row has no zeros. Okay, in the formula of the attention, we also have a multiplication with v. So we take the output of the softmax, so the matrix that we saw before with the causal mask applied, and we multiply it with v.

And this way you will understand why we apply the causal mask. So let's review the shapes. This matrix here is a 10 by 10 matrix. And this matrix v is the initial matrix that we built with our vectors. So it's 10 by 512. So it's the value matrix, one of the three identical copies that we made before.

The multiplication between these two matrices will produce an output matrix that is 10 by 512. Let's see how the output works. Okay, so this is a matrix made of rows of vectors. So the first row is a vector of size 512. The second row is a vector of size 512.

The third is also a vector of size 512, so with 512 dimensions here; this one also has dimension one, dimension two, dimension three, up to 512, and so on. The output matrix will also have the same shape, so it will be 10 by 512, which means we have 10 vectors of 512 dimensions.

So dimension one, dimension two, dimension three, up to 512; dimension one, two, three, up to 512, etc., etc. Now let's do this product by hand. To get this first value here of this matrix, so dimension one of the first vector of the attention output matrix here, this value here is the dot product of the first row.

So this row here, all of this row, with the first column of this matrix, so the first dimension of the embedding of each token in our input. But as you can see, because of the causal mask, most of the values in this row are zero, which means that this output value will only be able to watch the first dimension of the first token. This also means that in the output of the attention, the first token will only be able to attend to itself, not to the tokens that come after it.

Let's look at the second, for example, this output here. So the first dimension of the second row of this attention output matrix here. So this value here comes from the second row of the initial matrix. So this one here multiplied by the first column of this matrix. Now we have two values that are non-zero.

So this means that this output here will depend only on the first two tokens, because all the others are zero. And this is how we make the model causal. OK, now we are ready to explore BERT and the architecture behind BERT. BERT's architecture is also based on the encoder of the transformer model.

So we have input embeddings, we have positional encodings (we will see that the positional encodings are actually different), then we have self-attention, normalization, and the feed-forward layer. And then we have this head, this linear head we can see here. We will see later that this changes according to the specific task for which we are using BERT.

BERT was introduced with two pre-trained models: one is BERT-Base and one is BERT-Large. BERT-Base, for example, has 12 encoder layers, which means that this block here, so the gray block you can see here, is repeated 12 times, one after another, and the output of the last layer is fed to this linear layer and then to the softmax.

The hidden size of the feed-forward layer is 3072, so for the feed-forward layer you can see here, which is basically just two linear layers, the size of the hidden features is 3072. And then we have 12 attention heads in the multi-head attention. BERT-Large has these numbers instead. Now, what are the differences between BERT and the vanilla transformer?

The first difference is that the embedding vector is not 512 anymore, but 768 for BERT-Base and 1024 for BERT-Large. From now on, I will always refer to the number 768, so the embedding size of BERT-Base. Another difference is that the positional encodings in the vanilla transformer were computed using the sine and cosine functions we saw before.

But in BERT, these positional embeddings are not pre-computed using fixed functions; they are actually embeddings that are learned during training. And they are of the same size as the embedding vector, of course, because they are summed together. So they have 768 dimensions in BERT-Base and 1024 in BERT-Large.

But these positional embeddings are limited to 512 positions, which means that BERT cannot handle sentences longer than 512 tokens, because we only have 512 vectors to represent positions. And the linear layer head changes according to the application, so this linear layer here. Also, BERT does not use the tokenizer that we have used, the simple one which treats each word as a token, but what's called the WordPiece tokenizer, which also allows sub-word tokens, so each word can become multiple tokens.
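
To summarize the configuration just described, something like the following sketch (the BERT-Large values are the ones reported in the original BERT paper):

```python
bert_configs = {
    "bert-base": {
        "num_encoder_layers": 12,
        "hidden_size": 768,              # vs 512 in the vanilla transformer
        "num_attention_heads": 12,
        "feed_forward_size": 3072,
        "max_position_embeddings": 512,  # learned, so at most 512 tokens per input
        "vocab_size": 30522,             # WordPiece vocabulary, roughly 30,000 tokens
    },
    "bert-large": {
        "num_encoder_layers": 24,
        "hidden_size": 1024,
        "num_attention_heads": 16,
        "feed_forward_size": 4096,
        "max_position_embeddings": 512,
        "vocab_size": 30522,
    },
}
```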

The vocabulary size is roughly 30,000 tokens in BERT-Base and BERT-Large. Ok, let's see the differences between BERT and language models like GPT and LLaMA. In my slide I call these kinds of models the common language models, so the ones commonly known as language models, like GPT and LLaMA.

So unlike them, BERT does not handle special tasks with prompts, but rather it can be specialized on a particular task by means of fine-tuning, and we will see what I mean by this in the next slide. The second difference is that BERT has been trained using the left context and the right context, so we will see also what we mean by this.

BERT is not built specifically for text generation. So for example, you can use LLAMA to generate a big article given a prompt, but you cannot use BERT for this purpose. BERT is useful for other kind of tasks, and we will see which ones. And BERT has not been trained on the next token prediction task.

So the model that we trained initially on the Chinese poems was trained on the next token prediction task, but BERT has not been trained on this particular task. BERT has been trained on the masked language model and the next sentence prediction task, and we will see both of them.

Ok, so let's see how we handle different tasks in GPT or in LLaMA, and how we handle them in BERT. Suppose we want to do question answering. If we want to do it with GPT or with LLaMA, what we do is build a particular prompt, and this is called few-shot prompting, in which we teach the model how to handle a task inside of the prompt, and then, in the last part of the prompt, we let the model come up with the answer by predicting the next tokens.

So for example, if we show the model how to build the answer given a context and a question, then, given a new context and question, the model should be able to come up with an answer that makes sense given the previous example. With BERT, on the other hand, we do not work with prompts like we do with ChatGPT, LLaMA or GPT, but we fine-tune BERT on the specific task we want to work on, and we will see how.

As I said before, language models are models that predict the next token using only the left context of each word, that is the tokens that come to the left side of each word for predicting the next token. This is not the case with BERT. BERT uses the left context and the right context.

So I want to give you some intuition into why we humans may also be using the left and the right context. Now, I am not really a linguist, so my intuition may not be technically valid, but it will help you understand the importance of left and right context in human conversations as well.

So let's start with the left context. The left context in human conversation is used every time we have a phone conversation. For example, the operator's answer is based on the user's input. So the user says something, then the operator will say something, then the user will reply with something based on what the operator said, and then the operator will continue based on the context given by the previous conversation.

This is called using the left context. For the right context, it's more difficult to visualize. For example, imagine there is a kid who just broke his mom's favorite necklace. The kid doesn't want to tell the truth to his mom, so he decides to make up a lie. So instead of saying directly to his mom, "Your favorite necklace has broken", the kid may say, "Mom, I just saw the cat playing in your room and your favorite necklace has broken." Or he may say, "Mom, aliens came through the window with laser guns and your favorite necklace has broken." As you can see, the kid conditioned the lie on what he wants to say next.

So it doesn't even matter which lie the kid tells, because he wants to create some context before arriving at the conclusion, which is, "Your favorite necklace has broken." So he conditions the words he chooses initially on what he wants to say next. That is the definition of making up a lie.

And this could be an intuition into how we humans use the right context, and I hope you can see it too. Now let's talk about the pre-training of the model. For example, LLaMA or GPT have been pre-trained on a large corpus of text, and then we can use them with prompting or with fine-tuning.

BERT has not been pre-trained like LLaMA or GPT using the next token prediction task, but on two specific tasks called the Masked Language Model and the Next Sentence Prediction task. Let's review them. The Masked Language Model is also known as the cloze task, and you may know it from papers or tests that you have done at university.

For example, the teacher gives you a sentence in which one word is missing and you have to fill the empty space with the missing word. And this is how BERT has been trained with the Masked Language Model. Basically, what they did is they took some sentences from the corpus on which they were training BERT.

They choose some tokens randomly and replace these random tokens with a special token called [MASK]. Then they feed this masked input to BERT, and BERT has to come up with the word that was removed initially, one or more words. Let's see how this was done technically. First, we need to understand how BERT uses the left and the right context.

So as you saw before, when we compute the attention we use the formula softmax of query multiplied by the transpose of the keys divided by the square root of dk and then we multiply it by v. This means that we take the query matrix, we multiply it by the transpose of the keys, we do the softmax and this will produce this matrix here that we saw before.

But unlike before, like we did for Language Models, in this case we will not use any mask to cancel out the interactions of words with words that come after it. So for example before we replaced the value here, for example, with minus infinity and also this value with minus infinity and actually all the values above this diagonal with minus infinity because we didn't want the model to learn these interactions.

But with BERT we do not use any mask, which means that each token attends tokens to its left and tokens to its right in a sentence. Ok, let's review the details of the Masked Language Model. So imagine we want to mask this sentence, so Rome is the capital of Italy, which is why it hosts many government buildings.

The pre-training procedure selects 15% of the tokens from this sentence to be masked. If a token is selected, for example the word "capital", then 80% of the time it is replaced with the [MASK] token, becoming this input for BERT, 10% of the time it's replaced with a random token, for example "zebra", and 10% of the time it's not replaced at all, so it remains as in the original.
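
A rough sketch of this corruption procedure in plain Python (a simplified, hypothetical implementation; the real one works on WordPiece token IDs and samples random tokens from the full vocabulary):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """15% of tokens are selected; of those, 80% become [MASK], 10% become a
    random token and 10% stay unchanged. The original token is always the target."""
    masked = list(tokens)
    targets = [None] * len(tokens)          # None = not selected, no loss computed
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                # BERT must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # random token, e.g. "zebra"
            # else: keep the original token unchanged
    return masked, targets

sentence = "rome is the capital of italy which is why it hosts many government buildings".split()
masked, targets = mask_tokens(sentence, vocab=sentence)   # toy vocab, just for the sketch
```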

In any of these three cases, BERT has to predict the word that was masked out, so BERT should output capital for each of these three inputs based on the probability they happen. So during training, what we do, for example, imagine this is our initial sentence in which we masked out the word capital like we saw before, this will result in an input to be fed to BERT of 14 tokens, if we count the tokens here, we feed it to BERT, BERT will produce an output because BERT is a transformer model so 14 tokens of input result in 14 tokens in the output.

What we do is we only check the BERT output at the position that was masked out, so this token here; we compare it with our target, "capital", we compute the loss, and we run backpropagation. So basically, what we want is for this token here to be "capital". Okay, now let's review the next sentence prediction task.

Next sentence prediction task means that we have a text, we choose two sentences randomly, actually we select one sentence randomly, so the sentence A, and then 50% of the time we select its immediately next sentence, so the second line, or 50% of the time we select a random sentence from the rest of the text, in this case this one, the third line.

In our case we selected the first line as sentence A, and the sentence B is the third line, so it's not the immediately following sentence. We feed both of these sentences to BERT, and BERT has to predict if the sentence B comes immediately after sentence A or not. In this case, because sentence B is the third line, and sentence A is the first line, it's not immediately following, so BERT should reply with "not next".

In case we had selected the second line as sentence B, BERT should reply with "is next". We have two problems here: first, how can we encode two sentences to become a single input for BERT, and second, how can BERT tell us whether it is the next sentence or not?

Well, the idea is this: we take the two sentences and encode them as only one sequence, so it becomes one input. The tokens of the two sentences are concatenated together, and we prepend one special token called CLS, then the tokens of the first sentence, so suppose the sentence is "my dog is cute", then we add the token called separator (SEP), then the tokens of the second sentence, so "he likes playing", and then another SEP token.

The problem is, if we feed only this input to BERT, BERT will not be able to understand that this "my" belongs to sentence A and this "likes" belongs to sentence B. As I told you before, when we feed the input to any language model that uses the transformer, we first tokenize it, then we convert it into embedding vectors of size 512 in the case of the vanilla transformer, or 768 in the case of BERT, and then we add another vector that represents the position of the token, so the position embedding. In BERT we have yet another embedding, called the segment embedding: it's another vector that we add to the position embedding and to the token embedding, and it represents the fact that this token belongs to sentence A or to sentence B. And this is how we encode the input for BERT.

Let's see how we train it. We create our input, which is the first line of this poem and the third line of this poem, together with the separator token we can see here and a special token called CLS here, and we also encode the information that all these tokens belong to sentence A and all these tokens belong to sentence B by using the segment embedding we saw here.

Now we feed this input to BERT. BERT will come out with an output, because it's a transformer model, so an input of 20 tokens corresponds to an output of 20 tokens. We take only the first token of the output, the one that corresponds to the token CLS, which stands for classifier, and we feed it to a linear layer with only two output features, one feature indicating "next" and one indicating "not next". We apply the softmax and we compare it with our target: we expect BERT to say "not next", because we fed it the third line as sentence B and not the second line. Then we compute the loss, which is the cross-entropy loss, and we run backpropagation to update the weights. And this is how we train BERT on the next sentence prediction task.
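
As a rough sketch of how these pieces fit together, with toy token IDs (not real WordPiece IDs) and a placeholder standing in for the actual encoder stack:

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 30522, 768, 512

# BERT's input embedding = token embedding + position embedding + segment embedding.
token_emb    = nn.Embedding(vocab_size, d_model)
position_emb = nn.Embedding(max_len, d_model)     # learned, not sinusoidal
segment_emb  = nn.Embedding(2, d_model)           # 0 = sentence A, 1 = sentence B

# Toy IDs for "[CLS] my dog is cute [SEP] he likes playing [SEP]".
input_ids   = torch.tensor([[0, 5, 6, 7, 8, 1, 9, 10, 11, 1]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1,  1,  1]])
positions   = torch.arange(input_ids.size(1)).unsqueeze(0)

x = token_emb(input_ids) + position_emb(positions) + segment_emb(segment_ids)

# encoder_output = bert_encoder(x)  -- the stack of 12 encoder layers, omitted here.
encoder_output = x                                # placeholder for the sketch
cls_output = encoder_output[:, 0]                 # hidden state of the [CLS] token

nsp_head = nn.Linear(d_model, 2)                  # two features: "is next" / "not next"
logits = nsp_head(cls_output)
nsp_target = torch.tensor([1])                    # hypothetical label: 1 = "not next"
loss = nn.functional.cross_entropy(logits, nsp_target)   # softmax + cross-entropy
loss.backward()
```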

Now, this CLS token, let's review how it works. As we saw before, the formula for the attention is the query multiplied by the transpose of the keys, divided by the square root of 768; we apply the softmax and this will produce this attention score matrix we can see here.

Now as we can see the CLS token always interacts with all the other tokens because we didn't apply any causal mask, so we can consider that the CLS token acts as a token that captures the information from all the other tokens because the attention matrix here, we didn't apply any mask before applying the softmax, so all of these attention values will be actually learned by the model, and this is the idea behind this CLS token.

So let's do, for example, the matrix multiplication that we did before, and compute the first row of the output matrix. This matrix here is the input matrix, which is 10 by 768: suppose the input is very simple, so "before my bed lies a pool of moon bright", so 10 tokens of input, and each of them has 768 dimensions because we are talking about BERT, so dimensions 1 up to 768. This will result in an output matrix of 10 by 768, so again dimensions 1 up to 768 for each of the 10 rows.

We are only interested in this output here, which corresponds to the position of the CLS token. Let's see: dimension number 1 of this output token will be the dot product of this vector here, which is made of 10 dimensions, with the first column of this matrix here, which is also made of 10 dimensions because we have 10 tokens. Because none of the values here is 0 (actually this one here is 0 only because I chose random numbers, so suppose it is 0.03 or 0.04, let's say), the CLS output will be able to access the attention scores of all the tokens. So basically, this token here will aggregate the attention scores, so the relationship with all the tokens. The CLS can also be thought of as the CEO of a company, and you are the shareholder: as a shareholder, you don't ask the employees for information, you ask the CEO, and the CEO's job is to talk to every person in the company to get the information necessary to reach the goal. And this is the goal of the CLS: it can be thought of as the aggregator of all the information present inside the sentence, and we use it to classify. That's why it's called the CLS token.

Okay, now let's talk about fine-tuning. As we saw before, BERT does not work with prompts like LLaMA or GPT, so we cannot use zero-shot prompting, few-shot prompting, chain-of-thought or any other prompting technique. With BERT we work with fine-tuning: we take the pre-trained model, and if we want to do text classification or question answering, we fine-tune BERT on our dataset for that task. Let's see how these two tasks work.

Suppose we want to do text classification. Text classification is the task of assigning a label to a piece of text. For example, imagine we are running an internet provider and we receive complaints from our customers; we may want to classify requests coming from users as hardware problems, software problems or billing problems. For example, this complaint here is definitely a hardware problem, this complaint here is definitely a software problem and this one is definitely a billing problem. We want to automatically classify these requests that we keep receiving from customers. How do we do that?

Well, we take our request, we feed it to BERT, and BERT should tell us which one of the three options best represents this particular request. How can BERT tell us one of these three options, and how can we feed our request to BERT? Let's see. When we train BERT for text classification, we create our input by prepending the classifier token, so the CLS token, to the request text. We feed it to BERT, and BERT will come up with an output, so 16 input tokens correspond to 16 output tokens. We only care about the first one, which is the one corresponding to the CLS token. We send this output to a linear layer with three output features, because we have three possible classes: one is software, one is hardware, one is billing. Then we apply the softmax, we compare it with what we expect BERT to learn about this particular request, that is, that this request is a hardware problem, then we calculate the loss, which is the cross-entropy loss, and finally we run backpropagation to update the weights. And this is how we fine-tune BERT on text classification.
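
A minimal sketch of such a classification head in PyTorch, where `encoder_output` stands in for the output of the pre-trained BERT encoder (here just random numbers, to keep the example self-contained):

```python
import torch
import torch.nn as nn

num_classes = 3                                    # software, hardware, billing
classifier = nn.Linear(768, num_classes)           # head on top of BERT-Base (768 dims)

# encoder_output: (batch, seq_len, 768), normally the output of the BERT encoder.
encoder_output = torch.randn(1, 16, 768)           # placeholder for the sketch
cls_output = encoder_output[:, 0]                  # only the [CLS] position is used

logits = classifier(cls_output)                    # (1, 3)
target = torch.tensor([1])                         # class index 1 = "hardware" (arbitrary mapping)
loss = nn.functional.cross_entropy(logits, target) # softmax + cross-entropy
loss.backward()                                    # backpropagation updates the weights
```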

Our next task is question answering. Question answering basically means this: we have a context from which we need to extract the answer to a particular question. For example, the context is "Shanghai is a city in China; it is also a financial center, its fashion capital and an industrial city", and the question is "What is the fashion capital of China?". The model should highlight the word, or tell us in which part of the context we can find the answer, so BERT should be able to tell us where the answer starts and where the answer ends in the context.

But we have two problems: first, we need to find a way for BERT to understand which part of the input is the context and which one is the question; second, we need to find a way for BERT to tell us where the answer starts and where the answer ends. Let's see both of these problems.

The first problem can be solved easily using the segment embedding we saw before: we concatenate the question and the context together as a single input, with the separator token in the middle, like we saw before for the next sentence prediction task. The question will be encoded as sentence A, while the context will be encoded as sentence B, so this problem is solved.
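
To make the second part concrete before walking through it, here is a minimal sketch of the kind of start/end span-prediction head that is used, again with a random placeholder standing in for BERT's output over the 27 input tokens:

```python
import torch
import torch.nn as nn

# encoder_output: (batch, seq_len, 768) -- normally BERT's output for
# "[CLS] what is the fashion capital of china [SEP] shanghai is a city in china ... [SEP]"
encoder_output = torch.randn(1, 27, 768)           # placeholder for the sketch

qa_head = nn.Linear(768, 2)                        # feature 0: start logit, feature 1: end logit
logits = qa_head(encoder_output)                   # (1, 27, 2)
start_logits, end_logits = logits[..., 0], logits[..., 1]

# Target: the answer "Shanghai" starts and ends at the same token position (10 here).
start_target = torch.tensor([10])
end_target   = torch.tensor([10])

loss = (nn.functional.cross_entropy(start_logits, start_target) +
        nn.functional.cross_entropy(end_logits, end_target))
loss.backward()                                    # backpropagation for fine-tuning
```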

How do we get the answer from BERT? Well, let's see. First we prepare the input for BERT: we prepend the CLS token, then "what is the fashion capital of China", separator, "Shanghai is a city in China", etc. So this is the context, which has been encoded as sentence B, and the first part has been encoded as sentence A. We feed it to BERT, and BERT will come up with an output of 27 tokens, because the input is made up of 27 tokens. We also know which tokens correspond to which sentence, so which correspond to sentence A and which correspond to sentence B, because we give it as input. We apply a linear layer with two output features: one that indicates if a particular token is the start token, and another that indicates if the token is the end token. We know where the answer is, because we know the answer is the word Shanghai, so the start should be token 10 and the end should also be token 10. Then we calculate the loss based on our target and the output of this linear layer, and we run backpropagation. And this is how we fine-tune BERT for question answering.

And this is it, guys. I hope that you liked my video. I used a very unconventional way of describing BERT: I started from language models, I introduced the concept of how language models work and then I introduced BERT, because I wanted to create a comparison of how BERT works versus how other language models work, so that you can appreciate the qualities and the weaknesses of both. BERT is actually not such a recent model; it was introduced in 2018, if I remember correctly, so it is quite aged but still very relevant for a lot of tasks.

I hope that you will be coming again to my channel for more content, so please subscribe and share this video if you liked it. If you have any questions, please write them in the comments. I am also very active on LinkedIn if you want to add me. If you want some particular video review or model review, write it in the comments. I hope that in my next video I will be able to code BERT from scratch using PyTorch, so we can also put into practice all the knowledge that we acquired in today's video. Thank you again for coming to my channel, guys, and have a nice day!