BERT explained: Training, Inference, BERT vs GPT/LLaMA, Fine tuning, [CLS] token
Chapters
0:00 Introduction
2:00 Language Models
3:10 Training (Language Models)
7:23 Inference (Language Models)
9:15 Transformer architecture (Encoder)
10:28 Input Embeddings
14:17 Positional Encoding
17:14 Self-Attention and causal mask
29:14 BERT (overview)
32:08 BERT vs GPT/LLaMA
34:25 Left context and right context
36:36 BERT pre-training
37:05 Masked Language Model
45:01 [CLS] token
48:26 BERT fine-tuning
49:00 Text classification
50:50 Question answering
00:00:02.520 |
In this video I will be explaining BERT from scratch. 00:00:07.500 |
I will explain all the building blocks that make up BERT along with all the background knowledge you need. 00:00:15.780 |
So we will start with a little review of what are language models, how they are trained, 00:00:20.360 |
how we inference language models, then we will see the transformer architecture, at 00:00:24.400 |
least the one used by language models, so the encoder part. 00:00:27.680 |
We will review embedding vectors, positional encoding, self-attention and causal mask. 00:00:34.640 |
So I will first give you the background knowledge you need before you can start learning BERT, 00:00:42.600 |
so you understand each building block of BERT. 00:00:45.640 |
And then we will see concepts like what is the left context and what is the right context. 00:00:51.960 |
We will see the two tasks on which BERT has been pre-trained, so the masked language model and the next sentence prediction task. 00:00:58.640 |
And finally, we will also see what is fine tuning and how do we fine tune BERT with the 00:01:05.280 |
text classification task and the question answering task. 00:01:09.200 |
What do I expect you to know before watching this video? 00:01:12.240 |
Well, for sure, I hope that you are familiar with the transformer model. 00:01:17.100 |
For example, if you have watched my previous video on the transformer model, that would 00:01:20.400 |
be great because even if I will review some of the concepts of the transformer, I will not review all of them. 00:01:28.480 |
So for example, I will not touch concepts like the cross attention or the multi-head 00:01:32.680 |
attention or the normalization, the feed-forward layer, et cetera, et cetera, because I will 00:01:38.320 |
also only give you enough background to understand BERT, so it's not a video on the transformer model. 00:01:44.160 |
So please, if you are not familiar with the transformer, please go watch my previous video 00:01:47.680 |
on the transformer model, and then you can watch this video if you want to fully understand BERT. 00:01:53.440 |
So let's start our journey with language models. 00:01:59.000 |
Well, a language model is a probabilistic model that assigns probabilities to sequences of words. 00:02:05.360 |
In practice, a language model allows us to compute the following probability, which is 00:02:10.280 |
the probability of the word, for example, China, following the sentence, "Shanghai is a city in". 00:02:18.780 |
So what is the probability that the word China comes next in the sentence, "Shanghai is a city in"? 00:02:25.520 |
This is the kind of probability that we model using language models, which are neural networks trained on a large corpus of text. 00:02:34.160 |
When this corpus of text is very, very large, we also call them large language models. 00:02:40.120 |
There are many examples of large language models. 00:02:45.200 |
They are also called foundation models because they have been pre-trained on a very big corpus 00:02:51.120 |
of text, for example, the entire Wikipedia or billions of pages from the internet. 00:02:56.620 |
And then we can use them with prompting or with fine tuning. 00:03:04.240 |
Okay, let's review how we train a large language model or a language model in general, actually. 00:03:11.160 |
So the training of a large language model requires that we have a corpus, so a piece 00:03:16.800 |
of text, which could be the entire Wikipedia. 00:03:21.720 |
It could be just one book or it could be anything. 00:03:24.240 |
In my example, imagine we want to train a language model on a famous Chinese poem. 00:03:34.320 |
This is a very famous poem from Li Bai. 00:03:36.920 |
It's one of the first poems that you learn if you want to study Chinese literature or the Chinese language. 00:03:43.320 |
So we will concentrate only on the following line. 00:03:51.160 |
Let's see how to train the large language model. 00:03:53.400 |
Well, the first thing we do is we create a sequence from the line that we want to teach to the model. 00:04:02.680 |
We create a sentence to which we prepend one token called start of sentence. 00:04:09.000 |
This sentence or input, which is made up of tokens, in this case, in our very simple case, 00:04:16.360 |
we can consider that each word is a token, but this is not always the case because depending 00:04:20.600 |
on the tokenizer, each word may be split into multiple tokens. 00:04:25.840 |
But suppose for simplicity that our tokenizer always takes one word as one token. 00:04:31.520 |
So we feed this input sequence to our neural network, which is modeling the language model. 00:04:37.000 |
Usually it's the encoder part of the transformer. 00:04:39.840 |
So only this part here, along with the linear and the softmax, for example, like LLaMA. 00:04:47.160 |
Then we, this transformer encoder will output a sequence of tokens. 00:04:51.600 |
So transformer is a sequence to sequence model, means that if you give it a sequence of 10 00:04:56.960 |
tokens, it will output a sequence of 10 tokens. 00:05:02.280 |
When we train a model, we have an input and a target. 00:05:06.560 |
The target is what we want the model to output. 00:05:11.340 |
So we want the model to output the same sentence, without prepending anything, but with an end of sentence token appended. 00:05:22.960 |
So it's the same sentence as before, but instead of having a token at the beginning, it has one at the end. 00:05:32.400 |
Now let's review why do we need this start of sentence token and end of sentence token? 00:05:37.360 |
Because as I said before, the neural network that we are using, which is a transformer, is a sequence-to-sequence model. 00:05:43.280 |
It means that if we have an input of N tokens as input, it will produce N tokens as output. 00:05:49.600 |
So if, for example, we give this neural network only the first token, so the start of sentence, it will produce only one token as output. 00:05:58.520 |
So in case it has already been trained, it should output the first token of the target. 00:06:09.960 |
If we input the first two tokens, for example "start of sentence, before", it should output 00:06:15.680 |
the first two tokens of the target, "before my". If we input the first three tokens, "start 00:06:23.880 |
of sentence, before, my", it should output the first three tokens of the target, 00:06:28.560 |
so "before my bed". As you can see, every time we give it an input, the last token of the 00:06:33.800 |
output, in case it has already been trained and it's matching the target, is the next token of the sentence. 00:06:42.160 |
So for example, if we give only the first two tokens, "start of sentence, before", the model 00:06:47.840 |
outputs "before my", so it will give us the next token after "before". 00:06:52.840 |
If we give it "start of sentence, before, my", it will output "before my bed". 00:06:58.220 |
So the next token after the word "my", so "bed", et cetera, et cetera. 00:07:02.520 |
So every time we give the model some tokens, the model will return the next token as the last token of its output. 00:07:10.440 |
And this is how we train a large language model. 00:07:13.400 |
So once we have our target tokens and the output, we compute the loss, which is the cross entropy loss. 00:07:22.800 |
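To make this concrete, here is a minimal PyTorch sketch of this training step. The tiny vocabulary, the TinyLM model and its sizes are made up for illustration (positional encodings are omitted for brevity); the point is only the shifted input/target pair and the cross entropy loss computed over all positions.

```python
import torch
import torch.nn as nn

# Toy vocabulary for the poem line; the IDs are illustrative only.
vocab = {"[SOS]": 0, "[EOS]": 1, "before": 2, "my": 3, "bed": 4,
         "lies": 5, "a": 6, "pool": 7, "of": 8, "moon": 9, "bright": 10}
V = len(vocab)

# Input: [SOS] + sentence; target: sentence + [EOS], i.e. the same tokens shifted by one position.
input_ids  = torch.tensor([[0, 2, 3, 4, 5, 6, 7, 8, 9, 10]])   # (batch=1, seq=10)
target_ids = torch.tensor([[2, 3, 4, 5, 6, 7, 8, 9, 10, 1]])   # (batch=1, seq=10)

# A stand-in language model: embedding -> one transformer encoder layer with a causal mask -> linear head.
class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)            # positional encodings omitted for brevity
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        x = self.emb(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))   # causal mask
        return self.head(self.layer(x, src_mask=mask))                       # (batch, seq, vocab)

model = TinyLM(V)
logits = model(input_ids)                                                    # (1, 10, V)
loss = nn.functional.cross_entropy(logits.view(-1, V), target_ids.view(-1))  # cross entropy loss
loss.backward()                                                              # gradients for an optimizer step
```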
Now let's review how we inference from a language model. 00:07:27.080 |
So imagine you are a student who had to memorize Li Bai's poem, but you only remember the first two words. 00:07:36.360 |
So you only remember the first two words of the first line of the poem. 00:07:42.020 |
What you could do, imagine you already have a language model that has been trained on this poem. 00:07:46.500 |
You could ask the language model to write the rest of the poem for you, but of course 00:07:50.580 |
you need to give the language model some input on what you want. 00:07:55.920 |
So you tell the language model the first two tokens and the model will come up with the rest. 00:08:05.580 |
Let's review the poem again, which is before my bed lies a pool of moon bright. 00:08:12.600 |
We give the model our first two tokens, prepending the start of sentence token, 00:08:12.600 |
and we feed them to the neural network, which has already been pre-trained, and the model will output the next token, bed. 00:08:27.040 |
We take this last token, we append it to the input and we give it back to the model. 00:08:33.200 |
So now we give it before my bed and we feed it again to the model and the model will output the next token. 00:08:41.320 |
We take this last token lies, we append it again to the input and we feed it again to 00:08:46.780 |
the transformer model and the model will output the next token of the line. 00:08:52.280 |
And then we keep doing like this until we arrive to the end of the line or the end of 00:08:57.760 |
the poem, which is indicated by the end of sentence token, depending on how we trained the model. 00:09:05.000 |
In our case, suppose we only trained it on the first line of this poem. 00:09:09.480 |
And this is how we inference from a language model. 00:09:15.240 |
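As a sketch, this greedy inference loop reuses the hypothetical model and vocab from the training example above: at every step we feed the tokens generated so far, take the prediction at the last position, append it, and stop at the end of sentence token.

```python
# Greedy decoding sketch, reusing the hypothetical `model` and `vocab` defined in the training example.
import torch

id2tok = {i: t for t, i in vocab.items()}

tokens = [vocab["[SOS]"], vocab["before"], vocab["my"]]   # the two words the student remembers
with torch.no_grad():
    for _ in range(20):                                   # safety limit on the generated length
        logits = model(torch.tensor([tokens]))            # (1, len(tokens), V)
        next_id = int(logits[0, -1].argmax())             # the last position predicts the next token
        if next_id == vocab["[EOS]"]:                     # stop at the end of sentence token
            break
        tokens.append(next_id)                            # append it and feed the sequence back in

print(" ".join(id2tok[i] for i in tokens[1:]))            # drop [SOS] when printing
```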
Now let's understand the architecture of the transformer model, at least the part that 00:09:20.000 |
we are interested in, which is the encoder, because this is the architecture that is also 00:09:26.400 |
So we need to understand the building blocks. 00:09:28.500 |
Even if you have already watched my video on the transformer model, let's review these building blocks. 00:09:35.200 |
So as I said before, this is the vanilla transformer. 00:09:40.680 |
So this is the transformer model as presented in the original paper. 00:09:47.400 |
Most language models are actually modeled just using the encoder side or the decoder side. 00:09:55.080 |
And they have the last linear layer to project back the tokens in the vocabulary. 00:10:00.880 |
Let's review all these building blocks of this model. 00:10:05.720 |
So I will not review the normalization, the linear layer, or the feed forward. 00:10:09.760 |
Because this, I hope you're already familiar with. 00:10:12.120 |
What I am actually interested in is the input embeddings, the positional encodings, and 00:10:15.640 |
the multi-head attention, for which I will actually only cover the case of the single head. 00:10:23.360 |
So if you want to have more information, please watch my previous video on the transformer model. 00:10:27.600 |
Now, let's review the embedding vectors, what they are and how do we use them. 00:10:34.160 |
Usually when we train a language model or we inference a language model, we use a prompt, which is a piece of text. 00:10:43.160 |
The first thing we do is we split this text into tokens. 00:10:47.400 |
In our simple case, we will do a very simple tokenization in which we split each word. 00:10:55.560 |
This is actually not always the case with language models. 00:10:58.660 |
For example, in LLaMA or in other language models, we use the BPE tokenizer. 00:11:03.940 |
In BERT, we will see, we use the word piece tokenizer. 00:11:10.060 |
In our simple case, we just pretend that each word is actually a token, with some tokens being special tokens. 00:11:17.520 |
For example, the start of sentence token is a special token that exists only virtually. 00:11:25.920 |
The first thing we do is we do this tokenization. 00:11:30.320 |
We map each token into its position in the vocabulary. 00:11:34.120 |
So imagine you have a very big corpus of text. 00:11:39.440 |
Each word will occupy a position in the vocabulary. 00:11:42.960 |
So we map each word to its position in the vocabulary. 00:11:46.440 |
In this case, each token into its position in the vocabulary. 00:11:51.000 |
Then we map each of these numbers, which are the positions of the tokens in the vocabulary, into an embedding vector of size 512. 00:12:01.120 |
Now this size of 512 is the one used in the vanilla transformer. 00:12:07.400 |
We will see that in BERT the size is 768, if I'm not mistaken. 00:12:13.000 |
But for now, I will only always refer to the configuration of the vanilla transformer. 00:12:18.360 |
So the transformer as presented in the paper "Attention Is All You Need". 00:12:22.640 |
So each of these input IDs, which is the position of the token in the vocabulary, is projected into an embedding. 00:12:29.560 |
This embedding is a vector of size 512 that captures the meaning of each token, in this case of each word. 00:12:38.560 |
But why do we use embedding vectors to capture the meaning of each token? 00:12:47.000 |
For example, given the words "cherry", "digital" and "information", the idea is this. 00:12:53.560 |
Imagine we live in a very simple world in which the embedding vector is not made up of 512 dimensions, but only 2. 00:13:01.080 |
So we can project these vectors on the XY plane. 00:13:06.120 |
If we project them, and if the embedding vectors have been trained correctly, we will see that 00:13:11.040 |
the words with similar meaning will point to the same direction in space, while words 00:13:16.720 |
with different meaning will point to different directions in space. 00:13:20.160 |
For example, the word "digital" and "information", because they capture the same kind of semantic 00:13:27.040 |
meaning "information", they will point to similar directions. 00:13:32.040 |
And we can measure this similarity by measuring the angle between them. 00:13:36.080 |
So for example, the angle between "digital" and "information" is very small, you can see 00:13:39.840 |
here, while the angle between "cherry" and "digital" is quite big, because they represent 00:13:44.880 |
different semantic groups, so they have different meaning. 00:13:48.640 |
Imagine there is also another word called "tomato". 00:13:50.720 |
We expect the word "tomato" to point to the vertical direction, very similar to the "cherry", 00:13:57.560 |
So that the angle between "cherry" and "tomato" is very small, for example. 00:14:02.280 |
And we measure this angle between vectors using the cosine similarity, which is based on the dot product. 00:14:09.880 |
And we will see that this dot product is very important, because we will use it in the attention mechanism. 00:14:17.060 |
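Here is a small sketch of both ideas, using an untrained PyTorch embedding table (the three-word vocabulary and the size 512 are just for illustration): each token ID is looked up in the table, and the cosine similarity between two embedding vectors is their dot product divided by the product of their norms.

```python
import torch
import torch.nn as nn

# Hypothetical 3-word vocabulary and an (untrained) embedding table with 512 dimensions per token.
vocab = {"cherry": 0, "digital": 1, "information": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

cherry      = embedding(torch.tensor(vocab["cherry"]))
digital     = embedding(torch.tensor(vocab["digital"]))
information = embedding(torch.tensor(vocab["information"]))

# Cosine similarity = dot product of the two vectors divided by the product of their norms.
cos = nn.functional.cosine_similarity(digital, information, dim=0)
print(float(cos))   # would be close to 1 for words with similar meaning once the embeddings are trained
```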
Ok, now let's review the positional encodings as presented in the original paper of the transformer. 00:14:26.920 |
We need to give some positional information to our model, because now we only gave some 00:14:32.120 |
vectors that represent the meaning of the word. 00:14:36.520 |
But we also need to tell the model that this particular word is in the position 1 in the 00:14:41.160 |
sentence, and this particular word is in position 2 in the sentence, etc. etc. 00:14:45.120 |
And this is the job of the positional encodings. 00:14:49.360 |
So we start with our original sentence, which is the first line of the Chinese poem we saw 00:14:55.020 |
We convert it into embedding vectors of size 512. 00:14:59.200 |
Then each of these embedding vectors, we add another vector. 00:15:03.240 |
This is called the positional encoding or positional embedding, I saw both names used. 00:15:09.160 |
And this position embedding actually indicates the position of this particular token inside 00:15:16.520 |
And this vector here indicates the position 1 of this token. 00:15:20.800 |
And this indicates the position 2 and the position 3 and position 4. 00:15:24.360 |
Now, actually, this position embedding, at least in the vanilla transformer, they are 00:15:29.000 |
computed once and they are reused for every sentence during training and inference. 00:15:34.000 |
So they are not specific for this particular token, but they are only specific for this 00:15:39.000 |
particular position, which means that every token in the position 0 or every token in 00:15:44.920 |
position 1 will receive this particular vector added to it that represents the position number 00:15:51.960 |
So the result of this addition is these vectors here, which will become the input of the encoder 00:16:01.600 |
How do we compute these positional encodings, at least as presented in the original transformer 00:16:07.200 |
Well, suppose we have a sentence made up of three words or three tokens. 00:16:12.800 |
We have seen these formulas before from the paper "Attention is all you need". 00:16:17.280 |
We create a vector of size 512 and for the even dimensions of this vector, we use the 00:16:23.080 |
first formula and for the odd dimensions, we use the second formula, in which the arguments 00:16:28.640 |
of these two formulas are the following: the first one is pos, which indicates the position of the token inside the sentence. 00:16:38.200 |
And 2i indicates the dimension of the vector to which we are applying the formula. 00:16:43.920 |
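Written out, the two formulas from "Attention Is All You Need" are the following, where pos is the position of the token and 2i, 2i+1 are the even and odd dimensions of the vector (with d_model = 512 in the vanilla transformer):

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
```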
And we can compute it also for the second vector. 00:16:48.040 |
If we have another sentence that is different, for example, I love you, which is also made 00:16:52.080 |
up of three tokens, we will reuse the same vectors as the other sentence. 00:16:58.200 |
So this particular vector here is associated with the position 0, not with the token before. 00:17:04.400 |
So if we have another token, we will reuse the same vector for the position 0. 00:17:11.920 |
Ok, now let's review what is the self-attention that we use in language models. 00:17:19.480 |
Because language models need to find a way to relate tokens with each other so that they 00:17:24.360 |
can compute some kind of interactions between tokens. 00:17:29.120 |
So for example, tokens have a meaning not by themselves, but by the way they are present 00:17:35.240 |
inside of the sentence and their relationship with other tokens inside of the sentence. 00:17:40.840 |
And this is the job of the self-attention, which is done here in the multi-head attention. 00:17:45.420 |
We will not see the multi-head attention, we will see the single head attention. 00:17:50.920 |
Now let's build the input for this self-attention because for now we have worked with independent 00:17:59.320 |
So we took this token and we converted it into a vector. 00:18:06.280 |
We first converted it into embedding, then we added the positional encoding to capture 00:18:12.920 |
But actually I lied to you by telling you that we work independently. 00:18:17.520 |
Actually when we code the transformer, we always work with the matrix form. 00:18:21.720 |
So all these tokens, all these vectors are never alone. 00:18:26.000 |
They are always in a big matrix that contains all of them. 00:18:31.240 |
But before it was not easy to work with a big matrix directly because it's not easy 00:18:38.360 |
So we combine all these vectors in one big matrix here, in which each row is one of these vectors. 00:18:47.080 |
So for example, the first vector here becomes the first row. 00:18:50.560 |
The second vector here becomes the second row. 00:18:54.060 |
The third vector here becomes the third row, et cetera, et cetera, et cetera. 00:18:58.600 |
The shape of this matrix is 10 by 512, because we have 10 tokens and each token is represented by a vector of 512 dimensions. 00:19:09.560 |
We take this matrix and we make three copies of it. 00:19:22.040 |
As you can see, the values in these matrices are all the same because they are three identical copies, which we call the query, the key and the value. 00:19:29.480 |
Because this is the self-attention mechanism, which means that we relate tokens to each 00:19:34.720 |
other, tokens that belong to the same sentence with each other. 00:19:39.600 |
If we relate tokens of two different sentences or from two different languages, for example, 00:19:44.840 |
when we are doing a language translation, in that case, we will talk about cross-attention. 00:19:51.880 |
And self-attention is the kind of attention that is used in language models. 00:19:55.960 |
OK, the self-attention mechanism, as you have seen, probably in the paper, works with this 00:20:02.320 |
formula here, which is the attention is calculated as the softmax of the query multiplied by 00:20:07.680 |
the transpose of the keys divided by the square root of dk, then multiplied by V. 00:20:16.080 |
The dk here actually represents the dimension of the vector of each head in case of the 00:20:22.000 |
multi-head attention, but because we are actually simplifying our scenario and working with a single head, dk here is the full embedding dimension. 00:20:30.840 |
So that is the size of the embedding vector that we created before, which is 512. 00:20:38.200 |
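Written out, the formula from the paper is the following; in our single-head simplification d_k is just the embedding size, 512:

```latex
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V
```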
So we take our matrix, the one we built before, so the query, so 10 by 512. 00:20:43.600 |
We multiply it by the transpose of the keys, which becomes a 512 by 10. 00:20:48.960 |
So they are basically the identical matrix, but one is transposed and one is not. 00:20:58.280 |
We apply the softmax and this will produce the following matrix we can see here, in which 00:21:04.040 |
each value is the softmax of the dot product of the embedding of one token with the embedding of another token. 00:21:15.480 |
For example, let's visualize with some vectors. 00:21:20.220 |
So we built the matrix before, which is made up of 10 rows because each row is a token 00:21:29.460 |
and each row contains 512 numbers because it's a vector of 512 dimensions. 00:21:34.380 |
So the dimension one up to 512, then the dimension one up to 512 and we have 10 of them. 00:21:46.120 |
So we will have not 10 rows, but we will have 10 columns of vectors, each with dimensions one up to 512. 00:21:55.720 |
Then we have another column vector here with the dimension one and then 512, etc, etc, etc. 00:22:06.080 |
So this value here is the dot product of the first row of the first matrix with the first 00:22:13.440 |
column of the second matrix, which means the embedding of the first token, which if you remember 00:22:19.040 |
is the start of sentence, with the embedding of the token start of sentence, and it's this value here. 00:22:28.360 |
Then this value here, it's the dot product of the embedding of the start of sentence 00:22:33.960 |
with the second token, which is this one here, the token "before". 00:22:43.840 |
The softmax basically changes the values in this matrix in such a way that each row sums up to one. 00:22:49.560 |
So each row in this matrix, this row for example here, sums up to one, and also this row here. 00:23:00.560 |
As you can see here, this word start of sentence is able to relate to the word "before". 00:23:11.000 |
And this is not what we want, because as I said before, our goal is to build a language model. 00:23:20.260 |
That is, a language model is a probabilistic model that assigns probabilities to sequences of words. 00:23:25.760 |
That is, we want to calculate the probability of the word China being the next word in the sentence "Shanghai is a city in". 00:23:34.760 |
That is, we want to condition the word China only on the words that come before it. 00:23:42.960 |
So our model should only be able to watch this part of the sentence to predict the next token, but this is not what happens here. 00:23:55.360 |
Because we are able to relate tokens that also come in the future with tokens that come before them. 00:24:01.920 |
So for example, the word SOS is being related with the token "before". 00:24:06.600 |
And the token SOS is also being related with the token "my", even if the token "my" comes after it. 00:24:13.040 |
So what we do, we basically need to introduce the causal mask. 00:24:23.720 |
We take the matrix that we saw before with all the attention scores and all the interactions. 00:24:30.660 |
All the interactions of words with the words that come to their right are replaced with minus infinity. 00:24:38.120 |
For example, start of sentence should not be able to relate to the word "before". 00:24:44.160 |
So we replace the interaction with minus infinity before we apply the softmax. 00:24:50.120 |
And then also, for example, the word "before" should not be able to watch the words "my", "bed" and so on. 00:24:56.140 |
So all these interactions are also replaced with minus infinity. 00:24:58.900 |
So basically all the values above the principal diagonal that you can see here are replaced with minus infinity. 00:25:08.760 |
And if you remember the formula for the softmax, you can see here that the numerator is 00:25:13.240 |
e to the power of z i, which is the item to which you are applying the softmax. 00:25:20.140 |
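For reference, this is the softmax formula being described:

```latex
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
```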
And e to the power of minus infinity will become zero. 00:25:23.080 |
So basically we replace them with minus infinity so that when we apply the softmax, they will become zero. 00:25:32.060 |
This way the model will not have access to any information of interactions between the 00:25:38.480 |
word start of sentence and all the tokens that come after it because we replace them 00:25:45.260 |
So even if there is some kind of connection between these tokens, the model will not be able to 00:25:49.140 |
learn it because we never give this information to the model. 00:25:55.040 |
The only token that is able to watch all the previous tokens is this "bright" token here. 00:25:59.580 |
So the last token, because this token can see all the tokens that come before it. 00:26:08.220 |
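Here is a minimal PyTorch sketch of the causal mask, with random numbers standing in for the query-times-transposed-keys scores: everything above the principal diagonal is set to minus infinity before the softmax, so those attention weights come out as exactly zero.

```python
import torch

seq_len = 10
scores = torch.randn(seq_len, seq_len)                     # stand-in for Q @ K.T / sqrt(d_k)

# Causal mask: positions above the principal diagonal are set to -inf before the softmax.
causal_mask = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)                    # masked entries become exactly 0
print(weights[0])                                          # the first token attends only to itself
```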
Okay, in the formula of the attention, we also have a multiplication with v. 00:26:16.420 |
So the matrix that we saw before, we apply the causal mask, and then we multiply with 00:26:21.860 |
And this way you will understand why we apply the causal mask. 00:26:31.300 |
And this matrix v is the initial matrix that we built with our vectors. 00:26:38.620 |
So it's the value matrix, one of the three identical copies that we made before. 00:26:43.260 |
The multiplication between these two matrices will produce an output matrix that is 10 by 512. 00:26:52.260 |
Okay, so this is a matrix made of rows of vectors. 00:27:07.800 |
So with 512 dimensions here, this one also have dimension one dimension two dimension 00:27:14.180 |
three, 512 dimension one dimension two dimension three, 512, etc, etc. 00:27:21.420 |
The output matrix will also have the same shape. 00:27:24.220 |
So it will be 10 by 512, which means we have 10 vectors of 512 dimensions each. 00:27:29.860 |
So the dimension one dimension two dimension three, up to 512 dimension one, two, three, 00:27:45.220 |
To get this first value here of this matrix, so the dimension one of the first vector of 00:27:50.940 |
the attention output matrix here, this value here is the dot product of the first row. 00:27:58.380 |
So this row here, all this row with the first column of this matrix. 00:28:06.020 |
So the first dimension of the embedding of each token in our input. 00:28:12.220 |
But as you can see, because of the causal mask, most of the values in this matrix are 00:28:17.100 |
zero, as you can see here. This means that the output value here will only be able 00:28:23.800 |
to watch the first dimension of the first token, which also means that in the output 00:28:28.860 |
of the attention, the first token will only be able to attend to itself, as you can see here. 00:28:39.140 |
Let's look at the second, for example, this output here. 00:28:42.280 |
So the first dimension of the second row of this matrix attention output here. 00:28:48.940 |
So this value here comes from the second row of the initial matrix. 00:28:53.200 |
So this one here multiplied by the first column of this matrix. 00:29:02.100 |
So this means that this output here will depend only on the first two tokens, because all the other values in that row are zero. 00:29:14.340 |
OK, now we are ready to explore BERT and the architecture behind BERT. 00:29:20.620 |
So BERT's architecture is also using the encoder of the transformer model. 00:29:26.180 |
So we have input embeddings, we have positional encodings, and we will see that the positional encodings in BERT are a little different. 00:29:32.500 |
Then we have self-attention, we have normalization, we have feedforward. 00:29:36.540 |
And then we have this head, this linear head we can see here. 00:29:40.100 |
And we will see later that this changes according to the specific task for which we fine-tune BERT. 00:29:46.740 |
BERT was introduced with two pre-trained models: BERT-Base and BERT-Large. 00:29:52.340 |
The BERT-Base, for example, has 12 encoder layers, which means that this block here, 00:29:57.180 |
so the gray block you can see here, is repeated 12 times, one after another, and the output 00:30:02.620 |
of the last layer is fed to this linear layer and then to the softmax. 00:30:08.100 |
The hidden size of the feedforward layer is 3072, so the feedforward layer you 00:30:13.580 |
can see here, which is basically just two linear layers, has 3072 hidden features. 00:30:20.660 |
And then we have 12 attention heads in the multi-head attention. 00:30:26.420 |
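For reference, the two published configurations can be summarized like this (the BERT-Large numbers are taken from the original paper; parameter counts are approximate):

```python
# Configurations from the BERT paper (Devlin et al., 2018); parameter counts are approximate.
bert_configs = {
    "bert-base":  {"layers": 12, "hidden": 768,  "ffn_hidden": 3072, "heads": 12, "params": "~110M"},
    "bert-large": {"layers": 24, "hidden": 1024, "ffn_hidden": 4096, "heads": 16, "params": "~340M"},
}
```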
Now what are the differences between BERT and the vanilla transformer? 00:30:30.320 |
The first difference is that the embedding vector is not 512 anymore, but it's 768 for BERT-Base and 1024 for BERT-Large. 00:30:41.540 |
From now on, I will always refer to the number 768, so the size of the embedding vector in BERT-Base. 00:30:50.420 |
Another difference is that the positional encoding in the vanilla transformer were computed 00:30:54.900 |
using the sine and the cosine function we saw before. 00:30:58.560 |
But in BERT, these positional embeddings are not fixed and pre-computed using fixed functions, 00:31:05.500 |
but they are actually embeddings that are learned during training. 00:31:09.480 |
And they are of the same size of the embedding vector, of course, because they are summed 00:31:14.900 |
So they have 768 dimensions in BERT-Base and 1024 in BERT-Large. 00:31:21.600 |
But these positional embeddings are limited to 512 positions, which means that BERT cannot 00:31:29.920 |
handle sentences longer than 512 tokens, because we only have 512 vectors to represent positions. 00:31:40.960 |
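A minimal sketch of what a learned positional embedding looks like in PyTorch, assuming BERT-Base sizes: it is just an embedding table with one trainable 768-dimensional vector per position, and only 512 positions exist.

```python
import torch
import torch.nn as nn

# BERT-Base learns its positional embeddings instead of computing them with sin and cos.
max_positions, hidden = 512, 768
position_embeddings = nn.Embedding(max_positions, hidden)   # trained together with the rest of the model

seq_len = 20
position_ids = torch.arange(seq_len)                        # 0, 1, ..., 19
pos_vectors = position_embeddings(position_ids)             # (20, 768), added to the token embeddings
# Inputs longer than 512 tokens cannot be represented: there is no learned vector for position 512.
```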
And the linear layer head changes according to the application, so this linear layer here. 00:31:47.200 |
So we saw before that BERT uses not the tokenizer that we have used, the simple one, which only 00:31:53.100 |
treats each word as a token, but it uses what's called the word piece tokenizer, which also 00:31:57.860 |
allows sub-word tokens, so each word can become multiple tokens. 00:32:02.440 |
The vocabulary size is roughly 30,000 tokens in the BERT-Base and BERT-Large. 00:32:08.000 |
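If you want to see the WordPiece tokenizer in action, the pre-trained BERT tokenizer is available through the Hugging Face transformers library (this sketch assumes the package is installed and uses the standard bert-base-uncased checkpoint):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)                           # roughly 30,000 WordPiece tokens
print(tokenizer.tokenize("Shanghai is unbelievable"))
# A rare word is split into sub-word tokens, something like ['un', '##bel', '##ie', '##va', '##ble'].
```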
Ok, let's see the differences between BERT and the language models like GPT and LLAMA. 00:32:14.140 |
In my slide I call these kind of models the common language models, so they are commonly 00:32:19.200 |
known as language models, so like GPT and LLAMA. 00:32:22.200 |
So unlike them, BERT does not handle special tasks with prompts, but rather it can be specialized 00:32:29.040 |
on a particular task by means of fine-tuning, and we will see what I mean by this in the 00:32:35.920 |
The second difference is that BERT has been trained using the left context and the right 00:32:40.040 |
context, so we will see also what we mean by this. 00:32:45.040 |
BERT is not built specifically for text generation. 00:32:48.400 |
So for example, you can use LLAMA to generate a big article given a prompt, but you cannot 00:32:54.960 |
BERT is useful for other kind of tasks, and we will see which ones. 00:32:59.160 |
And BERT has not been trained on the next token prediction task. 00:33:03.320 |
So the model that we trained initially on the Chinese poems was trained on the next 00:33:08.080 |
token prediction task, but BERT has not been trained on this particular task. 00:33:12.440 |
BERT has been trained on the masked language model and the next sentence prediction task. 00:33:19.200 |
Ok, so let's see how we handle different tasks in GPT or in LLaMA, and how we handle them in BERT. 00:33:30.320 |
If we want to do, for example, question answering with GPT or with LLaMA, what we do is we build a particular prompt, 00:33:35.640 |
and this is called a few shot prompting, in which we teach the model how to handle a task 00:33:43.480 |
inside of the prompt, and then, in the last part of the prompt, 00:33:50.760 |
we let the model tell us the answer by predicting the next tokens. 00:33:55.760 |
So for example, if we tell the model that given the context and the question how to 00:34:00.280 |
build the answer, then given the context and the question the model should be able to come 00:34:06.520 |
up with an answer that makes sense given the previous example. 00:34:11.280 |
While in BERT we do not work with a prompt, like we do with ChatGPT or with LLaMA or 00:34:17.200 |
with GPT, but we fine tune BERT on the specific task we want to work on, and we will see how. 00:34:25.880 |
As I said before, language models are models that predict the next token using only the 00:34:32.420 |
left context of each word, that is, the tokens that come to the left side of each word. 00:34:41.880 |
BERT uses the left context and the right context. 00:34:45.320 |
So I want to give you some intuition into why we humans may also be using the left and the right context. 00:34:51.800 |
Now, I am not really a linguist, so my intuition will not be maybe technically valid, but it 00:34:58.000 |
will help you understand the importance of left and right context also in human conversations. 00:35:05.920 |
The left context in human conversation is used every time we have a phone conversation. 00:35:09.920 |
For example, the operator's answer is based on the user's input. 00:35:15.720 |
So the user says something, then the operator will say something, then the user will reply 00:35:19.960 |
with something based on what the operator said, and then the operator will continue 00:35:24.360 |
based on the context given by the previous conversation. 00:35:31.120 |
For right context, it's more difficult to visualize. 00:35:34.320 |
For example, imagine there is a kid who just broke his mom's favorite necklace. 00:35:38.680 |
The kid doesn't want to tell the truth to his mom, so he decides to make up a lie. 00:35:43.320 |
So instead of saying directly to his mom, your favorite necklace has broken, the kid 00:35:48.640 |
may say, "Mom, I just saw the cat playing in your room and your favorite necklace has 00:35:54.640 |
Or it may say, "Mom, alien came through the window with laser guns and your favorite necklace 00:36:01.880 |
As you can see, the kid conditioned the lie on what he wants to say next. 00:36:08.400 |
So it doesn't even matter which lie the kid says, because he wants to create some context 00:36:16.200 |
before arriving to the conclusion, which is, "Your favorite necklace has broken." 00:36:20.760 |
So he conditions the words he chooses initially on what he wants to say next. 00:36:29.840 |
And this could be an intuition in how we humans use the right context. 00:36:38.680 |
Now let's talk about the pre-training of BERT. 00:36:41.520 |
Because, for example, LLaMA or GPT have been pre-trained on a large corpus of text. 00:36:47.560 |
And then we can use them with prompting or with fine tuning. 00:36:51.360 |
BERT has not been pre-trained like LLaMA or GPT using the next token prediction task, 00:36:58.440 |
but on two specific tasks called the Masked Language Model and the Next Sentence Prediction task. 00:37:07.560 |
The Masked Language Model is also known as the Cloze task. 00:37:10.880 |
And you may know it from some papers or tests that you have done at university. 00:37:16.300 |
For example, the teacher gives you a sentence in which one word is missing and you have 00:37:20.780 |
to fill the empty space with the missing word. 00:37:24.400 |
And this is how BERT has been trained with the Masked Language Model. 00:37:28.080 |
Basically what they did is they took some sentences from the corpus used for pre-training. 00:37:34.700 |
They choose some tokens randomly and they replace these random tokens with a special token called [MASK]. 00:37:45.300 |
Then they feed this masked input to BERT and BERT has to come up with the word that was masked out. 00:38:00.820 |
So first we need to understand how BERT uses the left and the right context. 00:38:07.820 |
So as you saw before, when we compute the attention we use the formula softmax of query 00:38:12.480 |
multiplied by the transpose of the keys divided by the square root of dk, and then we multiply by V. 00:38:17.460 |
This means that we take the query matrix, we multiply it by the transpose of the keys, 00:38:24.000 |
we do the softmax and this will produce this matrix here that we saw before. 00:38:28.640 |
But unlike before, like we did for Language Models, in this case we will not use any mask 00:38:35.940 |
to cancel out the interactions of words with words that come after it. 00:38:41.920 |
So for example before we replaced the value here, for example, with minus infinity and 00:38:46.640 |
also this value with minus infinity and actually all the values above this diagonal with minus 00:38:51.440 |
infinity because we didn't want the model to learn these interactions. 00:38:55.160 |
But with BERT we do not use any mask, which means that each token attends to the tokens to its left and to its right. 00:39:05.600 |
Ok, let's review the details of the Masked Language Model. 00:39:10.300 |
So imagine we want to mask this sentence, so Rome is the capital of Italy, which is 00:39:18.880 |
The pre-training procedure selects 15% of the tokens from this sentence to be masked. 00:39:24.980 |
If a token is selected, for example the word capital is selected, then 80% of the time 00:39:30.720 |
it is replaced with the masked token, becoming this input for BERT, 10% of the time it's 00:39:36.880 |
replaced with a random token, Zebra, and 10% of the time it's not replaced at all, so it 00:39:44.640 |
In any of these three cases, BERT has to predict the word that was masked out, so BERT should 00:39:49.920 |
output capital for each of these three inputs, which occur with the probabilities described above. 00:39:58.880 |
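A simplified sketch of this selection rule (assuming a 1-D tensor of token IDs and a hypothetical mask_token_id; a real implementation would also avoid masking special tokens):

```python
import random
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Return the corrupted input and the labels: -100 everywhere except at masked positions."""
    input_ids = input_ids.clone()
    labels = torch.full_like(input_ids, -100)         # -100 is ignored by cross_entropy
    for i in range(input_ids.size(0)):
        if random.random() < mlm_prob:                # select roughly 15% of the tokens
            labels[i] = input_ids[i]                  # the model must predict the original token here
            r = random.random()
            if r < 0.8:                               # 80% of the time: replace with [MASK]
                input_ids[i] = mask_token_id
            elif r < 0.9:                             # 10% of the time: replace with a random token
                input_ids[i] = random.randrange(vocab_size)
            # remaining 10% of the time: keep the original token unchanged
    return input_ids, labels
```

The labels built this way can then be used with cross_entropy(..., ignore_index=-100), so that, as described next, the loss is computed only at the masked positions.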
So during training, what we do, for example, imagine this is our initial sentence in which 00:40:04.840 |
we masked out the word capital like we saw before, this will result in an input to be 00:40:09.760 |
fed to BERT of 14 tokens, if we count the tokens here, we feed it to BERT, BERT will 00:40:17.160 |
produce an output because BERT is a transformer model so 14 tokens of input result in 14 tokens 00:40:25.480 |
What we do is we only check the output of BERT at the position that was masked out, so this token 00:40:30.640 |
here, we compare it with our target, the word capital, we compute the loss, and we run backpropagation to update the weights. 00:40:39.760 |
So basically what we want is this token here to be capital. 00:40:47.560 |
Okay now let's review the next sentence prediction task. 00:40:51.920 |
Next sentence prediction task means that we have a text, we choose two sentences randomly, 00:41:00.040 |
actually we select one sentence randomly, so the sentence A, and then 50% of the time 00:41:05.920 |
we select its immediately next sentence, so the second line, or 50% of the time we select 00:41:12.400 |
a random sentence from the rest of the text, in this case this one, the third line. 00:41:17.480 |
In our case we selected the first line as sentence A, and the sentence B is the third 00:41:22.800 |
line, so it's not the immediately following sentence. 00:41:26.880 |
We feed both of these sentences to BERT, and BERT has to predict if the sentence B comes immediately after sentence A. 00:41:35.320 |
In this case, because sentence B is the third line, and sentence A is the first line, it's 00:41:41.880 |
not immediately following, so BERT should reply with "not next". 00:41:46.480 |
In the case we had selected the second line as sentence B, BERT should reply with "is next". 00:41:52.840 |
We have two problems here, how can we encode two sentences to become the input for BERT, 00:42:00.360 |
and the second problem is how can BERT tell us that it's the next sentence or it's not the next sentence. 00:42:09.800 |
Well the idea is this, we take the two sentences and we encode it as only one sentence, it 00:42:17.600 |
becomes one input, so the tokens of the two sentences become one input concatenated together, 00:42:23.920 |
in which we prepend one special token called CLS, then the tokens of the first sentence, 00:42:30.160 |
so suppose the sentence is "my dog is cute", then we add the token called separator, then 00:42:35.200 |
the tokens of the second sentence, so this "he likes playing", and then another separator token at the end. 00:42:43.580 |
The problem is, if we feed only this input to BERT, BERT will not be able to understand 00:42:48.560 |
that this "my" belongs to sentence A and this "likes" belongs to sentence B, so what we 00:42:54.640 |
did before initially, as I told you before, when we feed the input to any language model 00:43:00.560 |
that uses transformer, we first tokenize it, then we convert it into embedding vectors 00:43:06.080 |
of size 512 in the case of the vanilla transformer or 768 in the case of BERT, then we add another 00:43:13.720 |
vector that represents the position of this token, so the position embeddings, in BERT 00:43:18.720 |
we have another embedding called segment embedding, so it's another vector that we add to the 00:43:24.240 |
position embedding and to the token embedding and represents the fact that this token belongs 00:43:29.920 |
to sentence A or to sentence B, and this is how we encode the input for BERT. 00:43:36.760 |
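As a sketch, BERT's input representation is just the sum of three embedding lookups; the token IDs below are illustrative placeholders, not real WordPiece IDs, and the sizes assume BERT-Base:

```python
import torch
import torch.nn as nn

hidden, vocab_size, max_pos = 768, 30000, 512            # BERT-Base sizes (vocabulary roughly 30k)

token_emb    = nn.Embedding(vocab_size, hidden)
position_emb = nn.Embedding(max_pos, hidden)
segment_emb  = nn.Embedding(2, hidden)                   # 0 = sentence A, 1 = sentence B

# Hypothetical encoded pair: [CLS] my dog is cute [SEP] he likes playing [SEP] (placeholder IDs).
input_ids    = torch.tensor([[101, 11, 12, 13, 14, 102, 15, 16, 17, 102]])
segment_ids  = torch.tensor([[0,   0,  0,  0,  0,  0,   1,  1,  1,  1]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

# The input to the first encoder layer is simply the sum of the three embeddings.
embeddings = token_emb(input_ids) + position_emb(position_ids) + segment_emb(segment_ids)
print(embeddings.shape)                                  # torch.Size([1, 10, 768])
```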
Let's see how we train it, so we create our input which is the first line of this poem 00:43:44.240 |
and the third line of this poem together with the separator token we can see here and a 00:43:50.280 |
special token called CLS here, and we also encode the information that all these tokens 00:43:57.360 |
belong to sentence A and all these tokens belong to sentence B by using this special embedding, 00:44:03.920 |
this one here, the segment embedding we saw before. 00:44:07.480 |
Now we feed this input to BERT, BERT will come out with an output because it's a transformer 00:44:15.760 |
model so an input of 20 tokens corresponds to an output of 20 tokens. 00:44:21.480 |
We take only the first token of the output, the one that corresponds to the token CLS 00:44:28.040 |
which stands for classifier, we feed it to a linear layer with only two output features, 00:44:35.120 |
one feature indicating next and one indicating not next, we apply the softmax, we compare 00:44:41.880 |
it with our target so we expect BERT to say not next because we fed it the third line 00:44:48.680 |
as sentence B and not the second line, and then we compute the loss which is the cross 00:44:54.120 |
entropy loss, and we run backpropagation to update the weights, and this is how we train BERT on the next sentence prediction task. 00:45:02.600 |
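A minimal sketch of that classification step, assuming BERT-Base's hidden size and a made-up output tensor in place of the real encoder output:

```python
import torch
import torch.nn as nn

hidden = 768
nsp_head = nn.Linear(hidden, 2)                        # two output features: "is next" / "not next"

sequence_output = torch.randn(1, 20, hidden)           # stand-in for BERT's output, 20 tokens
cls_output = sequence_output[:, 0]                     # take only the first token, i.e. [CLS]

logits = nsp_head(cls_output)                          # (1, 2)
target = torch.tensor([1])                             # our own convention: 0 = "is next", 1 = "not next"
loss = nn.functional.cross_entropy(logits, target)     # cross entropy loss, then backpropagation
loss.backward()
```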
Now this CLS token, let's review how this works, so as we saw before the formula for 00:45:08.800 |
the attention is a query multiplied by the keys, the transpose of the keys divided by 00:45:12.760 |
the square root of 768, we apply the softmax and this will produce this attention score matrix. 00:45:23.160 |
Now as we can see the CLS token always interacts with all the other tokens because we didn't 00:45:29.000 |
apply any causal mask, so we can consider that the CLS token acts as a token that captures 00:45:36.680 |
the information from all the other tokens because the attention matrix here, we didn't 00:45:44.280 |
apply any mask before applying the softmax, so all of these attention values will be actually 00:45:50.760 |
learned by the model, and this is the idea behind this CLS token. 00:45:56.840 |
So if we do for example the matrix multiplication that we did before, so let's compute the first 00:46:02.400 |
row of the output matrix, let's see, so this matrix here is the input matrix which is 10 00:46:12.240 |
by 768, because suppose the input is very simple, so before my bed lies a pool of moon 00:46:17.040 |
bright, so 10 tokens of input, 1, 2, 3, etc., so 10 tokens, each of them has 768 dimensions 00:46:27.840 |
because we are talking about BERT, so 768, so the dimensions from the first one 00:46:36.980 |
up to 768. This will result in an output matrix of 10 by 768. 00:46:52.960 |
We are only interested in this output here, which corresponds to the position of the CLS 00:46:59.280 |
token, let's see, the first dimension, so the dimension number 1 of this output token 00:47:05.840 |
will be the dot product of this vector here, which is made of 10 dimensions, with the first 00:47:14.440 |
column of this matrix here, which is also made of 10 dimensions because we have 10 tokens, 00:47:20.660 |
but because none of the values here is 0 (actually here there is a 0 because I chose random numbers, 00:47:26.000 |
but suppose it is 0.03 or 0.04, let's say), because none of the values in this matrix 00:47:36.560 |
is 0, the output for the CLS will be able to access the attention scores of all the tokens, 00:47:43.520 |
so basically this token here will aggregate the attention scores, so the relationship 00:47:49.440 |
with all the tokens, the CLS can also be thought of as the CEO in a company and you are the 00:47:56.800 |
shareholder: when you are the shareholder, you don't ask the employees for information, 00:48:00.640 |
you ask the CEO, and the CEO's job is to talk to everyone in the company, to every 00:48:07.720 |
person in the company to get the necessary information to reach the goal, and this is 00:48:13.920 |
the goal of the CLS, the CLS can be thought of as the aggregator of all the information, 00:48:19.020 |
of all the information present inside of the sentence, and we use it to classify; that's why it's called CLS, for classifier. 00:48:28.040 |
Okay now let's talk about fine-tuning, as we saw before, BERT does not work with prompts 00:48:34.720 |
like LLAMA or GPT, so we cannot use zero-shot prompting or few-shot prompting or chain of 00:48:42.240 |
thoughts or any other prompting technique, with BERT we work with fine-tuning, so we 00:48:47.200 |
take the pre-trained model, and if we want to do text classification, we fine-tune BERT 00:48:51.960 |
on our dataset for text classification, or question answering, let's see how these two 00:48:59.440 |
Suppose we want to do text classification, so text classification is the task of assigning 00:49:05.320 |
a label to a piece of text, for example, imagine you are running an internet provider and we 00:49:09.960 |
receive complaints from our customers, we may want to classify requests coming from 00:49:15.060 |
users as hardware problems, software problems or billing problems, for example, this complaint 00:49:21.360 |
here is definitely a hardware problem, this complaint here is definitely a software problem 00:49:27.160 |
and this one is definitely a billing problem, we want to classify automatically this request 00:49:33.360 |
that we keep receiving from customers, how do we do that? 00:49:37.440 |
Well we take our request, we feed it to BERT and BERT should tell us which one of the three 00:49:43.280 |
options it best represents this particular request, how can BERT tell us one of these 00:49:51.960 |
three options and how can we feed our request to BERT? 00:49:57.000 |
So when we train BERT for text classification, we create our input, we prepend to the request 00:50:03.440 |
text the classifier token, so the CLS token, we feed it to BERT, BERT will come up with 00:50:10.840 |
an output, so 16 input tokens correspond to 16 output tokens, we only care about the first 00:50:16.200 |
one, which is the one corresponding to the CLS token, we send the output to a linear 00:50:22.640 |
layer with three output features, because we have three possible classes, one is software, 00:50:28.280 |
one is hardware, one is billing and then we apply Softmax, we compare it with what we 00:50:33.160 |
expect BERT to learn about this particular request, that is that this request is hardware, 00:50:40.440 |
then we calculate the loss, which is the cross entropy loss and finally we run backpropagation 00:50:44.460 |
to update the weights and this is how we fine-tune BERT on text classification. 00:50:54.920 |
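A short sketch of this fine-tuning step using the Hugging Face transformers library, which provides exactly this kind of linear head on top of the [CLS] output; the complaint text and the label mapping are made up:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# Our own label mapping: 0 = hardware, 1 = software, 2 = billing.

batch = tokenizer("My router keeps rebooting every five minutes", return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([0]))     # linear head with 3 features on the [CLS] output

outputs.loss.backward()                                # cross entropy loss + backpropagation
print(outputs.logits)                                  # (1, 3) scores over the three classes
```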
Question answering basically means this, we have a context from which we need to extract 00:51:00.360 |
the answer to a particular question, for example the context is Shanghai is a city in China, 00:51:07.520 |
it is also a financial center, its fashion capital and an industrial city, the question 00:51:12.200 |
is what is the fashion capital of China, well the model should highlight the word or tell 00:51:17.920 |
us in which part of the context we can find the answer, so BERT should be able to tell us 00:51:24.720 |
where the answer starts and where the answer ends in the context. 00:51:31.640 |
But we have two problems, first we need to find a way for BERT to understand which part 00:51:36.120 |
of the input is the context and which one is the question, second we need to find a 00:51:41.040 |
way for BERT to tell us where the answer starts and where the answer ends, let's see both 00:51:48.640 |
The first problem can be solved easily using the segment embedding we saw before, so we 00:51:54.120 |
concatenate the question and the context together as a single input with the separator token 00:52:00.220 |
in the middle like we saw before for the next sentence prediction task, the question will 00:52:09.920 |
be encoded as sentence A while the context will be encoded as sentence B, so this problem is solved. 00:52:20.100 |
How do we get the answer from BERT, well let's see, so first we prepare the input for BERT, 00:52:27.920 |
so we prepend the CLS token, what is the fashion capital of China, separator, Shanghai is a 00:52:34.200 |
city in China etc, so this is the context which has been encoded as sentence B and the 00:52:38.720 |
first part has been encoded as sentence A, we feed it to BERT, BERT will come up with 00:52:45.040 |
the output which is 27 tokens because the input is made up of 27 tokens, we also know 00:52:51.960 |
which tokens correspond to which sentence, so which correspond to the sentence A, which 00:52:57.520 |
correspond to the sentence B because we give it as input, we apply a linear layer with 00:53:03.900 |
two output features, one that indicates if one particular token is the start token and 00:53:11.040 |
another feature that indicates if the token is an end token. We know where the answer is 00:53:18.720 |
because we know the answer is the word Shanghai, so the start should be the token 10 00:53:24.160 |
and the end should also be the token 10. Then we calculate the loss based on our target and 00:53:29.600 |
the output of this linear layer, we run backpropagation, and this is how we fine-tune BERT for question answering. 00:53:35.480 |
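A minimal sketch of the question answering head just described, assuming BERT-Base's hidden size and a made-up output tensor in place of the real encoder output:

```python
import torch
import torch.nn as nn

hidden, seq_len = 768, 27
qa_head = nn.Linear(hidden, 2)                         # feature 0 = "is start", feature 1 = "is end"

sequence_output = torch.randn(1, seq_len, hidden)      # stand-in for BERT's output over the 27 tokens
logits = qa_head(sequence_output)                      # (1, 27, 2)
start_logits, end_logits = logits[..., 0], logits[..., 1]

# Target: the answer "Shanghai" is the token at position 10, so start = end = 10.
start_target, end_target = torch.tensor([10]), torch.tensor([10])
loss = (nn.functional.cross_entropy(start_logits, start_target)
        + nn.functional.cross_entropy(end_logits, end_target))
loss.backward()                                        # backpropagation, as in the other fine-tuning tasks
```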
And this is it, guys. I hope that you liked my video. I used a very unconventional 00:53:43.900 |
way of describing BERT, that is I started from language models, I introduced the concept 00:53:49.200 |
of how language models work and then I introduced BERT because I wanted to create a comparison 00:53:54.280 |
of how BERT works versus how other language models work so that you can appreciate the 00:54:00.440 |
qualities and the weaknesses of both. BERT is actually not such a recent model; it was introduced 00:54:06.240 |
in 2018 if I remember correctly, so it is quite aged but still very relevant for a lot 00:54:11.880 |
of tasks and I hope that you will be coming again to my channel for more content so please 00:54:18.520 |
subscribe and share this video if you like it, if you have any questions please write 00:54:23.140 |
it in the comment, I am also very active on LinkedIn if you want to add me, if you want 00:54:29.840 |
to have some particular video review, model review, write it in the comment and I hope 00:54:36.640 |
that in my next video I will be able to code BERT from scratch, so using PyTorch so we 00:54:42.000 |
can also learn, put to practice all the knowledge that we acquired in today's video, thank you 00:54:48.000 |
again for coming to my channel guys and have a nice day!