
BERT explained: Training, Inference, BERT vs GPT/LLaMA, Fine tuning, [CLS] token


Chapters

0:00 Introduction
2:00 Language Models
3:10 Training (Language Models)
7:23 Inference (Language Models)
9:15 Transformer architecture (Encoder)
10:28 Input Embeddings
14:17 Positional Encoding
17:14 Self-Attention and causal mask
29:14 BERT (overview)
32:08 BERT vs GPT/LLaMA
34:25 Left context and right context
36:36 BERT pre-training
37:05 Masked Language Model
45:01 [CLS] token
48:26 BERT fine-tuning
49:00 Text classification
50:50 Question answering

Whisper Transcript

00:00:00.000 | Hello guys, welcome to my new video on BERT.
00:00:02.520 | In this video I will be explaining BERT from scratch.
00:00:05.860 | What do I mean by this?
00:00:07.500 | I will explain actually all the building blocks that make up BERT along with all the background
00:00:13.280 | knowledge that you need to understand this.
00:00:15.780 | So we will start with a little review of what are language models, how they are trained,
00:00:20.360 | how we inference language models, then we will see the transformer architecture, at
00:00:24.400 | least the one used by language models, so the encoder part.
00:00:27.680 | We will review embedding vectors, positional encoding, self-attention and causal mask.
00:00:32.560 | And then I will introduce BERT.
00:00:34.640 | So I will first give you a big background knowledge before you can start learning BERT
00:00:42.600 | so you understand each building blocks of BERT.
00:00:45.640 | And then we will see concepts like what is the left context, what is the right context
00:00:49.960 | and how it's used in BERT.
00:00:51.960 | We will see the two tasks on which BERT has been pre-trained, so the masked language model
00:00:55.760 | and the next sentence prediction task.
00:00:58.640 | And finally, we will also see what is fine tuning and how do we fine tune BERT with the
00:01:05.280 | text classification task and the question answering task.
00:01:09.200 | What do I expect you to know before watching this video?
00:01:12.240 | Well, for sure, I hope that you are familiar with the transformer model.
00:01:17.100 | For example, if you have watched my previous video on the transformer model, that would
00:01:20.400 | be great because even if I will review some of the concepts of the transformer, I will
00:01:27.480 | not review all of them.
00:01:28.480 | So for example, I will not touch concepts like the cross attention or the multi-head
00:01:32.680 | attention or the normalization, the feed-forward layer, et cetera, et cetera, because I will
00:01:38.320 | also only give you enough background to understand BERT, so it's not a video on the transformer
00:01:43.160 | model.
00:01:44.160 | So please, if you are not familiar with the transformer, please go watch my previous video
00:01:47.680 | on the transformer model and then you can watch this video if you want to fully understand
00:01:52.240 | BERT.
00:01:53.440 | So let's start our journey with language models.
00:01:57.560 | What is a language model?
00:01:59.000 | Well, a language model is a probabilistic model that assigns probabilities to sequences
00:02:04.360 | of words.
00:02:05.360 | In practice, a language model allows us to compute the following probability, which is
00:02:10.280 | the probability of the word, for example, China, following the sentence, "Shanghai is
00:02:17.120 | a city in".
00:02:18.780 | So what is the probability that the word China comes next in the sentence, "Shanghai is a
00:02:24.520 | city in"?
00:02:25.520 | This is the kind of probability that we model using language models, which is a neural network
00:02:30.880 | trained on a very large corpus of text.
00:02:34.160 | When this corpus of text is very, very, very large, we also call them large language
00:02:39.120 | models.
00:02:40.120 | There are many examples of large language models.
00:02:41.920 | For example, we have LLaMA.
00:02:43.440 | For example, we have GPT.
00:02:45.200 | They are also called foundation models because they have been pre-trained on a very big corpus
00:02:51.120 | of text, for example, the entire Wikipedia or billions of pages from the internet.
00:02:56.620 | And then we can use them with prompting or with fine tuning.
00:03:01.360 | Later we will see how do we do this.
00:03:04.240 | Okay, let's review how we train a large language model or a language model in general, actually.
00:03:11.160 | So the training of a large language model requires that we have some corpus, so a piece
00:03:16.800 | of text, which could be the entire Wikipedia.
00:03:19.920 | It could be web pages.
00:03:21.720 | It could be just one book or it could be anything.
00:03:24.240 | In my example, imagine we want to train a large language model
00:03:29.240 | on Chinese poems.
00:03:31.040 | And suppose we only have one poem.
00:03:34.320 | This one, this is a very famous poem from Li Bai.
00:03:36.920 | It's one of the first poems that you learn if you want to study Chinese literature or
00:03:41.920 | the Chinese language.
00:03:43.320 | So we will concentrate only on the following line.
00:03:47.080 | So before my bed lies a pool of moon bright.
00:03:51.160 | Let's see how to train the large language model.
00:03:53.400 | Well, the first thing we do is we create a sequence of the line that we want to teach
00:03:59.200 | to the model.
00:04:00.200 | So let's start with the first line.
00:04:02.680 | We create a sentence to which we prepend one token called start of sentence.
00:04:09.000 | This sentence or input, which is made up of tokens, in this case, in our very simple case,
00:04:16.360 | we can consider that each word is a token, but this is not always the case because depending
00:04:20.600 | on the tokenizer, each word may be split into multiple tokens.
00:04:25.840 | But suppose for simplicity that our tokenizer always takes one word as one token.
00:04:31.520 | So we feed this input sequence to our neural network, which is modeling the language model.
00:04:37.000 | Usually it's the encoder part of the transformer.
00:04:39.840 | So only this part here, along with the linear layer and the softmax, for example, like LLaMA.
00:04:47.160 | Then this transformer encoder will output a sequence of tokens.
00:04:51.600 | So the transformer is a sequence to sequence model, which means that if you give it a sequence of 10
00:04:56.960 | tokens, it will output a sequence of 10 tokens.
00:05:02.280 | When we train a model, we have an input.
00:05:04.880 | We also have a target.
00:05:06.560 | That is what we want the model to output.
00:05:11.340 | So we want the model to output the same sentence, but without prepending anything, but by appending
00:05:17.280 | one last token called end of sentence.
00:05:20.080 | So the total length is still 10 tokens.
00:05:22.960 | So it's the same sentence as before, but instead of having a token at the beginning, it has
00:05:27.640 | a token at the end.
00:05:29.200 | This token is called end of sentence token.
00:05:32.400 | Now let's review why do we need this start of sentence token and end of sentence token?
00:05:37.360 | Because as I said before, the neural network that we are using, which is a transformer, is
00:05:41.520 | a sequence to sequence model.
00:05:43.280 | It means that if we have an input of N tokens as input, it will produce N tokens as output.
00:05:49.600 | So if, for example, we give this neural network only the first token, so the start of sentence,
00:05:55.800 | it will only output one token as output.
00:05:58.520 | So in case it has already been trained, it should output the first token of the target.
00:06:03.320 | So let me switch to the pen.
00:06:08.920 | Okay.
00:06:09.960 | If we input the first two tokens, for example, start of sentence, before, it should output
00:06:15.680 | the first two tokens of the target, before my. If we input the first three tokens, start
00:06:23.880 | of sentence before my, it should output the first three tokens of the target.
00:06:28.560 | So before my bed, as you can see, every time we give it an input, the last token of the
00:06:33.800 | output, in case it has already been trained and it's matching the target is the next token
00:06:40.080 | that we need to complete the sentence.
00:06:42.160 | So for example, if we give only the first two tokens, start of sentence before the model
00:06:47.840 | outputs before my, so it will give us the next token after before.
00:06:52.840 | If we give it start of sentence before my, it will output before my bed.
00:06:58.220 | So the next token after the word my, so bed, et cetera, et cetera.
00:07:02.520 | So every time we give the model a token and the model will return the next token as the
00:07:07.920 | last token in the output sequence.
00:07:10.440 | And this is how we train a large language model.
00:07:13.400 | So once we have our target token and the output, we compute the loss, which is the cross entropy
00:07:18.840 | loss and we run back propagation.
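To make the training step above concrete, here is a minimal sketch in PyTorch, assuming a stand-in model and made-up token IDs; a real language model would replace the tiny Embedding+Linear stack:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512                  # hypothetical vocabulary size and embedding size
sos, eos = 1, 2                                  # hypothetical special-token IDs
line = [45, 12, 87, 33, 9, 54, 21, 76, 5]        # made-up IDs for "before my bed lies a pool of moon bright"

# Input: [SOS] + line ; Target: the same line + [EOS] (same length, shifted by one position)
inputs  = torch.tensor([[sos] + line])           # shape (1, 10)
targets = torch.tensor([line + [eos]])           # shape (1, 10)

# Stand-in for the transformer encoder + linear + softmax described above.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

logits = model(inputs)                           # (1, 10, vocab_size): one prediction per position
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                  # back propagation; an optimizer step would follow
```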
00:07:22.800 | Now let's review how we inference from a language model.
00:07:27.080 | So imagine you are a student who had to memorize Li Bai's poem, but you only remember the first
00:07:33.000 | two words.
00:07:34.000 | How do you survive an exam?
00:07:36.360 | So you only remember the first two words of the first line of the poem.
00:07:42.020 | What you could do, imagine you already have a language model that has been trained on
00:07:45.500 | Chinese poems.
00:07:46.500 | You could ask the language model to write the rest of the poem for you, but of course
00:07:50.580 | you need to give the language model some input on what you want.
00:07:54.060 | That input is called the prompt.
00:07:55.920 | So you tell the language model the first two tokens and the model will come up with the
00:08:00.760 | following tokens that make up the poem.
00:08:05.580 | Let's review the poem again, which is before my bed lies a pool of moon bright.
00:08:09.880 | So let's start our inferencing.
00:08:12.600 | We give the model our first two tokens by prepending the start of sentence token
00:08:18.520 | and we feed it to the neural network, which has been already pre-trained.
00:08:23.680 | Then the model should output before my bed.
00:08:27.040 | We take this last token, we append it to the input and we give it back to the model.
00:08:33.200 | So now we give it before my bed and we feed it again to the model and the model will output
00:08:39.000 | the next token, which is lies.
00:08:41.320 | We take this last token lies, we append it again to the input and we feed it again to
00:08:46.780 | the transformer model and the model will output the next token of the line.
00:08:52.280 | And then we keep doing like this until we arrive to the end of the line or the end of
00:08:57.760 | the poem, which is indicated by the end of sentence token, depending on how we train
00:09:04.000 | the model.
00:09:05.000 | In our case, suppose we only trained it on the first line of this poem.
00:09:09.480 | And this is how we inference from a language model.
00:09:12.280 | In a language model like LLaMA, for example.
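As a sketch of this inference loop (same hypothetical stand-in model and made-up token IDs as before), greedy decoding looks roughly like this:

```python
import torch
import torch.nn as nn

vocab_size, d_model, eos = 1000, 512, 2                    # hypothetical values
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

tokens = [1, 45, 12]                                       # prompt: [SOS], "before", "my" (made-up IDs)

for _ in range(20):                                        # cap on the number of generated tokens
    logits = model(torch.tensor([tokens]))                 # (1, len(tokens), vocab_size)
    next_token = logits[0, -1].argmax().item()             # greedy: most likely token at the last position
    tokens.append(next_token)                              # append it to the input and feed it back
    if next_token == eos:                                  # stop at the end-of-sentence token
        break
```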
00:09:15.240 | Now let's understand the architecture of the transformer model, at least the part that
00:09:20.000 | we are interested in, which is the encoder, because this is the architecture that is also
00:09:25.200 | used in BERT.
00:09:26.400 | So we need to understand the building blocks.
00:09:28.500 | Even if you have already watched my video on the transformer model,
00:09:32.480 | let's review all these concepts again.
00:09:35.200 | So as I said before, this is the vanilla transformer,
00:09:40.680 | the transformer model as presented in the original paper,
00:09:45.880 | "Attention Is All You Need".
00:09:47.400 | Most language models are actually modeled just using the encoder side or the decoder
00:09:52.520 | side, depending on the application.
00:09:55.080 | And they have the last linear layer to project back the tokens in the vocabulary.
00:10:00.880 | Let's review all these building blocks of this model.
00:10:04.000 | I will not review actually all of them.
00:10:05.720 | So I will not review the normalization, the linear layer, or the feed forward.
00:10:09.760 | Because this, I hope you're already familiar with.
00:10:12.120 | What I am actually interested in is the input embeddings, the positional encodings, and
00:10:15.640 | the multi-head attention, for which I will only actually do the case of the single head,
00:10:21.360 | because it's easier to visualize.
00:10:23.360 | So if you want to have more information, please watch my previous video on the transformer
00:10:26.600 | model.
00:10:27.600 | Now, let's review the embedding vectors, what they are and how do we use them.
00:10:33.160 | Okay.
00:10:34.160 | Usually when we train a language model or we inference a language model, we use a prompt
00:10:38.920 | or some input for training.
00:10:41.560 | This input is a text.
00:10:43.160 | The first thing we do is we split this text into tokens.
00:10:47.400 | In our simple case, we will do a very simple tokenization in which we split each word.
00:10:53.680 | Each word becomes a token.
00:10:55.560 | This is actually not always the case with language models.
00:10:58.660 | For example, in LLaMA or in other language models, we use the BPE tokenizer.
00:11:03.940 | In BERT, we will see, we use the word piece tokenizer.
00:11:06.920 | So each word can become multiple tokens.
00:11:10.060 | In our simple case, we just pretend that each word is actually a token with some tokens
00:11:16.000 | actually not mapping to words.
00:11:17.520 | For example, the start of sentence token is a special token that exists only virtually.
00:11:22.040 | It's not actually part of the training text.
00:11:24.840 | Okay.
00:11:25.920 | The first thing we do is we do this tokenization.
00:11:28.120 | So each word becomes a token.
00:11:30.320 | We map each token into its position in the vocabulary.
00:11:34.120 | So imagine you have a very big corpus of text.
00:11:37.160 | This text is made up of words.
00:11:39.440 | Each word will occupy a position in the vocabulary.
00:11:42.960 | So we map each word to its position in the vocabulary.
00:11:46.440 | In this case, each token into its position in the vocabulary.
00:11:51.000 | Then we map each of these numbers, which are the position of the token in the vocabulary
00:11:57.040 | to an embedding vector of size 512.
00:12:01.120 | Now this size of size 512 is the one used in the vanilla transformer.
00:12:07.400 | We will see that in BERT the size is 768, if I'm not mistaken.
00:12:13.000 | But for now, I will only always refer to the configuration of the vanilla transformer.
00:12:18.360 | So the transformer as presented in "Attention Is All You Need".
00:12:22.640 | So each of these input IDs, which is the position of the token of each word is projected into
00:12:28.560 | an embedding.
00:12:29.560 | This embedding is a vector of size 512 that captures the meaning of each token, in this
00:12:36.360 | case of each word.
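In code, this mapping from token IDs to vectors is just an embedding lookup; a minimal sketch with made-up IDs:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512            # hypothetical vocabulary size, vanilla-transformer embedding size
embedding = nn.Embedding(vocab_size, d_model)

input_ids = torch.tensor([1, 45, 12, 87])  # [SOS], "before", "my", "bed" as made-up vocabulary positions
vectors = embedding(input_ids)             # shape (4, 512): one 512-dimensional vector per token
```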
00:12:38.560 | But why do we use embedding vectors to capture the meaning of each token?
00:12:45.320 | Let's review.
00:12:47.000 | For example, given the word "cherry", "digital" and "information", the idea is this.
00:12:53.560 | Imagine we live in a very simple world in which the embedding vector is not made up
00:12:57.240 | of 512 dimensions, but only two dimensions.
00:13:01.080 | So we can project these vectors on the XY plane.
00:13:06.120 | If we project them, and if the embedding vectors have been trained correctly, we will see that
00:13:11.040 | the words with similar meaning will point to the same direction in space, while words
00:13:16.720 | with different meaning will point to different directions in space.
00:13:20.160 | For example, the word "digital" and "information", because they capture the same kind of semantic
00:13:27.040 | meaning "information", they will point to similar directions.
00:13:32.040 | And we can measure this similarity by measuring the angle between them.
00:13:36.080 | So for example, the angle between "digital" and "information" is very small, you can see
00:13:39.840 | here, while the angle between "cherry" and "digital" is quite big, because they represent
00:13:44.880 | different semantic groups, so they have different meaning.
00:13:48.640 | Imagine there is also another word called "tomato".
00:13:50.720 | We expect the word "tomato" to point to the vertical direction, very similar to the "cherry",
00:13:55.120 | for example, here it may be.
00:13:57.560 | So that the angle between "cherry" and "tomato" is very small, for example.
00:14:02.280 | And we measure this angle between vectors using the cosine similarity, which is based
00:14:07.360 | on the dot product between two vectors.
00:14:09.880 | And we will see that this dot product is very important, because we will use it in the attention
00:14:14.360 | mechanism that we will see later.
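A quick sketch of measuring that angle with the cosine similarity; the 2-dimensional vectors below are invented just for illustration:

```python
import torch

cherry      = torch.tensor([0.1, 0.9])     # made-up 2-D embeddings
tomato      = torch.tensor([0.2, 0.8])
digital     = torch.tensor([0.9, 0.1])
information = torch.tensor([0.8, 0.2])

def cosine_similarity(a, b):
    # dot product divided by the product of the norms: close to 1 means similar direction
    return torch.dot(a, b) / (a.norm() * b.norm())

print(cosine_similarity(cherry, tomato))   # high: similar meaning
print(cosine_similarity(cherry, digital))  # lower: different meaning
```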
00:14:17.060 | Ok, now let's review the positional encodings as presented in the original paper of the
00:14:24.920 | transformer model.
00:14:26.920 | We need to give some positional information to our model, because now we only gave some
00:14:32.120 | vectors that represent the meaning of the word.
00:14:36.520 | But we also need to tell the model that this particular word is in the position 1 in the
00:14:41.160 | sentence, and this particular word is in position 2 in the sentence, etc. etc.
00:14:45.120 | And this is the job of the positional encodings.
00:14:47.400 | Let's see how they work.
00:14:49.360 | So we start with our original sentence, which is the first line of the Chinese poem we saw
00:14:53.800 | before.
00:14:55.020 | We convert it into embedding vectors of size 512.
00:14:59.200 | Then each of these embedding vectors, we add another vector.
00:15:03.240 | This is called the positional encoding or positional embedding, I saw both names used.
00:15:09.160 | And this position embedding actually indicates the position of this particular token inside
00:15:15.520 | the sentence.
00:15:16.520 | And this vector here indicates the position 1 of this token.
00:15:20.800 | And this indicates the position 2 and the position 3 and position 4.
00:15:24.360 | Now, actually, this position embedding, at least in the vanilla transformer, they are
00:15:29.000 | computed once and they are reused for every sentence during training and inference.
00:15:34.000 | So they are not specific for this particular token, but they are only specific for this
00:15:39.000 | particular position, which means that every token in position 0, or every token in
00:15:44.920 | position 1, will receive this particular vector added to it that represents that position number.
00:15:51.960 | So the result of this addition is these vectors here, which will become the input of the encoder,
00:15:58.500 | as we will see later.
00:16:01.600 | How do we compute these positional encodings, at least as presented in the original transformer
00:16:06.200 | paper?
00:16:07.200 | Well, suppose we have a sentence made up of three words or three tokens.
00:16:12.800 | We have seen these formulas before from the paper "Attention is all you need".
00:16:17.280 | We create a vector of size 512 and for the even dimensions of this vector, we use the
00:16:23.080 | first formula and for the odd dimensions, we use the second formula, in which the arguments
00:16:28.640 | of these two formulas is, the first one is pos, which indicates the position of the
00:16:34.200 | word inside the sentence.
00:16:35.560 | So for the first token, it's 0.
00:16:38.200 | And 2i indicates the dimension of this vector to which we are applying.
00:16:43.920 | And we can compute it also for the second vector.
00:16:46.280 | For the third position, etc.
00:16:48.040 | If we have another sentence that is different, for example, I love you, which is also made
00:16:52.080 | up of three tokens, we will reuse the same vectors as the other sentence.
00:16:58.200 | So this particular vector here is associated with the position 0, not with the token before.
00:17:04.400 | So if we have another token, we will reuse the same vector for the position 0.
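Here is a minimal sketch of the sinusoidal positional encodings from "Attention Is All You Need", i.e. PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); these vectors depend only on the position, so they are computed once and reused for every sentence:

```python
import torch

def positional_encoding(seq_len, d_model=512):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1): positions 0, 1, 2, ...
    i = torch.arange(0, d_model, 2).float()            # even dimension indices 0, 2, 4, ...
    angle = pos / (10000 ** (i / d_model))             # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                     # even dimensions use the sine formula
    pe[:, 1::2] = torch.cos(angle)                     # odd dimensions use the cosine formula
    return pe

pe = positional_encoding(3)    # the same three vectors are reused for any 3-token sentence
```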
00:17:11.920 | Ok, now let's review what is the self-attention that we use in language models.
00:17:19.480 | Because language models need to find a way to relate tokens with each other so that they
00:17:24.360 | can compute some kind of interactions between tokens.
00:17:29.120 | So for example, tokens have a meaning not by themselves, but by the way they are present
00:17:35.240 | inside of the sentence and their relationship with other tokens inside of the sentence.
00:17:40.840 | And this is the job of the self-attention, which is done here in the multi-head attention.
00:17:45.420 | We will not see the multi-head attention, we will see the single head attention.
00:17:49.160 | So let's start.
00:17:50.920 | Now let's build the input for this self-attention because for now we have worked with independent
00:17:58.320 | tokens.
00:17:59.320 | So we took this token and we converted it into a vector.
00:18:03.040 | Then we added the positional encodings.
00:18:06.280 | We first converted it into embedding, then we added the positional encoding to capture
00:18:10.720 | the position information.
00:18:12.920 | But actually I lied to you by telling you that we work independently.
00:18:17.520 | Actually when we code the transformer, we always work with the matrix form.
00:18:21.720 | So all these tokens, all these vectors are never alone.
00:18:26.000 | They are always in a big matrix that contains all of them.
00:18:29.480 | So now we create this big matrix.
00:18:31.240 | But before it was not easy to work with a big matrix directly because it's not easy
00:18:36.040 | to visualize.
00:18:37.040 | So now we create a big matrix.
00:18:38.360 | So we combine all these vectors in a one big matrix here in which each row is one of these
00:18:46.080 | vectors.
00:18:47.080 | So for example, the first vector here becomes the first row.
00:18:50.560 | The second vector here becomes the second row.
00:18:54.060 | The third vector here becomes the third row, et cetera, et cetera, et cetera.
00:18:58.600 | The shape of this matrix is 10 by 512 because we have 10 tokens and each token is represented
00:19:05.680 | by a vector of 512 dimensions.
00:19:09.560 | We take this matrix and we make three copies of it.
00:19:13.040 | So three identical copies of this matrix.
00:19:15.920 | The first one, we will call it query.
00:19:18.040 | The second one, we will call it key.
00:19:20.120 | And the third one, we will call it value.
00:19:22.040 | As you can see, the values in this matrix are all the same because there are three identical
00:19:25.920 | copies.
00:19:27.680 | Why do we use three identical copies?
00:19:29.480 | Because this is the self-attention mechanism, which means that we relate tokens to each
00:19:34.720 | other, tokens that belong to the same sentence with each other.
00:19:39.600 | If we relate tokens of two different sentences or from two different languages, for example,
00:19:44.840 | when we are doing a language translation, in that case, we will talk about cross-attention.
00:19:49.840 | In this case, we talk about self-attention.
00:19:51.880 | And this is the kind of attention that is used in language models.
00:19:55.960 | OK, the self-attention mechanism, as you have seen, probably in the paper, works with this
00:20:02.320 | formula here, which is the attention is calculated as the softmax of the query multiplied by
00:20:07.680 | the transpose of the keys divided by the square root of dk, then multiplied by V.
00:20:14.160 | What is this dk here?
00:20:16.080 | The dk here actually represents the dimension of the vector of each head in case of the
00:20:22.000 | multi-head attention, because we are actually simplifying our scenario and working with
00:20:26.760 | only one head.
00:20:27.760 | In our case, dk corresponds to d model.
00:20:30.840 | So that is the size of the embedding vector that we created before, which is 512.
00:20:38.200 | So we take our matrix, the one we built before, so the query, so 10 by 512.
00:20:43.600 | We multiply it by the transpose of the keys, which becomes a 512 by 10.
00:20:48.960 | So they are basically identical matrices, but one is transposed and one is not.
00:20:54.760 | We divide it by the square root of 512.
00:20:58.280 | We apply the softmax and this will produce the following matrix we can see here, in which
00:21:04.040 | each value is the softmax of the dot product of the embedding of one token with the embedding
00:21:13.280 | of another token.
00:21:15.480 | For example, let's visualize with some vectors.
00:21:20.220 | So we built the matrix before, which is made up of 10 rows because each row is a token
00:21:29.460 | and each row contains 512 numbers because it's a vector of 512 dimensions.
00:21:34.380 | So the dimension one up to 512, then the dimension one up to 512 and we have 10 of them.
00:21:44.320 | This is the transpose of this matrix.
00:21:46.120 | So we will not have 10 rows, but we will have 10 columns of vectors with the dimensions one
00:21:53.140 | up to 512.
00:21:55.720 | Then we have another column vector here with the dimension one and then 512, etc, etc, etc.
00:22:03.860 | We have another one, etc.
00:22:06.080 | So this value here is the dot product of the first row of the first matrix with the first
00:22:13.440 | column of the second matrix, which is the embedding of the first token, which if you remember
00:22:19.040 | is the start of sentence, with the embedding of the token start of sentence, and it's this
00:22:26.480 | value here.
00:22:28.360 | Then this value here, it's the dot product of the embedding of the start of sentence
00:22:33.960 | with the second token, which is this one here and it's the token before.
00:22:38.980 | So this is before, etc.
00:22:42.280 | Then we apply the Softmax.
00:22:43.840 | The Softmax basically changes the values in this matrix in such a way that they sum up
00:22:48.560 | to one.
00:22:49.560 | So each row in this matrix, this row for example here, sums up to one and also this row here
00:22:56.440 | sums up to one, etc, etc.
00:23:00.560 | As you can see here, this word start of sentence is able to relate to the word before.
00:23:11.000 | And this is not what we want, because as I said before, our goal is to model a language
00:23:19.260 | model.
00:23:20.260 | That is, a language model is a probabilistic model that assigns probability to sequence
00:23:24.600 | of words.
00:23:25.760 | That is, we want to calculate the probability of the word China being the next word in the
00:23:31.200 | sentence Shanghai is a city in.
00:23:34.760 | That is, we want to condition the word China only on the words that come before it.
00:23:40.680 | That is, Shanghai is a city in.
00:23:42.960 | So our model should only be able to watch this part of the sentence to predict the next
00:23:48.520 | token.
00:23:49.760 | This is also called the left context.
00:23:52.520 | But this is not what is happening here.
00:23:55.360 | Because we are able to relate tokens that also come in the future with tokens that come
00:24:00.920 | in the past.
00:24:01.920 | So for example, the word SOS is being related with the token before.
00:24:06.600 | And the token SOS is also being related with the token my, even if the token my comes after it.
00:24:13.040 | So what we do is we basically introduce the causal mask.
00:24:18.120 | Let's see how it works.
00:24:20.960 | The causal mask works like this.
00:24:23.720 | We take the matrix that we saw before with all the attention scores and all the interactions
00:24:29.240 | that are not causal.
00:24:30.660 | So all the interaction of words with the words that come on its right are replaced with minus
00:24:37.120 | infinity.
00:24:38.120 | For example, start of sentence should not be able to relate to the word before.
00:24:44.160 | So we replace the interaction with minus infinity before we apply the softmax.
00:24:50.120 | And then also, for example, the word before should not be able to watch the word my bed
00:24:55.000 | lies pool.
00:24:56.140 | So all these interactions are also replaced with minus infinity.
00:24:58.900 | So basically all the values above the principal diagonal that you can see here are replaced
00:25:04.100 | with minus infinity.
00:25:06.620 | Then we apply the softmax.
00:25:08.760 | And if you remember the formula for the softmax, you can see here that the numerator is
00:25:13.240 | e to the power of z_i, which is the item to which you are applying the softmax.
00:25:20.140 | And e to the power of minus infinity will become zero.
00:25:23.080 | So basically we replace them with minus infinity so that when we apply the softmax, they will
00:25:27.680 | be replaced with zero by the softmax.
00:25:32.060 | This way the model will not have access to any information of interactions between the
00:25:38.480 | word start of sentence and all the tokens that come after it because we replace them
00:25:44.260 | with zero.
00:25:45.260 | So even if there is some kind of connection between these tokens, the model will not be able to
00:25:49.140 | learn it because we never give this information to the model.
00:25:53.020 | And this is how the model becomes causal.
00:25:55.040 | The only token that is able to watch all the previous tokens is this bright token here.
00:25:59.580 | So the last token, because this token can see all the tokens that come before it.
00:26:04.140 | That's why this line has no zero here.
00:26:08.220 | Okay, in the formula of the attention, we also have a multiplication with v.
00:26:13.900 | So we take the output of the softmax.
00:26:16.420 | So the matrix that we saw before, we apply the causal mask, and then we multiply with V.
00:26:21.860 | And this way you will understand why we apply the causal mask.
00:26:26.180 | So let's review the shapes.
00:26:28.140 | This matrix here is a 10 by 10 matrix.
00:26:31.300 | And this matrix v is the initial matrix that we built with our vectors.
00:26:36.740 | So it's 10 by 512.
00:26:38.620 | So it's the value matrix, one of the three identical copies that we made before.
00:26:43.260 | The multiplication between these two matrices will produce an output matrix that is 10 by 512.
00:26:49.620 | Let's see how the output works.
00:26:52.260 | Okay, so this is a matrix made of rows of vectors.
00:26:58.380 | So the first row is a vector of size 512.
00:27:01.620 | The second row is a vector of size 512.
00:27:04.860 | The third is also a vector of size 512.
00:27:07.800 | So with 512 dimensions here, this one also have dimension one dimension two dimension
00:27:14.180 | three, 512 dimension one dimension two dimension three, 512, etc, etc.
00:27:21.420 | The output matrix will also have the same shape.
00:27:24.220 | So it will be 10 by 512, which means we have 10 vectors of 512 dimensions.
00:27:29.860 | So the dimension one dimension two dimension three, up to 512 dimension one, two, three,
00:27:39.160 | up to 512, etc, etc.
00:27:42.620 | Now let's do this product by hand.
00:27:45.220 | To get this first value here of this matrix, so the dimension one of the first vector of
00:27:50.940 | the attention output matrix here, this value here is the dot product of the first row.
00:27:58.380 | So this row here, all this row with the first column of this matrix.
00:28:06.020 | So the first dimension of the embedding of each token in our input.
00:28:12.220 | But as you can see, because of the causal mask, most of the values in this matrix are
00:28:17.100 | zero; as you can see here, this means that the output value here will only be able
00:28:23.800 | to watch the first dimension of the first token, which also means that in the output
00:28:28.860 | of the attention, the first token will only be able to attend to itself, as you can
00:28:35.420 | see, not the values that come after it.
00:28:39.140 | Let's look at the second, for example, this output here.
00:28:42.280 | So the first dimension of the second row of this matrix attention output here.
00:28:48.940 | So this value here comes from the second row of the initial matrix.
00:28:53.200 | So this one here multiplied by the first column of this matrix.
00:28:58.540 | Now we have two values that are non-zero.
00:29:02.100 | So this means that this output here will depend only on the first two tokens because all the
00:29:09.020 | other are zero.
00:29:10.900 | And this is how we make the model causal.
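A minimal single-head self-attention sketch with an optional causal mask, following the formula softmax(Q K^T / sqrt(d_k)) V described above; in this simplified setting Q, K and V are three identical copies of the same 10 by 512 matrix:

```python
import math
import torch

def self_attention(x, causal=True):
    q, k, v = x, x, x                                     # three identical copies (no projections here)
    d_k = x.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (seq_len, seq_len) attention scores
    if causal:
        seq_len = x.shape[0]
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))  # hide future tokens (above the diagonal)
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1; the -inf entries become 0
    return weights @ v                                    # (seq_len, d_model) output

x = torch.randn(10, 512)                        # 10 tokens, 512 dimensions each
causal_out = self_attention(x, causal=True)     # GPT/LLaMA-style: left context only
bert_out   = self_attention(x, causal=False)    # BERT-style (next section): left and right context
```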
00:29:14.340 | OK, now we are ready to explore BERT and the architecture behind BERT.
00:29:20.620 | So BERT's architecture is also using the encoder of the transformer model.
00:29:26.180 | So we have input embeddings, we have positional encodings, we will see that the positional
00:29:30.140 | encodings are actually different.
00:29:32.500 | Then we have self-attention, we have normalization, we have feedforward.
00:29:36.540 | And then we have this head, this linear head we can see here.
00:29:40.100 | And this we will see later that it changes according to the specific task for which we
00:29:44.620 | are using BERT.
00:29:46.740 | BERT was introduced with two pre-trained models.
00:29:49.460 | One is BERT-Base and one is BERT-Large.
00:29:52.340 | The BERT-Base, for example, has 12 encoder layers, which means that this block here,
00:29:57.180 | so the gray block you can see here, is repeated 12 times, one after another, and the output
00:30:02.620 | of the last layer is fed to this linear layer and then to the softmax.
00:30:08.100 | The hidden size of the feedforward layer is 3072, so in the feedforward layer you
00:30:13.580 | can see here, which is basically just two linear layers, the size of the intermediate features is
00:30:19.180 | 3072.
00:30:20.660 | And then we have 12 attention heads in the multi-head attention.
00:30:24.380 | Then BERT-Large has these numbers.
00:30:26.420 | Now what are the differences between BERT and the vanilla transformer?
00:30:30.320 | The first difference is that the embedding vector is not 512 anymore, but it's 768 for
00:30:37.260 | BERT-Base and 1024 for BERT-Large.
00:30:41.540 | From now on, I will always refer to the number 768, that is, the embedding size of the vectors in
00:30:47.860 | BERT-Base.
00:30:50.420 | Another difference is that the positional encodings in the vanilla transformer were computed
00:30:54.900 | using the sine and the cosine function we saw before.
00:30:58.560 | But in BERT, these positional embeddings are not fixed and pre-computed using fixed functions,
00:31:05.500 | but they are actually embeddings that are learned during training.
00:31:09.480 | And they are of the same size of the embedding vector, of course, because they are summed
00:31:13.900 | together.
00:31:14.900 | So they have 768 dimensions in BERT-Base and 1024 in BERT-Large.
00:31:21.600 | But these positional embeddings are limited to 512 positions, which means that BERT cannot
00:31:29.920 | handle sentences longer than 512 tokens, because we only have 512 vectors to represent positions.
00:31:40.960 | And the linear layer head changes according to the application, so this linear layer here.
00:31:47.200 | So we saw before that BERT does not use the simple tokenizer that we have used, which only
00:31:53.100 | treats each word as a token, but it uses what's called the WordPiece tokenizer, which also
00:31:57.860 | allows sub-word tokens, so each word can become multiple tokens.
00:32:02.440 | The vocabulary size is roughly 30,000 tokens in both BERT-Base and BERT-Large.
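If you want to see the WordPiece tokenizer in action, here is a small example using the Hugging Face transformers library (assuming it is installed and can download the pretrained vocabulary; the exact sub-word splits depend on that vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)                                   # roughly 30,000 tokens
print(tokenizer.tokenize("Before my bed lies a pool of moonbright"))
# Rare or invented words are split into sub-word pieces, e.g. something like ['moon', '##bright'].
```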
00:32:08.000 | Ok, let's see the differences between BERT and language models like GPT and LLaMA.
00:32:14.140 | In my slide I call these kinds of models the common language models, so they are commonly
00:32:19.200 | known as language models, like GPT and LLaMA.
00:32:22.200 | So unlike them, BERT does not handle special tasks with prompts, but rather it can be specialized
00:32:29.040 | on a particular task by means of fine-tuning, and we will see what I mean by this in the
00:32:33.680 | next slide.
00:32:35.920 | The second difference is that BERT has been trained using the left context and the right
00:32:40.040 | context, so we will see also what we mean by this.
00:32:45.040 | BERT is not built specifically for text generation.
00:32:48.400 | So for example, you can use LLaMA to generate a big article given a prompt, but you cannot
00:32:52.840 | use BERT for this purpose.
00:32:54.960 | BERT is useful for other kind of tasks, and we will see which ones.
00:32:59.160 | And BERT has not been trained on the next token prediction task.
00:33:03.320 | So the model that we trained initially on the Chinese poems was trained on the next
00:33:08.080 | token prediction task, but BERT has not been trained on this particular task.
00:33:12.440 | BERT has been trained on the masked language model and the next sentence prediction task,
00:33:17.200 | and we will see both of them.
00:33:19.200 | Ok, so let's see how we handle different tasks in GPT or in LLaMA, and how we handle them
00:33:26.240 | in BERT.
00:33:27.600 | Suppose we want to do question answering.
00:33:30.320 | If we want to do it with GPT or with LLaMA, what we do is we build a particular prompt,
00:33:35.640 | and this is called few-shot prompting, in which we teach the model how to handle a task
00:33:43.480 | inside of the prompt, and then we leave the answer out of the last part of the prompt
00:33:50.760 | and we let the model answer using the next token.
00:33:55.760 | So for example, if we tell the model that given the context and the question how to
00:34:00.280 | build the answer, then given the context and the question the model should be able to come
00:34:06.520 | up with an answer that makes sense given the previous example.
00:34:11.280 | While in BERT we do not work with a prompt, like we do with ChatGPT or with LLaMA or
00:34:17.200 | with GPT, but we fine-tune BERT on the specific task we want to work on, and we will see how.
00:34:25.880 | As I said before, language models are models that predict the next token using only the
00:34:32.420 | left context of each word, that is, the tokens that come on the left side of each word,
00:34:38.000 | for predicting the next token.
00:34:39.920 | This is not the case with BERT.
00:34:41.880 | BERT uses the left context and the right context.
00:34:45.320 | So I want to give you some intuition in why also we humans may be using the left and the
00:34:50.800 | right context.
00:34:51.800 | Now, I am not really a linguist, so my intuition may not be technically valid, but it
00:34:58.000 | will help you understand the importance of left and right context also in human conversations.
00:35:04.040 | So let's start with the left context.
00:35:05.920 | The left context in human conversation is used every time we have a phone conversation.
00:35:09.920 | For example, the operator's answer is based on the user's input.
00:35:15.720 | So the user says something, then the operator will say something, then the user will reply
00:35:19.960 | with something based on what the operator said, and then the operator will continue
00:35:24.360 | based on the context given by the previous conversation.
00:35:28.280 | This is called using the left context.
00:35:31.120 | For right context, it's more difficult to visualize.
00:35:34.320 | For example, imagine there is a kid who just broke his mom's favorite necklace.
00:35:38.680 | The kid doesn't want to tell the truth to his mom, so he decides to make up a lie.
00:35:43.320 | So instead of saying directly to his mom, your favorite necklace has broken, the kid
00:35:48.640 | may say, "Mom, I just saw the cat playing in your room and your favorite necklace has
00:35:53.640 | broken."
00:35:54.640 | Or he may say, "Mom, aliens came through the window with laser guns and your favorite necklace
00:36:00.880 | has broken."
00:36:01.880 | As you can see, the kid conditioned the lie on what he wants to say next.
00:36:08.400 | So it doesn't even matter which lie the kid says, because he wants to create some context
00:36:16.200 | before arriving to the conclusion, which is, "Your favorite necklace has broken."
00:36:20.760 | So he conditions the word he chooses initially on what he want to say next.
00:36:27.040 | That is the definition of making up a lie.
00:36:29.840 | And this could be an intuition in how we humans use the right context.
00:36:35.960 | And I hope you also see it.
00:36:38.680 | Now let's talk about the pre-training of models.
00:36:41.520 | Because, for example, LLaMA or GPT have been pre-trained on a large corpus of text.
00:36:47.560 | And then we can use them with prompting or with fine tuning.
00:36:51.360 | BERT has not been pre-trained like LLaMA or GPT using the next token prediction task,
00:36:58.440 | but on two specific tasks called the Masked Language Model and the Next Sentence Prediction
00:37:03.440 | Task.
00:37:04.440 | Let's review them.
00:37:07.560 | The Masked Language Model is also known as the cloze task.
00:37:10.880 | And you may know it from some papers or tests that you have done at university.
00:37:16.300 | For example, the teacher gives you a sentence in which one word is missing and you have
00:37:20.780 | to fill the empty space with the missing word.
00:37:24.400 | And this is how BERT has been trained with the Masked Language Model.
00:37:28.080 | Basically what they did is they took some sentences from the corpus on which they
00:37:32.960 | were training BERT.
00:37:34.700 | They chose some tokens randomly and they replaced these random tokens with a special token
00:37:43.620 | called Mask.
00:37:45.300 | Then they feed this masked input to BERT and BERT has to come up with the word that was
00:37:53.940 | removed initially.
00:37:56.420 | One or more words.
00:37:58.060 | Let's see how it was done technically.
00:38:00.820 | So first we need to understand how BERT uses the left and the right context.
00:38:07.820 | So as you saw before, when we compute the attention we use the formula softmax of query
00:38:12.480 | multiplied by the transpose of the keys divided by the square root of dk and then we multiply
00:38:16.460 | it by v.
00:38:17.460 | This means that we take the query matrix, we multiply it by the transpose of the keys,
00:38:24.000 | we do the softmax and this will produce this matrix here that we saw before.
00:38:28.640 | But unlike before, like we did for Language Models, in this case we will not use any mask
00:38:35.940 | to cancel out the interactions of words with words that come after it.
00:38:41.920 | So for example before we replaced the value here, for example, with minus infinity and
00:38:46.640 | also this value with minus infinity and actually all the values above this diagonal with minus
00:38:51.440 | infinity because we didn't want the model to learn these interactions.
00:38:55.160 | But with BERT we do not use any mask, which means that each token attends tokens to its
00:39:01.960 | left and tokens to its right in a sentence.
00:39:05.600 | Ok, let's review the details of the Masked Language Model.
00:39:10.300 | So imagine we want to mask this sentence, so Rome is the capital of Italy, which is
00:39:15.160 | why it hosts many government buildings.
00:39:18.880 | The pre-training procedure selects 15% of the tokens from this sentence to be masked.
00:39:24.980 | If a token is selected, for example the word capital is selected, then 80% of the time
00:39:30.720 | it is replaced with the masked token, becoming this input for BERT, 10% of the time it's
00:39:36.880 | replaced with a random token, Zebra, and 10% of the time it's not replaced at all, so it
00:39:42.560 | remains as original.
00:39:44.640 | In any of these three cases, BERT has to predict the word that was masked out, so BERT should
00:39:49.920 | output capital for each of these three inputs, whichever of them it receives.
00:39:58.880 | So during training, what we do, for example, imagine this is our initial sentence in which
00:40:04.840 | we masked out the word capital like we saw before, this will result in an input to be
00:40:09.760 | fed to BERT of 14 tokens, if we count the tokens here, we feed it to BERT, BERT will
00:40:17.160 | produce an output because BERT is a transformer model so 14 tokens of input result in 14 tokens
00:40:23.800 | in the output.
00:40:25.480 | What we do is we only check BERT's output at the position that was masked out, so this token
00:40:30.640 | here; we compare it with our target, the word capital, and we compute the loss, and we
00:40:38.560 | run back propagation.
00:40:39.760 | So basically what we want is this token here to be capital.
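A rough sketch of the masking procedure just described: 15% of the tokens are selected, and of those, 80% are replaced with the mask token, 10% with a random token, and 10% are left unchanged. The special token ID below is made up for illustration:

```python
import random

MASK_ID = 103                                        # hypothetical [MASK] token ID
VOCAB_SIZE = 30000

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:              # token selected for prediction
            labels.append(tok)                       # BERT must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs.append(MASK_ID)               # 80% of the time: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(VOCAB_SIZE))   # 10%: replace with a random token ("zebra")
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(-100)                      # ignored by the loss (PyTorch cross_entropy's ignore_index)
    return inputs, labels
```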
00:40:47.560 | Okay now let's review the next sentence prediction task.
00:40:51.920 | Next sentence prediction task means that we have a text, we choose two sentences randomly,
00:41:00.040 | actually we select one sentence randomly, so the sentence A, and then 50% of the time
00:41:05.920 | we select its immediately next sentence, so the second line, or 50% of the time we select
00:41:12.400 | a random sentence from the rest of the text, in this case this one, the third line.
00:41:17.480 | In our case we selected the first line as sentence A, and the sentence B is the third
00:41:22.800 | line, so it's not the immediately following sentence.
00:41:26.880 | We feed both of these sentences to BERT, and BERT has to predict if the sentence B comes
00:41:32.380 | immediately after sentence A or not.
00:41:35.320 | In this case, because sentence B is the third line, and sentence A is the first line, it's
00:41:41.880 | not immediately following, so BERT should reply with "not next".
00:41:46.480 | In the case we had selected the second line as sentence B, BERT should reply with "is
00:41:51.840 | next".
00:41:52.840 | We have two problems here, how can we encode two sentences to become the input for BERT,
00:42:00.360 | and second problem is how can BERT tell us that it's the next sentence or it's not the
00:42:08.000 | next sentence.
00:42:09.800 | Well the idea is this, we take the two sentences and we encode them as only one sequence, it
00:42:17.600 | becomes one input, so the tokens of the two sentences become one input concatenated together,
00:42:23.920 | in which we prepend one special token called CLS, then the tokens of the first sentence,
00:42:30.160 | so suppose the sentence is "my dog is cute", then we add the token called separator, then
00:42:35.200 | the tokens of the second sentence, so "he likes playing", and then another separator token at the end.
00:42:43.580 | The problem is, if we feed only this input to BERT, BERT will not be able to understand
00:42:48.560 | that this "my" belongs to sentence A and this "likes" belongs to sentence B, so what we
00:42:54.640 | did before initially, as I told you before, when we feed the input to any language model
00:43:00.560 | that uses transformer, we first tokenize it, then we convert it into embedding vectors
00:43:06.080 | of size 512 in case the vanilla transformer or 768 in case of BERT, then we append another
00:43:13.720 | vector that represents the position of this token, so the position embeddings, in BERT
00:43:18.720 | we have another embedding called segment embedding, so it's another vector that we add to the
00:43:24.240 | position embedding and to the token embedding and represents the fact that this token belongs
00:43:29.920 | to sentence A or to sentence B, and this is how we encode the input for BERT.
00:43:36.760 | Let's see how we train it, so we create our input which is the first line of this poem
00:43:44.240 | and the third line of this poem together with the separator token we can see here and a
00:43:50.280 | special token called CLS here, and we also encode the information that all these tokens
00:43:57.360 | belong to sentence A and all these tokens belong to sentence B by using this special
00:44:03.920 | this one here, segment embedding we saw here.
00:44:07.480 | Now we feed this input to BERT, BERT will come out with an output because it's a transformer
00:44:15.760 | model so an input of 20 tokens corresponds to an output of 20 tokens.
00:44:21.480 | We take only the first token of the output, the one that corresponds to the token CLS
00:44:28.040 | which stands for classifier, we feed it to a linear layer with only two output features,
00:44:35.120 | one feature indicating next and one indicating not next, we apply the softmax, we compare
00:44:41.880 | it with our target so we expect BERT to say not next because we fed it the third line
00:44:48.680 | as sentence B and not the second line, and then we compute the loss which is the cross
00:44:54.120 | entropy loss and we run backpropagation to update the weights, and this is how we train
00:44:58.400 | BERT on the next sentence prediction task.
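And a sketch of the next sentence prediction head: take the output vector at the CLS position, project it to two features (is next / not next), and train with cross entropy. Here `encoder_output` is just a random stand-in for the output of a hypothetical BERT encoder:

```python
import torch
import torch.nn as nn

hidden = 768
nsp_head = nn.Linear(hidden, 2)                  # two output features: is-next / not-next

encoder_output = torch.randn(1, 20, hidden)      # stand-in for BERT's output (batch, 20 tokens, hidden)
cls_output = encoder_output[:, 0, :]             # the first position corresponds to the CLS token

logits = nsp_head(cls_output)                    # (1, 2)
target = torch.tensor([1])                       # e.g. 1 = "not next" in this made-up labeling
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                                  # backpropagation would update the head and the encoder
```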
00:45:02.600 | Now this CLS token, let's review how this works, so as we saw before the formula for
00:45:08.800 | the attention is the query multiplied by the transpose of the keys, divided by
00:45:12.760 | the square root of 768, we apply the softmax and this will produce this attention score
00:45:21.080 | matrix we can see here.
00:45:23.160 | Now as we can see the CLS token always interacts with all the other tokens because we didn't
00:45:29.000 | apply any causal mask, so we can consider that the CLS token acts as a token that captures
00:45:36.680 | the information from all the other tokens because the attention matrix here, we didn't
00:45:44.280 | apply any mask before applying the softmax, so all of these attention values will be actually
00:45:50.760 | learned by the model, and this is the idea behind this CLS token.
00:45:56.840 | So if we do for example the matrix multiplication that we did before, so let's compute the first
00:46:02.400 | row of the output matrix, let's see, so this matrix here is the input matrix which is 10
00:46:12.240 | by 768, because suppose the input is very simple, so before my bed lies a pool of moon
00:46:17.040 | bright, so 10 tokens of input, 1, 2, 3, etc, etc, so 10 tokens, each of them has 768 embeddings
00:46:27.840 | because we are talking about birth, 8, 768, so the first dimension, the first dimension,
00:46:36.980 | the first dimension up to 768, this will result in an output matrix of 10 by 768, so the first
00:46:45.480 | dimension up to 768, 1, 2, 768, etc.
00:46:52.960 | We are only interested in this output here, which corresponds to the position of the CLS
00:46:59.280 | token, let's see, the first dimension, so the dimension number 1 of this output token
00:47:05.840 | will be the dot product of this vector here, which is made of 10 dimensions, with the first
00:47:14.440 | column of this matrix here, which is also made of 10 dimensions because we have 10 tokens,
00:47:20.660 | but because none of the values here is 0 (actually here it is 0 because I chose random numbers,
00:47:26.000 | but suppose this is 0.03 and 0.04, let's say), because none of the values in this matrix
00:47:36.560 | is 0, the output for the CLS will be able to access the attention scores of all the tokens,
00:47:43.520 | so basically this token here will aggregate the attention scores, so the relationship
00:47:49.440 | with all the tokens, the CLS can also be thought of as the CEO in a company and you are the
00:47:56.800 | shareholder, when you are the shareholder, you don't ask the information to the employees,
00:48:00.640 | you ask to the CEO, and the CEO's job is to talk to every guy in the company, to every
00:48:07.720 | person in the company to get the necessary information to reach the goal, and this is
00:48:13.920 | the goal of the CLS, the CLS can be thought of as the aggregator of all the information,
00:48:19.020 | of all the information present inside of the sentence, and we use it to classify, that's
00:48:24.120 | why it's called the CLS token.
00:48:28.040 | Okay now let's talk about fine-tuning, as we saw before, BERT does not work with prompts
00:48:34.720 | like LLaMA or GPT, so we cannot use zero-shot prompting or few-shot prompting or chain of
00:48:42.240 | thought or any other prompting technique, with BERT we work with fine-tuning, so we
00:48:47.200 | take the pre-trained model, and if we want to do text classification, we fine-tune BERT
00:48:51.960 | on our dataset for text classification, or question answering, let's see how these two
00:48:56.640 | tasks work.
00:48:59.440 | Suppose we want to do text classification, so text classification is the task of assigning
00:49:05.320 | a label to a piece of text, for example, imagine you are running an internet provider and we
00:49:09.960 | receive complaints from our customers, we may want to classify requests coming from
00:49:15.060 | users as hardware problems, software problems or billing problems, for example, this complaint
00:49:21.360 | here is definitely a hardware problem, this complaint here is definitely a software problem
00:49:27.160 | and this one is definitely a billing problem, we want to classify automatically this request
00:49:33.360 | that we keep receiving from customers, how do we do that?
00:49:37.440 | Well we take our request, we feed it to BERT and BERT should tell us which one of the three
00:49:43.280 | options it best represents this particular request, how can BERT tell us one of these
00:49:51.960 | three options and how can we feed our request to BERT?
00:49:55.600 | Let's see.
00:49:57.000 | So when we train BERT for text classification, we create our input, we prepend to the request
00:50:03.440 | text the classifier token, so the CLS token, we feed it to BERT, BERT will come up with
00:50:10.840 | an output, so 16 input tokens correspond to 16 output tokens, we only care about the first
00:50:16.200 | one, which is the one corresponding to the CLS token, we send the output to a linear
00:50:22.640 | layer with three output features, because we have three possible classes, one is software,
00:50:28.280 | one is hardware, one is billing and then we apply Softmax, we compare it with what we
00:50:33.160 | expect BERT to learn about this particular request, that is that this request is hardware,
00:50:40.440 | then we calculate the loss, which is the cross entropy loss and finally we run backpropagation
00:50:44.460 | to update the weights and this is how we fine-tune BERT on text classification.
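A sketch of this text classification setup: a linear head with three output features (hardware / software / billing) on top of the CLS output, trained with cross entropy. `bert_output` below is a random stand-in for the output of a pretrained encoder; the Hugging Face transformers library offers a similar ready-made head as BertForSequenceClassification:

```python
import torch
import torch.nn as nn

hidden, num_classes = 768, 3                       # hardware / software / billing
classifier = nn.Linear(hidden, num_classes)

# Stand-in for the pretrained encoder's output on "[CLS] my router keeps rebooting ..."
bert_output = torch.randn(1, 16, hidden)           # (batch, 16 tokens, hidden)
cls_vector = bert_output[:, 0, :]                  # output at the CLS position

logits = classifier(cls_vector)                    # (1, 3)
label = torch.tensor([0])                          # 0 = hardware in this made-up label mapping
loss = nn.functional.cross_entropy(logits, label)
loss.backward()                                    # fine-tuning updates both the head and BERT's weights
```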
00:50:51.040 | Our next task is question answering.
00:50:54.920 | Question answering basically means this, we have a context from which we need to extract
00:51:00.360 | the answer to a particular question, for example the context is Shanghai is a city in China,
00:51:07.520 | it is also a financial center, it's fashion capital and industrial city, the question
00:51:12.200 | is what is the fashion capital of China, well the model should highlight the word or tell
00:51:17.920 | us which part of the context we can find the answer, so BERT should be able to tell us
00:51:24.720 | where the answer starts and where the answer ends in the context.
00:51:31.640 | But we have two problems, first we need to find a way for BERT to understand which part
00:51:36.120 | of the input is the context and which one is the question, second we need to find a
00:51:41.040 | way for BERT to tell us where the answer starts and where the answer ends, let's see both
00:51:47.040 | of these problems.
00:51:48.640 | The first problem can be solved easily using the segment embedding we saw before, so we
00:51:54.120 | concatenate the question and the context together as a single input with the separator token
00:52:00.220 | in the middle like we saw before for the next sentence prediction task, the question will
00:52:09.920 | be encoded as a sentence A while the context will be encoded as a sentence B, so this problem
00:52:17.840 | is solved.
00:52:20.100 | How do we get the answer from BERT, well let's see, so first we prepare the input for BERT,
00:52:27.920 | so we prepend the CLS token, what is the fashion capital of China, separator, Shanghai is a
00:52:34.200 | city in China etc, so this is the context which has been encoded as sentence B and the
00:52:38.720 | first part has been encoded as sentence A, we feed it to BERT, BERT will come up with
00:52:45.040 | the output which is 27 tokens because the input is made up of 27 tokens, we also know
00:52:51.960 | which tokens correspond to which sentence, so which correspond to the sentence A, which
00:52:57.520 | correspond to the sentence B because we give it as input, we apply a linear layer with
00:53:03.900 | two output features, one that indicates if one particular token is the start token and
00:53:11.040 | another feature that indicates if the token is an end token; we know where the answer is
00:53:18.720 | because we know the answer is the word Shanghai, so the start should be token 10
00:53:24.160 | and the end should also be token 10, then we calculate the loss based on our target and
00:53:29.600 | the output of this linear layer, we run backpropagation, and this is how we fine-tune BERT for question answering.
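And a sketch of the question answering head: a linear layer that produces two scores per token, one for "the answer starts here" and one for "the answer ends here"; the loss is the cross entropy over start positions plus the cross entropy over end positions. Sizes and the target position 10 follow the example above; `bert_output` is again a random stand-in for the encoder output:

```python
import torch
import torch.nn as nn

hidden = 768
qa_head = nn.Linear(hidden, 2)                     # per-token scores: [start, end]

bert_output = torch.randn(1, 27, hidden)           # stand-in for BERT's output on the 27 input tokens
scores = qa_head(bert_output)                      # (1, 27, 2)
start_logits, end_logits = scores[..., 0], scores[..., 1]   # each (1, 27)

start_target = torch.tensor([10])                  # the answer "Shanghai" starts at token 10
end_target   = torch.tensor([10])                  # ...and ends at token 10 (single-token answer)

loss = (nn.functional.cross_entropy(start_logits, start_target)
        + nn.functional.cross_entropy(end_logits, end_target))
loss.backward()
```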
00:53:35.480 | And this is it guys, I hope that you liked my video, I used a very unconventional
00:53:43.900 | way of describing BERT, that is, I started from language models, I introduced the concept
00:53:49.200 | of how language models work and then I introduced BERT because I wanted to create a comparison
00:53:54.280 | of how BERT works versus how other language models work so that you can appreciate the
00:54:00.440 | qualities and the weaknesses of both, BERT is actually not such a recent model, it was introduced
00:54:06.240 | in 2018 if I remember correctly, so it is quite aged but still very relevant for a lot
00:54:11.880 | of tasks and I hope that you will be coming again to my channel for more content so please
00:54:18.520 | subscribe and share this video if you like it, if you have any questions please write
00:54:23.140 | them in the comments, I am also very active on LinkedIn if you want to add me, if you want
00:54:29.840 | to have some particular video review, model review, write it in the comment and I hope
00:54:36.640 | that in my next video I will be able to code BERT from scratch, so using PyTorch so we
00:54:42.000 | can also learn, put to practice all the knowledge that we acquired in today's video, thank you
00:54:48.000 | again for coming to my channel guys and have a nice day!