Stanford XCS224U: NLU | Contextual Word Representations, Part 5: BERT | Spring 2023
Chapters
0:00 Intro
0:18 BERT: Core model structure
2:07 Masked Language Modeling (MLM)
3:43 BERT: MLM loss function
4:59 Binary next sentence prediction pretraining
5:44 BERT: Transfer learning and fine-tuning
6:55 Tokenization and the BERT embedding space
7:41 BERT: Core model releases
9:27 BERT: Known limitations
This is part five in our series on contextual representations. Let's start with the core model structure for BERT. The transformer architecture in BERT is essentially a combination of familiar elements at this point. I have our sequence, "the Rock rules," at the bottom here. All the way on the left here, we have a class token; that's an important token for the BERT architecture. We also have a hierarchical positional encoding. This won't be so interesting for our illustration, but for problems like natural language inference, a word appearing in a premise can be encoded differently from that same word when it appears in a hypothesis. We take the word embeddings and these positional encodings, and then we do an additive combination of them to get the input representations for the model.
Let's think about how we train this artifact. The core objective is masked language modeling, or MLM. The idea here is essentially that we're going to mask out or obscure the identities of some words in the sequence and then train the model to reconstruct them. For our sequence, we could have a scenario where we have masked out just a single token. That might be relatively easy as a reconstruction task. In this case, we have a special designated MASK token that we insert in the place of the token "rules." Then we try to get the model to a state where it can reconstruct that missing token using the full bidirectional context around that point. We could also corrupt a token by replacing it with some other word. In this case, we simply take the actual word and then try to have the model learn to predict it, again using the surrounding context, in order to do this reconstruction task. In practice, we mask out only a small percentage of all the tokens, mostly leaving the other ones in place, so that the model has enough context to predict the masked or missing or corrupted tokens. That restriction to a small percentage of the tokens is actually a limitation of the model, and one that ELECTRA in particular will seek to address.
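To make that recipe concrete, here is a minimal sketch of MLM-style corruption in Python, assuming a Hugging Face style tokenizer that exposes mask_token_id and a vocabulary size via len(tokenizer); the 80/10/10 split follows the procedure described in the BERT paper.

    import torch

    def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
        """Corrupt a batch of token ids for masked language modeling.

        Roughly 15% of positions are selected for prediction; of those, 80%
        become the [MASK] token, 10% become a random token, and 10% are left
        unchanged. Unselected positions get the label -100 so the loss
        ignores them.
        """
        input_ids = input_ids.clone()
        labels = input_ids.clone()

        # Sample which positions will be predicted.
        masked_indices = torch.bernoulli(
            torch.full(labels.shape, mlm_probability)).bool()
        labels[~masked_indices] = -100

        # 80% of the selected positions are replaced with [MASK].
        replace_mask = torch.bernoulli(
            torch.full(labels.shape, 0.8)).bool() & masked_indices
        input_ids[replace_mask] = tokenizer.mask_token_id

        # Half of the remaining selected positions (10% overall) become a
        # random vocabulary item; the rest keep the original token.
        random_mask = torch.bernoulli(
            torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replace_mask
        input_ids[random_mask] = torch.randint(
            len(tokenizer), labels.shape)[random_mask]

        return input_ids, labels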
Let's look at the MLM loss function. We're going to use the embedding representation of the actual token, together with the model's output state at the masked position, to score our prediction. In this case, we can use the entire surrounding context, not just the preceding context, to make this prediction. The other thing to notice here is that we do not apply this objective for tokens that we didn't mask out; the loss is defined only over the masked tokens or the ones that we have corrupted. The model still does the work of making predictions for all the time steps, but we get an error signal for the loss function only for the ones that we have designated as masked in some sense.
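Here is a small sketch of how that selective error signal can be implemented, assuming hidden states of shape (batch, sequence, dimension), a weight-tied embedding matrix used as the output layer, and the -100 labeling convention from the masking sketch above.

    import torch.nn.functional as F

    def mlm_loss(hidden_states, labels, embedding_matrix):
        """Cross-entropy over the vocabulary, computed only at masked positions."""
        # Score every position against every vocabulary item: (batch, seq, vocab).
        logits = hidden_states @ embedding_matrix.T
        # ignore_index=-100 means unmasked positions contribute no error signal,
        # even though the model produced predictions at every time step.
        return F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=-100)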
BERT is also pretrained with a binary next sentence prediction objective. In this case, we use our corpus resources to create pairs of sequences. For sequences that actually occurred next to each other in the corpus, the model should predict that the second one follows the first; for randomly paired sequences, it should predict that it does not. The model learns to make that distinction as part of learning how to reconstruct sequences. I think that's a really interesting intuition about how we might bring more document-level context into the transformer representations.
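As an illustration only, here is one way such training pairs might be constructed from a list of documents, each given as a list of sentence strings; the helper name and the 50/50 split are assumptions for the sketch, roughly following the BERT paper's description.

    import random

    def make_nsp_pairs(documents):
        """Build (sentence_a, sentence_b, is_next) examples for binary
        next sentence prediction."""
        pairs = []
        for doc in documents:
            for i in range(len(doc) - 1):
                if random.random() < 0.5:
                    # Positive example: the sentence that actually follows.
                    pairs.append((doc[i], doc[i + 1], 1))
                else:
                    # Negative example: a sentence from a randomly chosen
                    # document (a fuller implementation would avoid sampling
                    # from the same document).
                    other = random.choice(documents)
                    pairs.append((doc[i], random.choice(other), 0))
        return pairs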
When we think about transfer learning or fine-tuning, there are a few different approaches that we can take. Here's a depiction of the transformer architecture. The standard lightweight thing to do is to build your task parameters on top of the final output representation above the class token. Because the class token is the first token in every single sequence that BERT processes, its final output state can come to encode a lot of information about the corresponding sequence, so you add some task parameters on top of it and then maybe do some classification learning there. A standard alternative to this would be to pool together all of the output states and then build the task parameters on top of that mean pooling, or to use some other function of the output states to make predictions for your task. That can be very powerful as well because you bring in much more information about the entire sequence.
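Here is a minimal sketch of those two options in PyTorch, assuming a Hugging Face BERT encoder whose outputs expose last_hidden_state; the class name and classification head are illustrative.

    import torch.nn as nn
    from transformers import AutoModel

    class BertClassifier(nn.Module):
        def __init__(self, n_classes, pooling="cls", model_name="bert-base-cased"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.pooling = pooling
            self.head = nn.Linear(self.encoder.config.hidden_size, n_classes)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state                  # (batch, seq_len, hidden)
            if self.pooling == "cls":
                # Option 1: the output state above the [CLS] token.
                pooled = hidden[:, 0]
            else:
                # Option 2: mean-pool all output states, ignoring padding.
                mask = attention_mask.unsqueeze(-1).float()
                pooled = (hidden * mask).sum(1) / mask.sum(1)
            return self.head(pooled)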
Next, a few words about tokenization and the BERT embedding space. Remember that BERT has this tiny vocabulary and a correspondingly small embedding space. The reason it gets away with that is that it does WordPiece tokenization, which means that we have lots of these word pieces, indicated by these double hash marks here. The tokenizer doesn't keep unfamiliar words whole, but rather breaks them down into familiar pieces. The hope is that masked language modeling in particular will allow us to learn internal representations of the things we wanted to encode, even when they got spread out over multiple tokens.
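As a quick illustration, assuming the Hugging Face transformers library and the bert-base-cased checkpoint (the exact pieces you get will depend on the vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # Unfamiliar words are split into pieces; continuation pieces begin with "##".
    print(tokenizer.tokenize("The encyclopaedia snuffleupagus rules"))
    print(tokenizer.vocab_size)  # on the order of 30,000 entries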
Let's talk a little bit about core model releases. The original releases come in cased and uncased variants; I would recommend always using the cased ones at this point. Beyond the original base and large models, the Google team have worked to develop even smaller ones: we have tiny, mini, small, and medium as well. This is really welcome because it means you can do a lot of work with BERT even on a modest computational budget. BERT-tiny has just two layers, a small hidden dimensionality, and a relatively small expansion inside its feed-forward layer, for a total number of parameters of only about four million. That's a very small model, but it's surprising how much juice you can get out of it. From there you can move on up through mini, small, and medium to base, and finally to BERT-large, with a total number of parameters of around 340 million.
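If you want to compare these releases yourself, here is a small sketch that counts parameters; the checkpoint identifiers are my assumptions about how the releases are named on the Hugging Face hub, so verify them before relying on this.

    from transformers import AutoModel

    # Checkpoint names are assumptions; substitute the identifiers you actually use.
    for name in [
        "google/bert_uncased_L-2_H-128_A-2",  # the "tiny" release
        "bert-base-cased",
        "bert-large-cased",
    ]:
        model = AutoModel.from_pretrained(name)
        n_params = sum(p.numel() for p in model.parameters())
        print(f"{name}: {n_params / 1e6:.1f}M parameters")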
That's an important limitation that increasingly we're seeing addressed: there are by now, for example, released variants that go beyond these original configurations.
Let's close with some known limitations. First, the original BERT paper is admirably detailed, and the authors themselves point out some downsides of the approach. They say the first downside is that we're creating a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning. Remember, the mask token is a crucial element in training the model against the MLM objective. You introduce this foreign element into that phase, you presumably never see it when you do fine-tuning, and that could be dragging down model performance.
The second downside is that masked language modeling is not very data efficient. We do all this work of processing these sequences, and we can mask out only a tiny number of the tokens, because we need the bidirectional context to do the reconstruction.
There is a third limitation that I'll return to only at the end of this series. The observation is that BERT assumes the predicted tokens are independent of each other given the unmasked tokens, even though high-order, long-range dependency is prevalent in natural language. This is just the observation that if you do happen to mask out two tokens like "New" and "York" from the place name "New York," the model predicts those two tokens independently of each other, even though they are clearly interdependent. I'll mention later on how XLNet brings that dependency back in, possibly to very powerful effect.