Training BERT #2 - Train With Masked-Language Modeling (MLM)
Chapters
0:00 Intro
1:00 Setup
6:10 Masking
14:37 Data Loader
14:57 Dataset Object
21:07 Training
00:00:00.000 |
Okay, in this video what we're going to do is take a look at how we would train a model, 00:00:06.160 |
a transformer model, using masked-language modeling, or MLM. Now, we would typically use MLM when we 00:00:15.120 |
want to teach a transformer model like BERT to better understand the specific style of language 00:00:23.360 |
in our specific use cases. And it consists of taking an input sentence or sequence, 00:00:33.200 |
masking a few of the tokens within that input sequence, 00:00:37.200 |
and asking BERT to predict the words that we have masked. So this is pretty useful because we can 00:00:45.920 |
take any chunk of text and process it through a masking function and we can use that for training. 00:00:54.560 |
We don't need to get labeled data, which is really, really useful. So let's jump straight into it. 00:01:02.240 |
And what we first need to do is import everything we need. So we need 00:01:08.240 |
our tokenizer and model from transformers, and we also need to import PyTorch. So we do from transformers 00:01:16.320 |
import BertTokenizer and BertForMaskedLM. Then we also want to import torch. 00:01:31.040 |
And then what we want to do is initialize our tokenizer and model. So our tokenizer is 00:01:39.360 |
BertTokenizer.from_pretrained, and we're using the bert-base-uncased model. Let's copy that, 00:01:54.720 |
and our model will be pretty similar. So this time we're using BertForMaskedLM. 00:02:01.520 |
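For reference, the setup described so far amounts to just a few lines; this is a minimal sketch of it.

# Imports and setup as described above: tokenizer, model, and PyTorch
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')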
MaskedLM is just masked-language modeling, or MLM, that I mentioned before. So 00:02:09.600 |
that's great. Now I'm going to be training this on 00:02:15.920 |
a book that you can just get from the internet. It's Meditations by Marcus Aurelius. 00:02:24.400 |
The language in that is pretty unique, so I figure this is quite a good example. So I already have it 00:02:32.320 |
downloaded and I've cleaned it up a little bit. I will include a link to that clean version 00:02:38.480 |
of this so you can follow along if you want. So for me, of course I already have it downloaded 00:02:50.000 |
here, Meditations clean.txt, and we are reading that in 00:03:02.480 |
as fp, and all I need to do here is read. And what I've done is split each paragraph 00:03:17.360 |
within Meditations by a newline character. So I will just split by newline. And that should 00:03:26.800 |
get us what we want. Okay so we have this text now and what we want to do with this 00:03:34.800 |
is actually tokenize it. And this is just like we normally would with the transformers library. 00:03:44.960 |
So we have our tokenizer up here. And we just pass our text into that. Now we're using PyTorch 00:03:55.920 |
here so we want to return PyTorch tensors, pt. And we also need to set the maximum length, which for 00:04:08.400 |
this BERT model is 512. And then we need to set truncation to True, and padding equal to max 00:04:24.800 |
length. So this will either truncate or pad each one of these sentences to the length of 512 tokens. 00:04:37.040 |
This should be return_tensors. So there we go. And here we are. So we still have our input IDs. 00:04:49.120 |
We don't need to worry about token type IDs here. And we have our attention mask 00:04:54.000 |
which BERT just uses for calculating attention. I'm not going to really go into depth on any of that. 00:05:05.760 |
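Putting the last couple of steps together, the loading and tokenization look roughly like this; the filename meditations_clean.txt is an assumption, so point it at wherever your cleaned copy of the book lives.

# Load the cleaned text (filename assumed) and tokenize it as described above
with open('meditations_clean.txt', 'r', encoding='utf-8') as fp:
    text = fp.read().split('\n')  # one paragraph per list item

inputs = tokenizer(text,
                   return_tensors='pt',   # PyTorch tensors
                   max_length=512,        # BERT's maximum sequence length
                   truncation=True,
                   padding='max_length')  # pad every sequence up to 512 tokens
# inputs now contains input_ids, token_type_ids, and attention_mask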
Now as I said before we need two things for training our BERT model here. We need the 00:05:13.840 |
input IDs which will have a mask token. Now we haven't created that mask token yet. 00:05:19.440 |
And we also have our output labels which will not include that mask token. So before we mask 00:05:28.880 |
our input IDs we need to create a copy of that which we will use as our labels. So we write 00:05:37.600 |
inputs['labels'], and we set that equal to inputs['input_ids'], so our input IDs tensor. And we clone 00:05:50.400 |
that by first detaching it and then cloning it. And that's all we need. So have a look at inputs 00:06:01.600 |
again. Now we have input IDs at the top, and if we go down to the bottom we have a copy of those in this labels tensor. 00:06:07.360 |
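That copy is just a single line; a quick sketch:

# Keep an unmasked copy of input_ids to use as the labels
inputs['labels'] = inputs.input_ids.detach().clone()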
Now what we need to do is create our mask. So with BERT, when they are 00:06:17.600 |
pre-training BERT they use a few rules. But at the core of that the main rule is that each token 00:06:25.760 |
that is not a special token has a 15% chance of being masked. So when I say special token I mean 00:06:35.440 |
the separator and classifier tokens which look like this. And I'll point those out in a minute. 00:06:42.320 |
In fact we can have a look here. This is our classifier token. This 101. And you see that at 00:06:48.960 |
the start of every sequence. And then at the end here we also have padding tokens. We also don't 00:06:53.360 |
want to mask those. So to create that 15% probability for each token what we do is use 00:07:01.920 |
the torch.rand function. And we use this to create a tensor of floats that has equal dimensions 00:07:10.720 |
to our input IDs here, the inputs['input_ids'] tensor. Like so. And if we check the shape of rand we see 00:07:24.640 |
this 507 which is the number of sequences we have. And 512 which is the number of tokens that each 00:07:30.000 |
sequence has. So if we were to just take this we'd see we get the same. Okay. Now we have a look in 00:07:40.720 |
there. It's just a set of floats from the value 0 up to 1. Now what we want to do is mask roughly 00:07:51.600 |
15% of these. Or give each one of those a 15% probability of being masked. And the way we do 00:07:57.200 |
that is mask anything that is under the value 0.15. So for example these ones here they will be masked. 00:08:05.280 |
Whereas these ones up here will not be masked. To do that all we write is rand and we do less than 00:08:17.520 |
0.15. Now if we have a look at the mask array, we see that the values that were less than 00:08:27.920 |
0.15 have this True value, which is what we'll be using to mask our tokens later on. But at the same 00:08:35.680 |
time if you remember the classifier token is always in the first position within each tensor. 00:08:45.360 |
So here we would have a classifier token. Here too. And in fact all of these would also be padding 00:08:51.520 |
tokens. We don't want to mask any of those. So what we do to avoid that is we add some extra logic. 00:08:59.520 |
So I'll put that in brackets but actually we're going to just first test the logic so I can show 00:09:09.040 |
you what it's actually doing. So we have our inputs['input_ids']. Okay, so these are the padding 00:09:16.720 |
tokens, these are the classifier tokens. And what we do is just say inputs['input_ids'] not equal to 00:09:28.240 |
101, which is our classifier token. Okay, and now you see that we get a False wherever there is a 00:09:34.320 |
classifier token. And we want to do the same but for our padding. And to do that we multiply that 00:09:44.400 |
so that is essentially adding it to the logic here. It's like an and statement. And now we are 00:09:56.800 |
removing the padding tokens from that mask. And there's one more. We can't see it here but it's 00:10:04.080 |
also the separator token, which is represented by the token ID 102. So we also include that 00:10:11.280 |
in here as well. Now all of these together we want to add these onto the logic up here. 00:10:21.120 |
Okay, and now we will get our mask array. You see now we have False wherever we had 00:10:30.160 |
the padding tokens, we have False wherever we had the classifier tokens, but we still have 00:10:34.960 |
a few masked positions in there. So we have these True values here. Okay, so that's our mask array. 00:10:47.840 |
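Pulling that logic together, the mask array is built roughly like this; 101, 102, and 0 are the [CLS], [SEP], and padding token IDs for bert-base-uncased, and multiplying the boolean tensors acts like an AND.

# ~15% of positions get True, but never special or padding tokens
rand = torch.rand(inputs.input_ids.shape)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * \
           (inputs.input_ids != 102) * (inputs.input_ids != 0)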
And now what we want to do is take the indices of each true value within each one of these 00:10:58.240 |
rows of the tensor. Now let's first do that with just one of them so you can see how it works. So 00:11:07.360 |
we take this one. Check the shape. It should be 512. Yeah so this is just one row here. 00:11:17.440 |
And what we'll do is we'll say nonzero, and this will return the indices where we have 00:11:28.640 |
non-zero values, e.g. the True values. 00:11:34.720 |
But this is like a vector, so what we want to do here is flatten that. So we do torch.flatten. 00:11:43.200 |
Now we get almost a list but it's still a tensor. We want an actual list, so we just write .tolist(). 00:11:51.520 |
Okay, so now these are the index positions for the True values within this first row. But we 00:12:00.560 |
want to do that for every row, and to do that all we do here is just use a for loop. 00:12:06.720 |
So we initialize our selection list here, and we say for, call it row, or for i in range of 00:12:20.800 |
mask_arr.shape[0]. So mask_arr.shape[0], let me just show you. 00:12:28.640 |
It's the 507 rows that we have. We want to do selection.append 00:12:35.920 |
and then we already have our logic here, so we want to append 00:12:43.040 |
this. But we're going to append this for every single one of those rows. 00:12:58.160 |
And let's have a look at what we get in the selection. So we'll just have a look at the first 00:13:10.960 |
Ah, sorry, I just need to replace that with i. There we go. So now we have indices for the first 00:13:18.560 |
five rows here. We have it for all of them of course but we're showing you the first five. 00:13:22.160 |
And there we go, that's what we want. 00:13:26.160 |
And then what we want to do is we can just copy this. 00:13:33.680 |
We want to set the values at each one of these indices equal to 103, which is our mask token, 00:13:43.840 |
within each row of our input IDs tensor. So we go inputs['input_ids']. 00:13:52.880 |
Then here we need to select those specific values. 00:13:59.840 |
And that is at row i followed by selection. So a selection of indices 00:14:06.320 |
at i as well. And we set those equal to 103 like so. 00:14:15.600 |
And now let's have a look at what we have in our input ids tensor. 00:14:21.040 |
So now we can see we have these mask tokens where we saw the True values before in our mask array. 00:14:30.480 |
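As a sketch, the selection loop and the masking step together look something like this; 103 is the [MASK] token ID for bert-base-uncased.

# Collect the indices of the True values in each row of mask_arr,
# then overwrite those positions in input_ids with the [MASK] token (103)
selection = []
for i in range(mask_arr.shape[0]):
    selection.append(torch.flatten(mask_arr[i].nonzero()).tolist())

for i in range(mask_arr.shape[0]):
    inputs.input_ids[i, selection[i]] = 103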
And we haven't touched any of the classifier or the padding tokens or the 00:14:34.080 |
separator tokens, which are in there as well. Now our tensors here are in the correct format 00:14:41.520 |
but we still need to process them through something called a data loader during training. 00:14:48.480 |
Now to process them through a data loader we need to convert them into a PyTorch data set object. 00:14:58.000 |
And to do that what we're going to do is create a class here which will 00:15:02.720 |
handle this for us. So it's going to be MeditationsDataset. 00:15:06.320 |
And to create the dataset object we need to pass the torch.utils.data.Dataset class into here. 00:15:24.880 |
Now there's a few things we need here. We need the initialization function 00:15:30.880 |
which is just __init__, and we pass self and encodings. 00:15:40.160 |
And here we're just going to assign encodings to an attribute within this class. 00:15:52.160 |
So self.encodings equals encodings. Now the data loader expects two additional functions 00:16:00.960 |
or methods, that is the __getitem__ method and the __len__ method. The __len__ method is so that you can 00:16:08.720 |
check the length of the dataset that it's looking at, and __getitem__ is so that you can get a 00:16:15.360 |
dictionary-formatted batch of those items. So for __getitem__ we write this, and we need self and then 00:16:26.160 |
we also specify the index. And what we do is we return a dictionary, and this is just going to 00:16:36.080 |
pass back, so we have the input IDs key, the labels key, the attention mask key, and the token type 00:16:44.160 |
IDs key. It's going to pass those back to the data loader when it requests this __getitem__ method. 00:16:51.440 |
So we write torch.tensor and we pass the values at the index, for key, 00:17:04.720 |
val in self.encodings.items(). So that should be okay. 00:17:17.840 |
And the only thing left is the __len__ method. So we define __len__, and here there's no input 00:17:29.520 |
parameters. All we need to do is return the length of our dataset. So it's return 00:17:36.480 |
len, and we're doing self.encodings, and then we can just use any of the 00:17:44.400 |
tensors that we have in there, but we'll do input IDs. 00:17:53.280 |
And we could even modify this to use shape[0] and get rid of 00:18:00.000 |
the len there, but I'll just stick with len for now. 00:18:05.440 |
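Put together, the dataset class just described looks roughly like this:

# Wraps our encodings so a PyTorch DataLoader can request batches from them
class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # return one dictionary-formatted sample at position idx
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)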
So that is our class which will handle the formatting of our data into a data set object 00:18:19.760 |
and all we need to do is write dataset. So this is going to be our new dataset variable. 00:18:24.640 |
We have MeditationsDataset, our class, and in here we just pass our encodings, or our inputs. 00:18:32.720 |
Like so. Okay and now we can initialize our data loader which PyTorch will be using to load 00:18:46.800 |
our data during training. So we write dataloader equals torch.utils.data.DataLoader. 00:18:57.280 |
Here we want to pass our dataset, and then we also want to specify our batch size. So I'm going to go 00:19:08.240 |
with 16. You can modify this depending on your GPU or your computer, however much 00:19:18.080 |
memory you have. And then we also want to shuffle the data within there as well, so that we're not 00:19:25.920 |
extracting, say, the first 16 paragraphs all at once; we'll actually be extracting 16 paragraphs from random parts of the book. 00:19:31.920 |
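So the dataset and data loader setup is roughly:

# Build the dataset object from our encodings, then the data loader
dataset = MeditationsDataset(inputs)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)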
Okay, now we're ready to move on to actually training. So first 00:19:42.160 |
we need to set up all the training parameters. So we first want to move the model to 00:19:53.840 |
GPU, if you have a GPU, and we check if we have a GPU using, let me show you first, torch.device 00:20:06.320 |
'cuda' if torch.cuda.is_available(), else torch.device('cpu'). So we're saying here, if we have a 00:20:24.240 |
CUDA-enabled GPU, use that, otherwise we just use the CPU. I can see here that I do have it, so we have 00:20:34.480 |
this device type CUDA. And what we'll do is assign that to the device variable here, and we use that 00:20:41.280 |
to move our model and everything across to that device. And we do that using model.to(device), and 00:20:50.320 |
we should get a big output here. We get all this information, but we don't need to look into that. 00:20:58.800 |
Now we need to put our model into training mode, so we just do model.train() 00:21:06.240 |
to make sure it's ready. And the final thing before we set up our actual training loop 00:21:14.480 |
is we need to initialize our optimizer. We're going to be using Adam with weight decay here. 00:21:19.680 |
So that's the Adam optimizer with weight decay. Weight decay just reduces the chance of 00:21:25.440 |
overfitting, especially with big models like transformer models. So we're going to do from 00:21:30.720 |
transformers import AdamW, and our optimizer is going to be AdamW. We pass in our model parameters 00:21:46.240 |
(so model.parameters(), with the brackets at the end there), and we also need to pass in a learning rate, which we'll set to 1e-5. 00:21:55.200 |
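A sketch of this training setup; it assumes the AdamW implementation shipped with transformers, as used here (in newer versions of the library you may prefer torch.optim.AdamW instead).

from transformers import AdamW

# Move the model to GPU if one is available, otherwise stay on CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()  # put the model into training mode

# Adam with weight decay and a small learning rate
optim = AdamW(model.parameters(), lr=1e-5)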
Okay, now we're fully set up and we can actually 00:22:03.440 |
begin training which is set up as a normal training loop in PyTorch. 00:22:08.240 |
And the first thing I want to do is just import tqdm. This allows us to create a progress bar 00:22:17.360 |
during training; otherwise we're just sat there and we don't see any updates on training, which we 00:22:23.520 |
don't want, obviously. I'm going to say we do two epochs. You can obviously modify this as you 00:22:31.920 |
want. We're just seeing how this all works, so I'm not going to train it that much. 00:22:37.120 |
And you want to be careful of training transformer models for too many epochs. They overfit very 00:22:43.520 |
easily. And we'll do for epoch in range(epochs). And then here we want to set up our training loop. 00:22:56.880 |
So to do that we want to wrap it within a tqdm function there, and we just pass our data loader, 00:23:07.280 |
which, what did I call it, dataloader up here. And then leave equals True. This just leaves 00:23:14.000 |
the progress bar rather than replacing it with every new epoch. And then we run through each 00:23:21.600 |
batch within our loop. So this is our batches of 16 items at a time. And we first want to initialize 00:23:35.120 |
our calculated gradients. So with every loop we will calculate gradients, and we don't 00:23:44.160 |
want to start with gradients already calculated. We want to initialize them or 00:23:48.640 |
set them to zero. So we do optim.zero_grad(). Then we want to pull all of our tensors that we 00:23:56.960 |
require for training. So input IDs, of course, is the first one. And that will be equal to batch. 00:24:05.920 |
Then in here we access our input IDs. And additionally, you see before we moved our 00:24:14.560 |
model to our GPU, we also want to do that for our tensors here as well. So we say .to(device). 00:24:26.640 |
Okay, and we follow this structure for our other tensors as well. Now for masked-language 00:24:34.320 |
modeling we don't need to do anything with token type IDs, so we just ignore those. 00:24:44.320 |
And we also have our labels, which we do need, of course. 00:24:51.840 |
And with that we can process everything. So now we do outputs equals model, and we pass our 00:25:01.440 |
input IDs. Then we want to specify the attention mask, so we just copy that, 00:25:09.200 |
and we also need to specify our labels which is labels. Okay now let's just extract the 00:25:21.600 |
loss from those outputs. So we get a loss tensor there and what we do here is we use the backward 00:25:30.800 |
method, which calculates the gradient of the loss for every parameter in our model. And from that we can apply the 00:25:42.080 |
gradient update using our optimizer. So using that we have optim and we call .step(). And this will 00:25:51.040 |
take a step to optimize all of the weights within our model based on the loss. Now the final little bit 00:25:59.840 |
here is just, you know, aesthetics. In our loop, I want to actually see some bits of 00:26:05.840 |
information. So all I do is loop.set_description, and here I just want to show the epoch, 00:26:17.280 |
which is just epoch. And then I also want to see the loss in the postfix. So we do loop.set_postfix 00:26:28.400 |
and we do loss.item(). So .item() here just pulls out the exact value within that loss tensor up here. 00:26:43.840 |
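Putting the whole training loop together, it looks roughly like this:

from tqdm import tqdm

epochs = 2
for epoch in range(epochs):
    # wrap the data loader in tqdm so we get a progress bar per epoch
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optim.zero_grad()  # reset gradients from the previous step
        # move the batch tensors to the same device as the model
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass - with labels given, the model also returns the MLM loss
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()  # compute gradients for every parameter
        optim.step()     # update the weights
        # show the current epoch and loss on the progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())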
Okay that should be everything. Let's go see what we have. There we go. So now we're training. 00:26:51.760 |
See, the loss is going down slowly, and that's it. So we're now training our transformer model using 00:27:02.800 |
Meditations by Marcus Aurelius with masked-language modeling. It's really not that hard. I mean, 00:27:11.040 |
there is quite a bit to it, but I think once you do it, it's reasonably straightforward. 00:27:16.880 |
And the fact that you can do this on basically any set of text using just a masking function is 00:27:26.000 |
incredibly useful. So we don't need to, you know, go out looking for labeled data anywhere, which 00:27:32.800 |
is amazing. So that's it for this video. I hope it's been useful. I know it's a bit of a 00:27:40.480 |
long one but thank you very much for watching and I will see you again in the next one.