
Training BERT #2 - Train With Masked-Language Modeling (MLM)


Chapters

0:00 Intro
1:00 Setup
6:10 Masking
14:37 Data Loader
14:57 Dataset Object
21:07 Training

Transcript

Okay, in this video what we're going to do is take a look at how we would train a model, a transformer model, using masked-language modeling, or MLM. Now, we would typically use MLM when we want to teach a transformer model like BERT to better understand the specific style of language in our specific use cases.

And it consists of taking an input sentence or sequence, masking a few of the tokens within that input sequence, and asking BERT to predict the words that we have masked. So this is pretty useful because we can take any chunk of text and process it through a masking function and we can use that for training.

We don't need to get labeled data, which is really, really useful. So let's jump straight into it. And what we first need to do is import everything we need. So we need our tokenizer and model from transformers, and we also need to import PyTorch. So we do from transformers import BertTokenizer and BertForMaskedLM.

Then we also want to import torch. And then what we want to do is initialize our tokenizer and model. So our tokenizer is a BertTokenizer from_pretrained, and we're using the bert-base-uncased model. Let's copy that, and our model will be pretty similar, this time using BertForMaskedLM.
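
As a rough sketch, the setup described so far looks something like this (the variable names are my own):

    from transformers import BertTokenizer, BertForMaskedLM
    import torch

    # load the pretrained tokenizer and the BERT model with an MLM head
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')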

MaskedLM is just masked-language modeling, or MLM, that I mentioned before. So that's great. Now, I'm going to be training this on a book that you can just get from the internet: Meditations by Marcus Aurelius. The language in it is pretty unique, so I figure this is quite a good example.

So I already have it downloaded, and I've cleaned it up a little bit. I will include a link to that clean version so you can follow along if you want. So for me, of course, I already have it downloaded here, meditations clean.txt. And we are reading that in.

So with our file object, fp, all I need to do here is read. And what I've done is split each paragraph within Meditations by a newline character, so I will just split by newline. And that should get us what we want. Okay, so we have this text now, and what we want to do with this is actually tokenize it.
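
Before moving on to tokenization, here is a minimal sketch of the read-and-split step, assuming the cleaned file is saved locally under a name like meditations_clean.txt:

    # read the cleaned text and split it into paragraphs (one per line)
    with open('meditations_clean.txt', 'r', encoding='utf-8') as fp:
        text = fp.read().split('\n')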

And this is just like we normally would with the transformers library. So we have our tokenizer up here, and we just pass our text into that. Now, we're using PyTorch here, so we want to return PyTorch tensors, pt. And we also need to set the maximum length, which for this BERT model is 512.

And then we need to set truncation to True, and padding equal to max_length. So this will either truncate or pad each one of these sequences to a length of 512 tokens. This should be return_tensors. So there we go. And here we are. So we still have our input IDs.
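
The tokenizer call itself looks roughly like this:

    # tokenize, padding or truncating every paragraph to 512 tokens
    inputs = tokenizer(text, return_tensors='pt', max_length=512,
                       truncation=True, padding='max_length')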

We don't need to worry about token type IDs here. And we have our attention mask, which BERT just uses for calculating attention. I'm not going to really go into depth on any of that. Now, as I said before, we need two things for training our BERT model here. We need the input IDs, which will have a mask token.

Now, we haven't created that mask token yet. And we also have our output labels, which will not include that mask token. So before we mask our input IDs, we need to create a copy of them, which we will use as our labels. So we write inputs labels, and we set that equal to inputs input IDs.

So, our input IDs tensor. And we clone that by first detaching it and then cloning it. And that's all we need. So have a look at inputs again. Now we have input IDs at the top, and if we go down to the bottom we have a copy of those in this labels tensor.
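
As a one-line sketch of that copy step:

    # copy input_ids to use as the MLM labels, before any masking is applied
    inputs['labels'] = inputs['input_ids'].detach().clone()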

Now what we need to do is create our mask. So with BERT when they are pre-training BERT they use a few rules. But at the core of that the main rule is that each token that is not a special token has a 15% chance of being masked. So when I say special token I mean the separator and classifier tokens which look like this.

And I'll point those out in a minute. In fact we can have a look here. This is our classifier token. This 101. And you see that at the start of every sequence. And then at the end here we also have padding tokens. We also don't want to mask those.

So to create that 15% probability for each token, what we do is use the torch.rand function. And we use this to create a tensor of floats that has equal dimensions to our input IDs tensor here. Like so. And if we check the shape of rand, we see this 507, which is the number of sequences we have.

And 512 which is the number of tokens that each sequence has. So if we were to just take this we'd see we get the same. Okay. Now we have a look in there. It's just a set of floats from the value 0 up to 1. Now what we want to do is mask roughly 15% of these.

Or give each one of those a 15% probability of being masked. And the way we do that is mask anything that is under the value 0.15. So, for example, these ones here will be masked, whereas these ones up here will not be masked. To do that, all we write is rand less than 0.15.
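
A sketch of those two steps, with mask_arr as my name for the resulting mask array:

    # random floats in [0, 1) with the same shape as input_ids
    rand = torch.rand(inputs['input_ids'].shape)
    # True wherever a token should be masked (~15% of positions)
    mask_arr = rand < 0.15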

Now, if we have a look at the mask array, we see that the values that were less than 0.15 now have this True value, which is what we'll be using to mask our tokens later on. But at the same time, if you remember, the classifier token is always in the first position within each tensor.

So here we would have a classifier token. Here too. And in fact, all of these would also be padding tokens. We don't want to mask any of those. So what we do to avoid that is add some extra logic. I'll put that in brackets, but we're actually going to first test the logic so I can show you what it's doing.

So we have our inputs input IDs. Okay, so these are the padding tokens, and these are the classifier tokens. And what we do is just say inputs input IDs not equal to 101, which is our classifier token. Okay, and now you see that we get a False wherever there is a classifier token.

And we want to do the same but for our padding. And to do that we multiply, so that is essentially adding it to the logic here; it's like an and statement. And now we are removing the padding tokens from that mask. And there's one more. We can't see it here, but it's the separator token, which is represented by the token ID 102.

So we also include that in here as well. Now, all of these together, we want to add onto the logic up here. Okay, and now we will get our mask array. You see, now we have False wherever we had the padding tokens, and False wherever we had the classifier tokens.
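
Putting that logic together might look like the following; note that 0 is the padding token ID for bert-base-uncased, and the exact bracketing here is my own:

    # keep 15% chance of masking, but never mask [CLS] (101), [SEP] (102) or [PAD] (0)
    mask_arr = (rand < 0.15) * (inputs['input_ids'] != 101) * \
               (inputs['input_ids'] != 102) * (inputs['input_ids'] != 0)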

But we still have a few masked positions in there. So we have these True values here. Okay, so that's our mask array. And now what we want to do is take the indices of each True value within each one of these rows of the tensor. Now, let's first do that with just one of them so you can see how it works.

So we take this one and check the shape. It should be 512. Yeah, so this is just one row here. And what we'll do is say nonzero, and this will return the indices where we have non-zero values, i.e. the True values. But this comes back as a column vector, so what we want to do here is flatten it.

So we do torch.flatten. Now we get almost a list, but it's still a tensor. We want an actual list, so we just write tolist. Okay, so now these are the index positions for the True values within this first row. But we want to do that for every row, and to do that, all we do here is use a for loop.

So we initialize our selection list here, and we say for, call it row, or for i in mask array shape zero. So mask array shape zero, let me just show you, is the 507 rows that we have. We want to do selection append, and we already have our logic here, so we want to append this.

But we're going to append this for every single one of those rows. So, sorry, let's add a range on here. And let's have a look at what we get in the selection. We'll just have a look at the first few; let's go for the first five.

Ah, sorry, I just need to replace that with i. There we go. So now we have indices for the first five rows here. We have it for all of them, of course, but we're showing the first five. And there we go, that's what we want.
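
A sketch of that loop, using the same mask_arr name as above:

    # for each row, collect the indices of the True (to-be-masked) positions
    selection = []
    for i in range(mask_arr.shape[0]):
        selection.append(
            torch.flatten(mask_arr[i].nonzero()).tolist()
        )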

And then what we want to do is, we can just copy this, set the values at each one of those indices equal to 103, which is our mask token, within each row of our input IDs tensor. So we go inputs input IDs, and then here we need to select those specific values.

And that is at row i, followed by the selection of indices at i as well. And we set those equal to 103, like so. And now let's have a look at what we have in our input IDs tensor. So now we can see we have these mask tokens where we saw the True values before in our mask array.
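
The masking assignment itself might look roughly like this (I'm writing it as a second loop, which may differ slightly from the exact layout in the video):

    # replace the selected positions with the [MASK] token ID (103)
    for i in range(mask_arr.shape[0]):
        inputs['input_ids'][i, selection[i]] = 103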

And we haven't touched any of the classifier tokens, the padding tokens, or the separator tokens, which are in there as well. Now, our tensors here are in the correct format, but we still need to process them through something called a data loader during training. And to process them through a data loader, we need to convert them into a PyTorch Dataset object.

And to do that, what we're going to do is create a class here which will handle this for us. So it's going to be MeditationsDataset. And to create the dataset object, we need to pass the Dataset class into here. So this is torch.utils.data.Dataset.

Now, there are a few things we need here. We need the initialization function, which is just init, and we pass self and encodings. And here we're just going to assign encodings to an attribute within this class: encodings equals encodings. Now, the data loader expects two additional functions, or methods: the get item method and the length method.

The length method is so that it can check the length of the dataset it's looking at, and the get item method is so that it can get a dictionary-formatted batch of those items. So for get item, we write this, and we need self, and then we also specify the index.

And what we do is return a dictionary. So we have the input IDs key, the labels key, the attention mask key, and the token type IDs key. It's going to pass those back to the data loader when it requests this get item method.

So we write torch.tensor, and we pass the values and the index of those values, for key, val in self encodings dot items. So that should be okay. And the only thing left is the length method. So we define length, and here there are no input parameters. All we need to do is return the length of our dataset.

So it's return len, and we're doing self encodings, and then we can just use any of the tensors that we have in there, but we'll do input IDs. We could even modify this to be shape zero and get rid of the len there, but I'll just stick with len for now.
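
Put together, a sketch of that class under the structure described above:

    class MeditationsDataset(torch.utils.data.Dataset):
        def __init__(self, encodings):
            # store the tokenized inputs (input_ids, labels, attention_mask, ...)
            self.encodings = encodings

        def __getitem__(self, idx):
            # return one dictionary-formatted sample per index
            return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

        def __len__(self):
            # length of the dataset = number of sequences
            return len(self.encodings['input_ids'])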

So that is our class, which will handle the formatting of our data into a dataset object, and all we need to do is write dataset. So this is going to be our new dataset variable. We have MeditationsDataset, our class, and in here we just pass our encodings, or our inputs.

Like so. Okay, and now we can initialize our data loader, which PyTorch will be using to load our data during training. So we write data loader equals torch.utils.data.DataLoader. Here we want to pass our dataset, and then we also want to specify our batch size.

So I'm going to go with 16. You can modify this depending on your GPU or your computer, however much memory you have. And then we also want to shuffle the data within there as well, so that we're not extracting, say, the first 16 paragraphs all at once.
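
A sketch of the dataset and data loader setup with those settings (dataloader is my own variable name):

    dataset = MeditationsDataset(inputs)
    # shuffle so each batch of 16 comes from random parts of the book
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)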

We're actually going to be extracting 16 from random parts of the book. Okay, now we're ready to move on to actually training. So first we need to set up all the training parameters. We first want to move the model to the GPU, if you have a GPU, and we check if we have a GPU using, let me show you first, torch device CUDA.

If torch CUDA is available, else torch device CPU. So we're saying here, if we have a CUDA-enabled GPU, use that; otherwise we just use the CPU. I can see here that I do have it, so we have this device type CUDA. And what we'll do is assign that to the device variable here, and we use that to move our model and everything across to that device.

And we do that using model to device, and we should get a big output here. We get all this information, but we don't need to look into that. Now we need to activate our model's training mode, so we just do model train to make sure it's ready.
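
A sketch of that device setup:

    # use a CUDA-enabled GPU if one is available, otherwise fall back to CPU
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    # put the model into training mode
    model.train()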

And the final thing before we set up our actual training loop is to initialize our optimizer. We're going to be using Adam with weight decay here, so that's the AdamW optimizer. Weight decay just reduces the chance of overfitting, especially with big models like transformer models.

So we're going to do from transformers import AdamW, and our optimizer is going to be AdamW. We pass in our model parameters, and we also need to pass in a learning rate, which we'll set to 1e-5. So model parameters, brackets at the end there. Okay. Now we're fully set up, and we can actually begin training, which is set up as a normal training loop in PyTorch.
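
Roughly, the optimizer setup looks like this; note that if your transformers version no longer ships AdamW, torch.optim.AdamW is the usual drop-in replacement:

    from transformers import AdamW

    # Adam with weight decay, with a small learning rate for fine-tuning
    optim = AdamW(model.parameters(), lr=1e-5)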

And the first thing I want to do is just import tqdm. This allows us to create a progress bar during training; otherwise we'd just sit there without seeing any updates on training, which we don't want, obviously. So we do two epochs. You can obviously modify this as you want.

We're just seeing how this all works, so I'm not going to train it that much. And you want to be careful of training transformer models for too many epochs; they overfit very easily. And we'll do for epoch in range epochs. And then here we want to set up our training loop.

So to do that, we want to wrap it within a tqdm function there, and we just pass our data loader, which, what did I call it, data loader up here. And we set leave equal to True. This just leaves the progress bar rather than replacing it with every new epoch. And then we run through each batch within our loop.

So this is our batches of 16 items at a time. And we first want to initialize our calculated gradients. With every loop we will calculate gradients, and we don't want to start with gradients already calculated; we want to initialize them, or set them to zero. So we do optim zero grad.

Then we want to pull out all of the tensors that we require for training. So input IDs, of course, is the first one, and that will be equal to batch, and in here we access our input IDs. And additionally, as you saw before, we moved our model to our GPU, and we also want to do that for our tensors here as well.

So we say to device. Okay, and we follow this structure for our other tensors as well. Now, for masked-language modeling, we don't need to do anything with token type IDs, so we just ignore those. We have our attention mask; we do need that. And we also have our labels, which we do need, of course.

And with that we can process everything. So now we do outputs equals model, and we pass our input IDs. Then we want to specify the attention mask, so we just copy that, and we also need to specify our labels, which is labels. Okay, now let's just extract the loss from those outputs.

So we get a loss tensor there, and what we do here is use the backward method, which calculates the gradient of the loss for every parameter in our model. And from those gradients we can apply the weight update using our optimizer. So we have optim and we call step, and this will take a step to optimize all of the weights within our model based on the loss.

Now, the final little bit here is just aesthetics. I want our loop to actually show some bits of information. So all I do is loop set description, and here I just want to show the epoch, which is just epoch. And then I also want to see the loss in the postfix.

So we do loop set postfix, and we do loss item. Item here just pulls out the exact value within that loss tensor up here. Okay, that should be everything. Let's go see what we have. There we go. So now we're training; you can see the loss is going down slowly, and that's it.
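
For reference, the whole loop described above might look something like this, with names like dataloader and optim following the earlier sketches:

    from tqdm import tqdm

    epochs = 2
    for epoch in range(epochs):
        # wrap the data loader in tqdm for a progress bar that persists per epoch
        loop = tqdm(dataloader, leave=True)
        for batch in loop:
            # reset gradients calculated in the previous step
            optim.zero_grad()
            # move the batch tensors to the same device as the model
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            # forward pass; passing labels makes the model return the MLM loss
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            # backpropagate and update the weights
            loss.backward()
            optim.step()
            # show the epoch and current loss on the progress bar
            loop.set_description(f'Epoch {epoch}')
            loop.set_postfix(loss=loss.item())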

So we're now training our transformer model on Meditations by Marcus Aurelius with masked-language modeling. It's really not that hard. I mean, there is quite a bit to it, but I think once you do it, it's reasonably straightforward. And the fact that you can do this on basically any set of text using just a masking function is incredibly useful.

So we don't need to go out looking for labeled data anywhere, which is amazing. So that's it for this video. I hope it's been useful. I know it's a bit of a long one, but thank you very much for watching, and I will see you again in the next one.