
Training BERT #2 - Train With Masked-Language Modeling (MLM)


Chapters

0:00 Intro
1:00 Setup
6:10 Masking
14:37 Data Loader
14:57 Dataset Object
21:07 Training

Whisper Transcript | Transcript Only Page

00:00:00.000 | Okay, in this video what we're going to do is take a look at how we would train a model,
00:00:06.160 | a transformer model, using masked-language modeling or MLM. Now MLM, typically we would use it when we
00:00:15.120 | want to teach a transformer model like BERT to better understand the specific style of language
00:00:23.360 | in our specific use cases. And it consists of taking an input sentence or sequence,
00:00:33.200 | masking a few of the tokens within that input sequence,
00:00:37.200 | and asking BERT to predict the words that we have masked. So this is pretty useful because we can
00:00:45.920 | take any chunk of text and process it through a masking function and we can use that for training.
00:00:54.560 | We don't need to get labeled data, which is really, really useful. So let's jump straight into it.
00:01:02.240 | And what we first need to do is import everything we need. So we need
00:01:08.240 | our tokenizer and model from transformers and we also need to import PyTorch. So do from transformers
00:01:16.320 | import BertTokenizer and BertForMaskedLM. Then we also want to import torch.
00:01:31.040 | And then what we want to do is initialize our tokenizer and model. So our tokenizer is a BERT
00:01:39.360 | tokenizer from pre-trained. And we're using the BERT base uncased model. Let's copy that
00:01:54.720 | and our model will be pretty similar. So this time using BertForMaskedLM.
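For reference, a minimal sketch of the imports and initialization described here, assuming the bert-base-uncased checkpoint used throughout the video:

```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

# load the pre-trained tokenizer and the masked-language-modeling model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
```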
00:02:01.520 | MaskedLM is just masked-language modeling or MLM that I mentioned before. So
00:02:09.600 | that's great. Now I'm going to be training this on
00:02:15.920 | a book that you can just get from the internet. It's Meditations by Marcus Aurelius.
00:02:24.400 | The language in that is pretty unique so I figure this is quite a good example. So I already have it
00:02:32.320 | downloaded and I've cleaned up a little bit. I will include a link to that clean version
00:02:38.480 | of this so you can follow along if you want. So for me, of course I already have it downloaded
00:02:50.000 | here. Meditations clean.txt. And we are reading that in.
00:03:02.480 | as fp, and all I need to do here is read. And what I've done is split each paragraph
00:03:17.360 | within Meditations by a newline character. So I will just split by newline. And that should
00:03:26.800 | get us what we want. Okay so we have this text now and what we want to do with this
00:03:34.800 | is actually tokenize it. And this is just like we normally would with the transformers library.
00:03:44.960 | So we have our tokenizer up here. And we just pass our text into that. Now we're using PyTorch
00:03:55.920 | here so we want to return PyTorch tensors pt. And we also need to set the maximum length which for
00:04:08.400 | this BERT model is 512. And then we need to set truncation to true. And padding equal to max
00:04:24.800 | length. So this will either truncate or pad each one of these sentences to the length of 512 tokens.
00:04:37.040 | This should be return_tensors. So there we go. And here we are. So we still have our input IDs.
00:04:49.120 | We don't need to worry about token type IDs here. And we have our attention mask
00:04:54.000 | which BERT just uses for calculating attention. I'm not going to really go into depth on any of that.
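A sketch of the file reading and tokenization steps just described; the filename here is a placeholder for wherever you saved the cleaned copy of Meditations:

```python
# read the cleaned text and split it into paragraphs (one paragraph per line)
with open('meditations_clean.txt', 'r', encoding='utf-8') as fp:
    text = fp.read().split('\n')

# tokenize every paragraph, padding or truncating each one to 512 tokens
inputs = tokenizer(text, return_tensors='pt', max_length=512,
                   truncation=True, padding='max_length')
```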
00:05:05.760 | Now as I said before we need two things for training our BERT model here. We need the
00:05:13.840 | input IDs which will have a mask token. Now we haven't created that mask token yet.
00:05:19.440 | And we also have our output labels which will not include that mask token. So before we mask
00:05:28.880 | our input IDs we need to create a copy of that which we will use as our labels. So we write
00:05:37.600 | inputs labels and we set that equal to inputs input IDs. So our input IDs tensor. And we clone
00:05:50.400 | that by first detaching it and then cloning it. And that's all we need. So have a look at inputs
00:06:01.600 | again. Now we have input IDs at the top. And if we go down to the bottom we have a copy of those
00:06:07.360 | in this labels tensor. Now what we need to do is create our mask. So with BERT when they are
00:06:17.600 | pre-training BERT they use a few rules. But at the core of that the main rule is that each token
00:06:25.760 | that is not a special token has a 15% chance of being masked. So when I say special token I mean
00:06:35.440 | the separator and classifier tokens which look like this. And I'll point those out in a minute.
00:06:42.320 | In fact we can have a look here. This is our classifier token. This 101. And you see that at
00:06:48.960 | the start of every sequence. And then at the end here we also have padding tokens. We also don't
00:06:53.360 | want to mask those. So to create that 15% probability for each token what we do is use
00:07:01.920 | the torch rand function. And we use this to create a tensor of floats that has equal dimensions
00:07:10.720 | to our input IDs here, the inputs input IDs tensor. Like so. And if we check the shape of rand we see
00:07:24.640 | this 507 which is the number of sequences we have. And 512 which is the number of tokens that each
00:07:30.000 | sequence has. So if we were to just take this we'd see we get the same. Okay. Now we have a look in
00:07:40.720 | there. It's just a set of floats from the value 0 up to 1. Now what we want to do is mask roughly
00:07:51.600 | 15% of these. Or give each one of those a 15% probability of being masked. And the way we do
00:07:57.200 | that is mask anything that is under the value 0.15. So for example these ones here they will be masked.
00:08:05.280 | Whereas these ones up here will not be masked. To do that all we write is rand and we do less than
00:08:17.520 | 0.15. Now if we have a look at mask array we see that now these values that were less than
00:08:27.920 | 0.15 have this true value which is what we'll be using to mask our tokens later on. But at the same
00:08:35.680 | time if you remember the classifier token is always in the first position within each tensor.
00:08:45.360 | So here we would have a classifier token. Here too. And in fact all of these would also be padding
00:08:51.520 | tokens. We don't want to mask any of those. So what we do to avoid that is we add some extra logic.
00:08:59.520 | So I'll put that in brackets but actually we're going to just first test the logic so I can show
00:09:09.040 | you what it's actually doing. So we have our inputs. Input IDs. Okay so these are the padding
00:09:16.720 | tokens. These are the classifier tokens. And what we do is just say inputs. Input IDs not equal to
00:09:28.240 | 101 which is our classifier token. Okay and now you see that we get a false wherever there is a
00:09:34.320 | classifier token. And we want to do the same but for our padding. And to do that we multiply that
00:09:44.400 | so that is essentially adding it to the logic here. It's like an and statement. And now we are
00:09:56.800 | removing the padding tokens from that mask. And there's one more. We can't see it here but it's
00:10:04.080 | also the separator token which is represented by the token ID 102. So we also include that
00:10:11.280 | in here as well. Now all of these together we want to add these onto the logic up here.
00:10:21.120 | Okay and now we will get our mask array. You see now we have false values wherever we had
00:10:30.160 | the padding tokens. We have false values wherever we had the classifier tokens. But we still have
00:10:34.960 | a few masked positions in there. So we have these true values here. Okay so that's our mask array.
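Putting the labels copy and the masking logic together, it might look roughly like this sketch (for bert-base-uncased, 101 is [CLS], 102 is [SEP], 0 is padding):

```python
# keep an unmasked copy of the input IDs to use as the labels
inputs['labels'] = inputs.input_ids.detach().clone()

# a random float in [0, 1) for every token position
rand = torch.rand(inputs.input_ids.shape)

# ~15% masking probability, but never mask [CLS] (101), [SEP] (102) or padding (0)
mask_arr = ((rand < 0.15) * (inputs.input_ids != 101) *
            (inputs.input_ids != 102) * (inputs.input_ids != 0))
```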
00:10:47.840 | And now what we want to do is take the indices of each true value within each one of these
00:10:58.240 | rows of the tensor. Now let's first do that with just one of them so you can see how it works. So
00:11:07.360 | we take this one. Check the shape. It should be 512. Yeah so this is just one row here.
00:11:17.440 | And what we'll do is we'll say non-zero and this will return the indices where we have
00:11:28.640 | the, well, where we have non-zero values, i.e. the true values.
00:11:34.720 | But this is like a vector so what we want to do here is flatten that. So we do torch flatten.
00:11:43.200 | Now we get almost a list but it's still a tensor. We want an actual list and we just write to list.
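For a single row, that chain of operations might look like this, continuing from the mask_arr sketch above:

```python
# indices of the true values in the first row, flattened into a plain Python list
torch.flatten(mask_arr[0].nonzero()).tolist()
```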
00:11:51.520 | Okay so now these are the index positions for the true values within this first row. But we
00:12:00.560 | want to do that for every row, and to do that all we do here is just use a for loop.
00:12:06.720 | So we initialize our selection list here and we say for call it row or for i in mask array
00:12:20.800 | shape zero. So mask array shape zero. Let me just show you.
00:12:28.640 | It's the 507 rows that we have. We want to do selection.
00:12:35.920 | Append and then we already have our logic here so we want to append
00:12:43.040 | this. But we're going to append this for every single one of those rows.
00:12:48.640 | So let's um
00:12:51.520 | oh so sorry let's add a range on here.
00:12:58.160 | And let's have a look at what we get in the selection. So we'll just have a look at the first
00:13:05.920 | let's go to first five.
00:13:10.960 | Ah um sorry I just need to replace that with i. There we go. So now we have indices for the first
00:13:18.560 | five rows here. We have it for all of them of course but we're showing you the first five.
00:13:22.160 | And there we go, that's what we want.
00:13:26.160 | And then what we want to do is we can just copy this.
00:13:33.680 | We want to set the values at each one of these indices equal to 103 which is our mask token
00:13:43.840 | within each row of our input ids tensor. So we go inputs input ids.
00:13:52.880 | Then here we need to select those specific values.
00:13:59.840 | And that is at row i followed by selection. So a selection of indices
00:14:06.320 | at i as well. And we set those equal to 103 like so.
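Collecting those index lists for every row and then writing in the mask token might look like this sketch:

```python
selection = []
# gather the positions to mask in each row of the mask array
for i in range(mask_arr.shape[0]):
    selection.append(torch.flatten(mask_arr[i].nonzero()).tolist())

# overwrite those positions with the [MASK] token ID (103) in each row
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103
```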
00:14:15.600 | And now let's have a look at what we have in our input ids tensor.
00:14:21.040 | So now we can see we have these mask tokens where we saw the true values before in our mask array.
00:14:30.480 | And we haven't touched any of the classifier or the padding tokens or the
00:14:34.080 | separator tokens which are in there as well. Now our tensors here are in the correct format
00:14:41.520 | but we still need to process them through something called a data loader during training.
00:14:48.480 | Now to process them through a data loader we need to convert them into a PyTorch data set object.
00:14:58.000 | And to do that what we're going to do is create a class here which will
00:15:02.720 | handle this for us. So it's going to be meditations data set.
00:15:06.320 | And to create the data set object we need to pass the data set class into here.
00:15:15.440 | So this is torch utils data data set.
00:15:24.880 | Now there's a few things we need here. We need the initialization function
00:15:30.880 | which is just init and we pass self and encodings.
00:15:40.160 | And here we're just going to assign encodings to an attribute within this class.
00:15:52.160 | Encodings equals encodings. Now the data loader expects two additional functions
00:16:00.960 | or methods that is the get item method and the length method. Length method is so that you can
00:16:08.720 | check the length of the data set that it's looking at and the get item is so that you can get a
00:16:15.360 | dictionary formatted batch of those items. So for get item we write this and we need self and then
00:16:26.160 | we also specify the index. And what we do is we return a dictionary and this is just going to
00:16:36.080 | pass so we have the input IDs key, we have the labels key, attention mask key, and token type
00:16:44.160 | IDs key. It's going to pass those back to the data loader when it requests this get item method.
00:16:51.440 | So we write torch tensor and we pass the values and the index of those values for key
00:17:04.720 | val in self encodings dot items. So that should be okay.
00:17:17.840 | And the only thing left is the length method. So we define length and here there's no input
00:17:29.520 | parameters. All we need to do is return the length of our data set. So it's return
00:17:36.480 | length and we're doing self encodings and then we can just use any of the
00:17:44.400 | tensors that we have in there but we'll do input IDs.
00:17:53.280 | And we could even modify this to be like shape zero and get rid of
00:18:00.000 | length at the end there but I'll just stick with length for now.
00:18:05.440 | So that is our class which will handle the formatting of our data into a data set object
00:18:19.760 | and all we need to do is we write data set. So this is going to be our new data set variable.
00:18:24.640 | We have meditation data set our class and in here we just pass our encodings or our inputs.
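The dataset class and its instantiation described above, as a rough sketch:

```python
class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # store the dict of tensors (input_ids, attention_mask, labels, ...)
        self.encodings = encodings

    def __getitem__(self, idx):
        # return one sample as a dict of tensors, which the data loader expects
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        # number of sequences in the dataset
        return len(self.encodings.input_ids)

# build the dataset from our tokenized (and masked) inputs
dataset = MeditationsDataset(inputs)
```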
00:18:32.720 | Like so. Okay and now we can initialize our data loader which PyTorch will be using to load
00:18:46.800 | our data during training. So we write data loader equals torch utils data and data loader.
00:18:57.280 | Here we want to pass our data set and then we also want to specify our batch size. So I'm going to go
00:19:08.240 | with 16. You can modify this depending on your GPU or your computer, whatever, however much
00:19:18.080 | memory you have. And then we also want to shuffle the data within there as well so that we're not
00:19:25.920 | extracting say the first 16 paragraphs all at once. We're actually going to be extracting 16
00:19:31.920 | from random parts of the book. Okay now we're ready to move on to actually training. So first
00:19:42.160 | we need to set up all the training parameters. So we first want to move the model to
00:19:53.840 | GPU if you have a GPU and we check if we have a GPU using let me show you first so torch device
00:20:06.320 | CUDA. If torch CUDA is available else torch device CPU. So we're saying here if we have a
00:20:24.240 | CUDA enabled GPU use that otherwise we just use CPU. I can see here that I do have it so we have
00:20:34.480 | this device type CUDA. And what we'll do is assign that to the device variable here and we use that
00:20:41.280 | to move our model and everything across to that device. And we do that using model to device and
00:20:50.320 | we should get a big output here. We get all this information that we don't need to look into that.
00:20:58.800 | Now we need to activate our model's training mode, so we just do model train
00:21:06.240 | to make sure it's ready. And the final thing before we set up our actual training loop
00:21:14.480 | is we need to initialize our optimizer. We're going to be using Adam with weight decay here.
00:21:19.680 | So that's the Adam optimizer with weight decay. Weight decay just reduces the chance of
00:21:25.440 | overfitting, especially with big models like transformer models. So we're going to do from
00:21:30.720 | transformers import AdamW and our optimizer is going to be AdamW. Pass in our model parameters
00:21:46.240 | and we also need to pass in a learning rate, and we'll do 1e-5.
00:21:55.200 | So model parameters brackets at the end there. Okay. Okay now we're fully set up we can actually
00:22:03.440 | begin training which is set up as a normal training loop in PyTorch.
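Pulling the setup described above together, a sketch might look like this, using the batch size and learning rate mentioned in the video (note that recent transformers versions have removed AdamW; torch.optim.AdamW is the drop-in replacement):

```python
from transformers import AdamW  # in newer transformers versions, use torch.optim.AdamW instead

# batches of 16, shuffled so each batch mixes paragraphs from across the book
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

# use a CUDA-enabled GPU if one is available, otherwise fall back to CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# put the model into training mode
model.train()

# Adam with weight decay, with a small learning rate
optim = AdamW(model.parameters(), lr=1e-5)
```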
00:22:08.240 | And the first thing I want to do is just import tqdm. This allows us to create a progress bar
00:22:17.360 | during training, otherwise we're just sat there and we don't see any updates on training, which we
00:22:23.520 | don't want obviously. I'm going to say so we do two epochs. You can obviously modify this as you
00:22:31.920 | want. We're just seeing how this all works so I'm not going to train it that much.
00:22:37.120 | And you want to be careful of training transformer models for too many epochs. They overfit very
00:22:43.520 | easily. And we'll do for epoch in range epochs. And then here we want to set up our training loop.
00:22:56.880 | So to do that we want to wrap it within a tqdm function there and we just pass our data loader
00:23:07.280 | which what did I call it data loader up here. And that leave equals true. This just leaves
00:23:14.000 | the progress bar rather than replacing it with every new epoch. And then we run through each
00:23:21.600 | batch within our loop. So this is our batches of 16 items at a time. And we first want to initialize
00:23:35.120 | our calculated gradients. So with every loop we will calculate gradients, and we don't
00:23:44.160 | want to start with gradients already calculated. We want to initialize them or
00:23:48.640 | set them to zero. So we do optim zero grad. Then we want to pull all of our tensors that we
00:23:56.960 | require for training. So input ids of course first one. And that will be equal to batch.
00:24:05.920 | Then in here we access our input ids. And additionally you see before we moved our
00:24:14.560 | model to our gpu we also want to do that for our tensors here as well. So we say to device.
00:24:26.640 | Okay and we follow this structure for our other tensors as well. Now for masked-language
00:24:34.320 | modeling we don't need to do anything with token type ids so we just ignore those.
00:24:39.040 | We have our attention mask. We do need that.
00:24:51.840 | And we also have our labels, which we do need of course.
00:25:01.440 | And with that we can process everything. So now we do outputs model and we pass our
00:25:01.440 | input ids. So inputs we want to specify the attention mask. So we just copy that
00:25:09.200 | and we also need to specify our labels which is labels. Okay now let's just extract the
00:25:21.600 | loss from those outputs. So we get a loss tensor there and what we do here is we use the backward
00:25:30.800 | method, which calculates the gradient of the loss for every parameter in our model. And from that we can calculate the
00:25:42.080 | gradient update using our optimizer. So using that we have optim and we call the step. And this will
00:25:51.040 | take a step to optimize all of the weights within our model based on the loss. Now final little bit
00:25:59.840 | here this is just you know aesthetics. I want our loop. I want to actually see some bits of
00:26:05.840 | information in that loop. So all I do is loop set description. And here I just want to show the epoch
00:26:17.280 | which is just epoch. And then I also want to see the loss in the postfix. So we do loop set postfix
00:26:28.400 | and we do loss item. So item here just pulls out the exact value within that loss tensor up here.
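And the training loop itself, as a sketch of the steps just described:

```python
from tqdm import tqdm

epochs = 2

for epoch in range(epochs):
    # wrap the data loader in tqdm for a progress bar; leave=True keeps each epoch's bar
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # reset gradients calculated in the previous step
        optim.zero_grad()

        # move the batch tensors to the same device as the model
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # forward pass; the model returns the MLM loss when labels are provided
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # backpropagate and take an optimization step
        loss.backward()
        optim.step()

        # show the epoch number and current loss on the progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
```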
00:26:43.840 | Okay that should be everything. Let's go see what we have. There we go. So now we're training.
00:26:51.760 | See the loss is going down slowly and that's it. So we're now training our transformer model using
00:27:02.800 | Meditations by Marcus Aurelius with masked-language modeling. It's really not that hard. I mean
00:27:11.040 | there is quite a bit to it but I think once you do it it's reasonably straightforward.
00:27:16.880 | And the fact that you can do this on basically any set of text using just a masking function is
00:27:26.000 | incredibly, so so useful. So we don't need to, you know, go out looking for labeled data anywhere, which
00:27:32.800 | is amazing. So that's it for this video. I hope it's been useful. I know it's a bit of a
00:27:40.480 | long one but thank you very much for watching and I will see you again in the next one.