
Training BERT #1 - Masked-Language Modeling (MLM)


Chapters

0:00 Introduction
2:15 Walkthrough
4:22 Tutorial

Transcript

Okay, so in this video, we're going to have a look at what I think is the more interesting side of transformers, which is how we actually train them. Typically with transformers, what we do is download a pre-trained model from Hugging Face, and at that point we can either use the pre-trained model as is, which in a lot of cases will be good enough.

But at other times, we might want to actually fine-tune the model, and that is what I'll be showing you how to do here. At the core of BERT, there are two different training or fine-tuning approaches that we can use, and we can even use both of them together. For this video, what we're going to have a look at is how to use masked-language modeling, which is called MLM.

And MLM is probably the more important of those two core training approaches, the other one being next sentence prediction. So what MLM is, is we essentially give BERT an input sequence, so this would be our input sequence, and we ask BERT to predict that same input sequence as the output.

And BERT will optimize the weights within its encoder layers in order to produce this output. Now, obviously that on its own is pretty easy, so what we do is mask some random tokens within the input. So here we might mask one, and what we do is replace it with another token, a special token called the mask token, which is written as [MASK].

And when we're doing MLM, we would typically mask around 15% of the input tokens. So if we take a look at how that works, it might look a little complex, but it's pretty straightforward. Down here, we have our input from the previous slide, and we process that through our tokenizer like we normally would with transformers.

And then in the middle here, I haven't drawn it, but there's a masking function, and that masking function will mask around 15% of the tokens in the input IDs tensor. So here we have a mask token, and these will then get processed by BERT.

And BERT will output a set of vectors, which all have length 768. There are different BERT models with different hidden sizes, but we'll go with 768 here. We then pass those through a feed-forward network, and that outputs our logits up here, each of which has a size equal to the vocab size.

With this model, the vocab size is something around 30,500. And then from there, to get the predicted token for each one of those logits, we apply a softmax function to get a probability distribution, and then we apply an argmax function, which is what you can see here.

So this is just an example for one of those logits: we apply the softmax to get the probability distribution, then apply our argmax to get our final token ID, which we can then decode using our tokenizer to get an actual word in English.
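As a rough sketch of that decoding step (assuming we already have an `outputs.logits` tensor from a BertForMaskedLM forward pass like the one we run later, plus the tokenizer), it might look something like this:

```python
import torch

# outputs.logits has shape (batch_size, seq_len, vocab_size).
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)  # probability distribution over the vocab
token_ids = torch.argmax(probs, dim=-1)                      # most likely token ID at each position

# Decode the predicted IDs for the first (and only) sequence back into text.
print(tokenizer.decode(token_ids[0]))
```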

So that's how it works. Let's have a look at how we actually do that in code. Okay, so first we'll need to import everything we need. We're using transformers here, specifically the BertTokenizer and BertForMaskedLM classes, and then we'll also be importing torch as well.

So from transformers, we import our tokenizer and also BertForMaskedLM, which is for MLM, masked-language modeling, and then we also import torch. Okay. And then what I want to do here is initialize our tokenizer and model, and I do that just as we normally would with Hugging Face transformers.

So we do BertTokenizer.from_pretrained, and here we're using bert-base-uncased. Then we also want our model, which is BertForMaskedLM, and this will also be from_pretrained, again using the same checkpoint, bert-base-uncased. Okay, so that's our tokenizer and model.
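Putting those imports and the initialization together, the setup looks like this (using the bert-base-uncased checkpoint mentioned above):

```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load the pre-trained tokenizer and the BERT model with a masked-language-modeling head.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
```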

And I'm also going to use this example text here. You can see that this first mask should be 'election', and this one here should be 'attacked'. Okay, now let's execute that. And now what we want to do is actually tokenize that chunk of text.

To do that, we write inputs equals tokenizer, and all we do is pass our text in there. We're using PyTorch here, so we want return_tensors='pt'. Okay, and let's have a look at what tensors we get back from that. You see we have our input_ids, token_type_ids, and attention_mask.

Now, we don't need to worry about token_type_ids whatsoever for MLM. The attention_mask, MLM does use, but I'm not going to go into any detail on it. All we want to focus on in this video is input_ids, so let's have a look at what we have there.

There are a few things that I want to point out. First, we have our special tokens: we have the CLS or classifier token here, and we have the separator token, SEP. We also have our mask tokens, one here and one here. Everything in between are actual real tokens from our text.
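The transcript doesn't reproduce the exact passage used in the video, so the text below is a hypothetical stand-in with two [MASK] positions that should resolve to 'election' and 'attacked'; the tokenization step itself is the same:

```python
# Hypothetical stand-in for the video's example passage.
text = ("After Lincoln won the November 1860 presidential [MASK], rebel forces "
        "[MASK] Fort Sumter and the war began.")

inputs = tokenizer(text, return_tensors='pt')
print(inputs.keys())     # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs.input_ids)  # 101 = [CLS], 102 = [SEP], 103 = [MASK]
```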

So what we have now are our inputs, and what we do is use these inputs initially to create our labels. But what I've done here is already masked our inputs, so what I'm going to do is actually replace those masks with the actual words. So this one is 'election',

and this one is 'attacked'. So we just rerun that. Okay, and now what we can do with that is actually create our target labels. The target labels need to be contained within a tensor called labels, and it just needs to be a copy of this input_ids tensor.

To create a copy of that, we write detach and then clone it. Okay, so that creates our copy, which is not going to be connected to our input_ids. And now if we just have a look at our inputs, we can see input_ids at the top and labels at the bottom.
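Continuing the sketch, we re-tokenize the un-masked version of the stand-in text (so the labels hold the real token IDs) and copy input_ids into a labels tensor:

```python
# Un-masked version of the hypothetical text, with 'election' and 'attacked' filled in.
text = ("After Lincoln won the November 1860 presidential election, rebel forces "
        "attacked Fort Sumter and the war began.")
inputs = tokenizer(text, return_tensors='pt')

# The labels tensor is an independent copy of input_ids.
inputs['labels'] = inputs.input_ids.detach().clone()
print(inputs)
```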

They're just copies. Okay. Now what we want to do is mask a random selection of tokens within the input_ids tensor, but not the labels tensor. To do that, we can use the PyTorch rand function, and with that, we'll create a random array of floats with the same dimensions as the input_ids tensor.

All we do is pass input_ids.shape into there, and if we check the shape afterwards, we get this 1 by 62, which matches our input_ids. And if we have a look at what we have there, it's just a set of floats between zero and one.
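In code, that random float tensor looks like this (the sequence length will of course depend on the text used):

```python
# Random floats in [0, 1) with the same shape as input_ids.
rand = torch.rand(inputs.input_ids.shape)
print(rand.shape)  # torch.Size([1, sequence_length])
```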

Now, if we want to select a random 15% of those, what we do is create a new array, mask_arr, and this will be equal to rand less than 0.15. Okay, and this will select roughly 15% of those. Let me show you what that looks like.

So this creates a Boolean array: all of these are false, and these true values are where we'll put our mask tokens later on. Now, there's one here, and this one is covering our separator token, and we don't want to mask our separator or classifier token.

We don't want to mask any special tokens. So what we can do is add an extra little bit of logic, which looks like this: we take inputs.input_ids and say not equal to 101, which is our classifier token ID. And if we have a look at what this gives us, we see that we now get true for everything except the classifier token.

Then we multiply this by the same rule, but for our separator token, which is 102. So now you see that we have false here and false here. All we need to do is add this to our mask array logic up here, with brackets around each condition, and this will make sure that these two positions are always false no matter what.
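Putting that together, the full masking condition looks like this (101 and 102 being the [CLS] and [SEP] token IDs):

```python
# True where we will place a mask, but never on the [CLS] (101) or [SEP] (102) tokens.
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102)
print(mask_arr)
```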

Okay. Now what I want to do is actually get the index positions of each one of these true values, and to do that, we write torch.flatten. This is going to flatten the tensor we'll get out of the next bit of code, which will make more sense in a moment.

Let's start from the first part of the code. We take our mask_arr, and we call nonzero on it, and that will get us the indices where we have true, or non-zero, values. Then what we want to do is convert that into a list.

But you see that we have a list within a list, and this is where torch.flatten comes in. So we wrap the whole thing in torch.flatten and then convert it to a list, and that gives us a flat list of indices where we have these true values.
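So the whole selection step looks like this:

```python
# Flat list of index positions (within the single sequence) chosen for masking.
selection = torch.flatten(mask_arr[0].nonzero()).tolist()
print(selection)
```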

So that's our selection. Now what we want to do is use that selection to pick out those specific indices within our input_ids tensor. So we index the first (and only) sequence, the zero index, followed by the selection, and we set those positions equal to 103.

Let's have a look and see if that works. 103 is our mask token ID, by the way. And you can see here that we now have mask tokens in those positions, so we've just masked a random, roughly 15% of those tokens. From there, we can pass all of this into our model, and the model will calculate our loss and the logits that we saw before.
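The masking assignment itself is just this (103 being the [MASK] token ID):

```python
# Overwrite the selected positions in input_ids with the [MASK] token ID;
# the labels tensor still holds the original token IDs.
inputs.input_ids[0, selection] = 103
print(inputs.input_ids)
```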

We do that as we normally would when we're using Hugging Face and PyTorch: we call our model and pass our inputs as keyword arguments. Looking at what the output gives us, we'll see we have these two tensors: we have loss and we have logits. Now, let's have a look at what that loss looks like.
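That forward pass and the check on the outputs look roughly like this:

```python
# BertForMaskedLM returns both the loss (because we supplied labels) and the logits.
outputs = model(**inputs)
print(outputs.keys())  # odict_keys(['loss', 'logits'])
print(outputs.loss)    # a scalar tensor we could call .backward() on to optimize the model
```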

Okay, so we get this value here, and that is our loss. And of course, with that loss, we can actually optimize our model. So that's how masked-language modeling works. Now, when we're actually training a model using masked-language modeling, obviously the code is slightly different, and there's also a reasonable amount of depth we'd need to go into for that.

So I'm not going to include it in this video, but I am going to do a video on that, actually training a model using masked-language modeling, pretty soon. I'll leave a link to that in the description, because I know some of you probably want to watch it to understand how to actually train your own models using this.

But that's it for this video. I hope it's been useful. And I will see you again in the next one.