
Building MLM Training Input Pipeline - Transformers From Scratch #3


Chapters

0:00
2:16 Create Our Masked Language Modeling Function
21:59 Initialize Our Data Set
22:20 Data Loader

Transcript

Hi, welcome to this video. So this is the third video in our Transformers from Scratch miniseries. And in the last two videos, we basically got a load of data and trained our tokenizer, which is what you can see here. So I'm going to have to rerun these cells. And we just tokenized some Italian text.

So this is an Italian BERT-style model, a RoBERTa model. So we can write something like this, which means, hello, how are you? And it will tokenize our text here. Now, this is where we got to. What we now want to do is build out an input pipeline and train a model.
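
For reference, a minimal sketch of where the last video left off might look like this, assuming the trained tokenizer was saved in a Hugging Face-compatible directory (the './tokenizer' path is a placeholder):

```python
from transformers import RobertaTokenizerFast

# load the tokenizer trained in the previous video (directory name is a placeholder)
tokenizer = RobertaTokenizerFast.from_pretrained('./tokenizer')

# 'ciao, come va?' means 'hello, how are you?'
tokens = tokenizer('ciao, come va?', return_tensors='pt')
print(tokens)  # dict-like output with input_ids and attention_mask tensors
```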

Now, this is reasonably involved, I would say, because we need to do a few things. The first thing is that we need three different tensors for our model here. We need the input IDs and the attention mask. We also need the labels tensor as well. Now, the labels tensor is actually just going to be our input IDs exactly as they are at the moment.

But our input IDs are not going to stay that way. Our input IDs need to be passed through a masked language modeling script, which will mask around 15% of the tokens within that tensor. And then whilst we're training, our model is essentially going to try and guess what those masked tokens are.

And we'll optimize the model using the loss between the guesses that the model outputs from the input IDs and the real values that are our labels. So that's essentially how it's going to work. I suppose it's a lot easier said than done. So the first thing we're going to do is create our masked language modeling function.

If you have watched some of my videos before, we quite recently did, I think, two videos on masked language modeling. I'll leave a link to those in the description because the code is pretty much the same as what we cover there. And I will cover it very quickly here, but not too in-depth.

So if you're interested, those links will be there in the description. So the very first thing we need to do, which I haven't done yet, is import torch. So we're using PyTorch here. And we need to create a random array. So we write torch.rand. And this random array needs to be in the same shape as our tensor that we've input up here.

So let's write tensor.shape. And then we want to mask around 15% of those. So to do that, we can use rand, where rand is less than 0.15. Because what we've created here is an array in the shape of our input tensor, which is going to be our input IDs, where every value is within the range of 0 to 1.

And that's completely random. So that means, for each token, there's a roughly 15% chance of that value being under 0.15. So that's our first criterion for our random masking array, the roughly 15%. We can call this the mask array as well. But there are a few other criteria too.
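
Inside the function we're building, that first criterion looks something like this (a sketch, where tensor stands for the input IDs tensor we pass in):

```python
import torch

# random floats in [0, 1), in the same shape as the input_ids tensor
rand = torch.rand(tensor.shape)

# True wherever the random value is under 0.15, i.e. roughly 15% of positions
mask_arr = rand < 0.15
```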

And again, I covered that in the other videos. But in short, we don't want to mask special tokens. So we can see up here we have two special tokens. If we add padding-- so I'm just going to add a little bit of padding, not loads. Let's go max length 10.

And we write padding equals max length. If we do that, we get these extra ones here. They're our padding tokens. So we basically want to say we don't want to mask our 0s, 1s, or 2s, because they're special tokens. So we also just put where tensor is not equal to 0.

And let's just copy that. It's a little bit easier. And also not equal to 1. And it will also not be equal to 2. And I do wonder if-- yeah, we could make that a bit nicer. So if we just do-- so you can either do this, where you specify each token.

And you will want to do that sometimes, maybe, like with BERT, because your special tokens are values like 0, 100, 101, and so on. There are a few different ones. But because everything we have here is either 2 or below, we could just write this. So we could say where it is greater than 2.

So this is like an AND statement saying, we're going to mask tokens that have a randomly generated value of less than 0.15, which is our 15% criterion, and that are not a special token, i.e., they are greater than the value 2, because our special tokens are 0, 1, and 2. So that's cool.
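
Combined, the two criteria look roughly like this:

```python
# mask ~15% of tokens, but never the special tokens (ids 0, 1, and 2 here)
mask_arr = (rand < 0.15) & (tensor > 2)
```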

And now what we want to do is loop through each row in our tensor. So we want to do for i in range tensor dot shape 0. So this is how many rows we have in our tensor. And we can't do this in parallel, because each row is going to have a different number of tokens that will be masked.

So if we did this in parallel, we'd end up trying to fit different-sized rows into an equally sized tensor. So we can't do that. Again, if this is confusing, I have those videos. But I mean, you don't need to specifically know everything that's going on here. This is just how we mask those roughly 15% of tokens.

So we want torch flatten. And this bit is a bit confusing. But we want to take the mask array at the current position. I want to say where it's not 0. So when we create this mask array, we essentially get a load of true or false values in the size of our tensor shape.

Where we have 1s, those are the mask positions. And what we're doing here is we're saying, get me a list of all the values that are not 0, which are the 1s. And that gives us a list within a list. So we get something like this, and it will say indices 2, 5, and 18.

They are where your mask tokens will be. And then we use torch.flatten here to remove that outer list. And at the end here, we're going to convert to a list so that we can do some fancy indexing in a moment.

And that fancy indexing looks like this. So we have our tensor. We're specifying the current row. So we're going row at a time. And then we want to specify that selected number of indices, which are where we're going to place our mask. Now, what does the mask token look like?

Well, we can actually find it over here in our vocab.json. Yeah. So scroll to the top, and we see our mappings here. So the mask token is number 4. So that's what we're going to use. Switch back over. So we're going to make those values equal to 4. That's our mask.

Then at that point, we have successfully masked our input IDs, and we want to return the tensor. So that's our masking function. That's a big part of this video. That's one of the harder parts. So now what we're going to do is I'm going to scroll up a little bit to here.
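
Putting those pieces together, the masking function described above looks roughly like this (assuming, as in our vocab.json, that the mask token id is 4 and the special tokens are 0, 1, and 2):

```python
def mlm(tensor):
    # tensor is a (rows, seq_len) tensor of input_ids
    rand = torch.rand(tensor.shape)
    # ~15% of tokens, excluding the special tokens 0, 1, and 2
    mask_arr = (rand < 0.15) & (tensor > 2)
    for i in range(tensor.shape[0]):
        # indices in row i that were selected for masking
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = 4  # 4 is the <mask> token id in our vocab.json
    return tensor
```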

So we have-- I'm just going to take this. So this will give us a list of all of our training files. So here. And we just need to do from pathlib import Path. OK, let's have a look at what we have. So this is just a list of everything that we have over here.
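
As a sketch, with the data directory and filename pattern as placeholders for wherever the text files from the earlier videos were saved:

```python
from pathlib import Path

# gather all of the plain-text training files (directory and pattern are placeholders)
paths = [str(p) for p in Path('./data').glob('**/*.txt')]
```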

So these are text files containing our Italian samples. Each sample is separated by a newline character. And each file also contains about 10,000 samples. So we have quite a bit of data. And what we're going to do here is we're going to create our three tensors that I mentioned before.

So we have the labels and input IDs, and then we also have the attention mask as well. So let's first initialize a list for each of them. So input IDs, attention mask, which I'm just going to call mask, and labels.

We're also going to use a progress bar here, so I'm going to import that too: from tqdm.auto import tqdm. And what I'm going to do is loop through each path in our paths list, and I'm going to wrap it in tqdm.

This creates our progress bar as we loop over our paths. For each path, we're going to load it, extract our data, convert it into the correct format that we need here, append each one of those to these lists, and then create a big tensor out of that. So we want to write with open.

And then here we have our path, we're reading it, and the encoding is utf-8, as f. We want to write text equals f.read().split('\n'), like that. And I'm going to rename it lines. So this is just a big list of 10,000 samples that are all Italian, OK? So then we want to encode that.

So we write sample equals tokenizer, passing in lines. We want our max length, which is going to be 512. We want padding up to that max length. And we also want to truncate anything that is longer than that. So truncation equals True. OK, that's our tokenization done. Then we want to extract.

We want to extract all of those and add them to our lists. So we get our labels first. Now, the labels are just the input IDs produced by our sample. So sample.input_ids. And I'm thinking here we can do return_tensors equals 'pt' to use PyTorch. So append our input IDs to labels.

And then we have our mask. We want to append the sample attention mask. And then we can also see that up here, by the way. Here, this is what we're doing. We're taking those out, putting them into our list. And then-- so we have labels, mask, and we want to create our input IDs.

Now, input IDs, that's what we built this masked language modeling function for. And in there, we need to pass our tensor. So to do that, we just want to write sample.input_ids. And before I forget, that needs to go within mlm, like that. Now, I don't want to modify that tensor, because it's being appended to labels.

So I'm going to create a clone of that. And that will be done using .detach() and .clone(), like that. So that's pretty good. Let's run that. OK, and it's going to take a long time, so, yeah, I'm not going to use all of them. The estimate was going up as well, so I have no idea how long that would take.
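
Collected together, the loop described above looks roughly like this (a sketch, using the mlm function and tokenizer from earlier, and only the first 50 files, as mentioned in a moment):

```python
from tqdm.auto import tqdm

input_ids = []
mask = []
labels = []

for path in tqdm(paths[:50]):  # only the first 50 files, to keep the runtime down
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')
    # tokenize the ~10,000 samples in this file in one call
    sample = tokenizer(lines, max_length=512, padding='max_length',
                       truncation=True, return_tensors='pt')
    labels.append(sample.input_ids)
    mask.append(sample.attention_mask)
    # clone before masking so the labels keep the original, unmasked ids
    input_ids.append(mlm(sample.input_ids.detach().clone()))
```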

Let's leave that for a little bit. Let's go with the first 50 for now. Still got to wait a little while, but at least not as long. So I'll leave that to run. Hopefully it shouldn't take too long. And, yeah, I'll see you when it's done. OK, so that's done.

It wasn't too long. And if we just have a look-- so input IDs at the moment is just a big list. I don't know if it's a good idea, but here we go. So we just have a list of tensors. What we can do is, rather than having a list of tensors, we can use something called torch.cat.

And torch.cat expects a list of tensors to be passed to it, which is why I've done this, where we have lists and we just append tensors to them. And we can do that, and it will concatenate our tensors, which is pretty cool. So what we want to do now is we write input IDs, and we're just going to concatenate all of our tensors.
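
So the concatenation step is just:

```python
# each list holds one tensor per file; join them into one big tensor each
input_ids = torch.cat(input_ids)
mask = torch.cat(mask)
labels = torch.cat(labels)
```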

So then they're ready for formatting into a data set. So we have mask here and labels here. We can also see, just worth pointing out, we have that mask token there, so we know that we have mask tokens in our input IDs now. If we-- let's run that, and let's just compare.

So let's go input IDs, 0. That's quite a lot, so can I-- I'll just do the first 10. And then let's do the same for labels. We'll see that we don't have these 4s, or we hopefully shouldn't have those 4s. So that's essentially a masking operation. So cover this with a mask here, and then same here and here, here and here.
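
A quick way to eyeball that, as a sketch:

```python
# the masked version should show 4s (<mask>) where tokens were selected,
# while labels should still hold the original token ids
print(input_ids[0][:10])
print(labels[0][:10])
```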

OK, cool. Now the format that our data set and our model need is a dictionary where we have input IDs, which maps to input IDs, obviously. And you can guess the other two as well. So input IDs maps to this one, attention mask to mask, and the final one is labels.
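
In code, that dictionary is simply:

```python
encodings = {
    'input_ids': input_ids,   # the masked token ids
    'attention_mask': mask,
    'labels': labels          # the original, unmasked token ids
}
```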

So those are our encodings. Now we create a data set object. In fact, we create a data set object so that we can create a data loader object, which is what we use to load data into our model. And that's essentially our input pipeline. But to create that data loader, we need to create a data set object.

Now the data set object, we create that like this. So we do class Dataset, call it whatever you want. And we want torch.utils.data.Dataset, like that. We need an initialization function, which is going to store our encodings internally. Don't forget the def there. So we want to write self.encodings equals encodings.

So this is initializing our data set object. And then there are two other methods that this object needs. We need a length method so that we can call length on the data set, and it will return the number of samples that are in the data set. And we also need a get item method, which will allow the data loader to extract a certain sample. So say if it asks, give me number one, it's going to go into this data set object and extract the tensors, the input IDs, attention mask, and labels at position one.

So that's what we need to do there. So we'll do length first. And for length, we don't need to pass anything in there. We're just calling length. So from that, we just want to return self.encodings input IDs. And remember before, we did this .shape, and we took the first dimension, which is the length.

So if I take input IDs, in fact, I can just do it here. So I'll copy that. If I go here, we get that 500K, which is the number of samples we have. That's what we want to return. So that's our length. And then we also have the get item.

So here, we do want to pass an index value. So this is going to be-- our data loader is requesting a certain position. And for that, we want to return a dictionary. It needs to be in this format here. But we need to specify the correct index.

Now, what we could do is self.encodings and then access our input IDs like that. I also need to change that here, otherwise it will give us an error with the .shape. And we could do that. So we could take that like so, and then just index that position.

That's fine. You can do that if you want. But an easier way of doing it, where we don't need to specify the-- we don't care about the structure of the data set. We just want to get it out. We don't need to specify it. We can just do this.

We write key: tensor at the specific index, for key, tensor in self.encodings.items(). So if we were to go encodings.items()-- so we can do that here. See, we get essentially everything in our data set. So we're just looping through that, returning it, and specifying which index we're returning here.
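
Put together, the data set class described above looks roughly like this:

```python
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # store the encodings dict {'input_ids': ..., 'attention_mask': ..., 'labels': ...}
        self.encodings = encodings

    def __len__(self):
        # number of samples is the number of rows in the input_ids tensor
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return one row from each tensor, keyed by the same names
        return {key: tensor[i] for key, tensor in self.encodings.items()}
```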

So once we have written that, we can initialize our data set. So write data set equals data set. And then we just pass in our encodings there. So just remove that, and encodings. That's it. So that's our data set. And now we initialize our data loader. So this is pretty much it for our input pipeline.

So data loader equals torch.utils.data.DataLoader. This is coming from the same place as our data set. Now we pass in our data set object. We want to specify batch size. So I typically go with 16. This will depend on how much your computer can handle at once as well.

So just play around with that, see what works. And we also want to shuffle our data set as well. So yeah, that's our input pipeline. After that, obviously, we want to feed it in and train our model with it. So we're going to cover that in the next video.
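
For reference, those final two steps look something like this (a sketch, using the batch size and shuffle settings just mentioned):

```python
dataset = Dataset(encodings)

# batch size of 16 is a starting point; adjust to whatever your machine can handle
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```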

So thank you for watching, and I will see you in the next one.