Building MLM Training Input Pipeline - Transformers From Scratch #3
Chapters
0:00
2:16 Create Our Masked Language Modeling Function
21:59 Initialize Our Dataset
22:20 Data Loader
So this is the third video in our Transformers from scratch series. In the previous video, we got a load of data and trained our tokenizer, which we'll be using here. So this is an Italian BERT model, a RoBERTa model. What we now want to do is build out an input pipeline for training it. Now, this is reasonably involved, I would say. The first thing is that we need three different tensors: our labels, which are going to be our input IDs as they are at the moment; our input IDs, which we'll pass through a masked language modeling script that will mask around 15% of the tokens within that tensor; and an attention mask. The model will then calculate the loss between the guesses it outputs from the masked input IDs and the real values that are our labels. So that's essentially how it's going to work.
So the first thing we're going to do is create our masked language modeling function. If you have watched some of my videos before, we quite recently did, I think, two videos on masked language modeling. I'll leave a link to those in the description, because the code is pretty much the same as what we're writing here. I will cover it here, but quickly and not too in-depth.
So the very first thing we need to do, which I haven't done yet, is create a random array. And this random array needs to be in the same shape as our input IDs tensor. Then we want to mask around 15% of those tokens. To do that, we can use torch.rand and check where rand is less than 0.15. That gives us an array in the shape of our input tensor, which is going to be our input IDs, where every value is True or False; for each token, there's a roughly 15% chance of it being True. So that's our first criterion for our random masking array.
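As a minimal sketch of that step (the tensor here is a stand-in, since the real input IDs come from the tokenizer later on):

```python
import torch

# a stand-in for our input IDs tensor; in the video this comes from the tokenizer
tensor = torch.randint(5, 30000, (2, 512))

rand = torch.rand(tensor.shape)  # uniform [0, 1) values, same shape as the input IDs
mask_arr = rand < 0.15           # True for each token with a ~15% chance
```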
We can call this the mask array. The second criterion, again, is something I covered in the other videos, but in short: we don't want to mask special tokens. We can see up here that we have two special tokens around each sequence, and I'm just going to add a little bit of padding as well, so there are padding tokens in there too; all of these are special tokens that we don't want to mask. So we also put in a condition where the tensor is not equal to 0, and so on. You can either do it like that, where you specify each special token individually, and you will want to do that sometimes, like with BERT, because its special tokens are spread out, at 0, 100, 101, and so on. But with our tokenizer, every special token is 2 or below, so we can just write it as a single condition: where the value is greater than 2. So altogether, we're going to mask tokens that have a randomly generated value under 0.15 and that are not special tokens, as in the line below.
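Continuing the snippet above, the combined masking condition looks something like this:

```python
# combine the ~15% chance with a special-token filter; with our tokenizer
# <s>=0, <pad>=1 and </s>=2, so only IDs above 2 are fair game for masking
mask_arr = (rand < 0.15) * (tensor > 2)
```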
Now what we want to do is loop through each row in the tensor. So we write for i in range(tensor.shape[0]), where shape[0] is how many rows we have in our tensor. We need the loop because each row is going to have a different number of tokens to mask. Again, if this is confusing, I have those other videos. But you don't need to specifically know everything here; this is just how we mask those roughly 15% of tokens. Inside the loop, we want to take the mask array at the current row. From that, we essentially get a load of True or False values, and calling nonzero on it says: give me the positions of all the values that are not 0, in other words the indices of the True values. Then we use torch.flatten here to remove the outer dimension, and at the end we convert to a list so that we can do some fancy indexing in a moment.
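Still continuing the snippet, for a single row that chain looks like this:

```python
i = 0  # first row, just for illustration
selection = mask_arr[i].nonzero()              # indices of the True values, shape (n, 1)
selection = torch.flatten(selection).tolist()  # flat list of ints, ready for indexing
```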
Then we want to use that selection of indices to specify where we're going to place our mask tokens. So what is our mask token ID? Well, we can actually find it over here in our vocab.json. Scroll to the top, and we see our mappings here, with the mask token at 4. So we're going to make the values at those indices equal to 4. With that, we've masked our input IDs, and we just want to return the tensor.
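Putting all of that together, the masking function as described comes out looking roughly like this (the special token IDs are the ones read off our vocab.json; check yours, as they can differ):

```python
import torch

def mlm(tensor):
    """Mask ~15% of non-special tokens in a (rows, seq_len) tensor of input IDs.

    Assumes RoBERTa-style special token IDs from our trained tokenizer:
    <s>=0, <pad>=1, </s>=2, and <mask>=4.
    """
    rand = torch.rand(tensor.shape)
    # ~15% chance per token, excluding special tokens (IDs 0, 1, 2)
    mask_arr = (rand < 0.15) * (tensor > 2)
    for i in range(tensor.shape[0]):
        # indices of the tokens to mask in this row
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = 4  # 4 is our <mask> token ID
    return tensor
```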
Next, we gather our training files. We just need to do from pathlib import Path, and then a glob will give us a list of all of our training files.
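Something like this, assuming the text files from the previous video are sitting in a local data directory (the exact path here is a placeholder; point it at wherever your files live):

```python
from pathlib import Path

# collect every .txt file under the data directory (placeholder path)
paths = [str(path) for path in Path('data/text/oscar_it').glob('**/*.txt')]
```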
These are plain text files containing our Italian samples. Each sample is separated by a newline character, and each file contains about 10,000 samples.
Now we're going to create the three tensors that I mentioned at the start. We have the labels, the input IDs, and then we also have the attention mask as well. So: input IDs; the attention mask, which I'm just going to call mask; and labels. Before I fill them, I'll initialize each one as an empty list. And what I'm going to do is loop through each path in our list of paths. For each path, we're going to load the file, extract our data, and convert it into the correct format that we need here. So with the file open, we want to write lines = fp.read().split('\n'), like that, to get our samples. Then we tokenize those lines with our max length, which is going to be 512, padding to that max length and truncating anything longer. And I'm thinking here we can set return_tensors='pt' to use PyTorch; we can also see that in the output up here, by the way. Now, the labels are just the input IDs exactly as the tokenizer produces them, so we take those out and append them to our labels list, and the same goes for the attention mask. The input IDs themselves are what we built this masked language modeling function for. So to do that, we just want to write sample.input_ids, and, before I forget, that needs to go within mlm(), like that. But we don't want mlm to modify our labels in place, so we pass it a copy, and that will be done using .detach() and .clone(), like that.
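Here is the whole loop as a sketch. The tokenizer load is an assumption: it stands in for however you saved the tokenizer trained in the previous video, and the directory name is a placeholder.

```python
from transformers import RobertaTokenizerFast

# placeholder path: the tokenizer we trained and saved in the previous video
tokenizer = RobertaTokenizerFast.from_pretrained('our_tokenizer_dir')

input_ids = []
mask = []
labels = []

for path in paths:
    # each file holds newline-separated samples
    with open(path, 'r', encoding='utf-8') as fp:
        lines = fp.read().split('\n')
    sample = tokenizer(lines, max_length=512, padding='max_length',
                       truncation=True, return_tensors='pt')
    labels.append(sample.input_ids)          # the untouched token IDs
    mask.append(sample.attention_mask)
    # clone before masking so the labels stay intact
    input_ids.append(mlm(sample.input_ids.detach().clone()))
```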
So input IDs at the moment is just a big list of tensors. I don't know if it's a good idea to print it, but here we go. What we can do, rather than keeping a list of tensors, is use torch.cat. It expects a list of tensors to be passed to it, which is why I've set things up this way, where we have lists and we just append tensors to them. So it will concatenate our tensors, which is exactly what we want. What we want to do now is take input_ids, mask, and labels and concatenate each list into a single tensor, so that they're ready for formatting into a dataset.
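In code, that's just:

```python
input_ids = torch.cat(input_ids)  # list of (n, 512) tensors -> one (total, 512) tensor
mask = torch.cat(mask)
labels = torch.cat(labels)
```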
We know that we should have mask tokens in our input IDs now, but not in our labels. Let's run that and just compare the two.
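A quick way to eyeball that is to print the first few tokens of the first row of each; wherever the two rows differ, input_ids should show our mask ID (4):

```python
print(labels[0][:20])     # original token IDs
print(input_ids[0][:20])  # same row, with ~15% of tokens replaced by 4
```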
Now, the format that our dataset and our model need is a dictionary where we have input_ids, attention_mask, and labels as the keys. So input_ids maps to this masked tensor, attention_mask to mask, and labels to labels.
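So we build:

```python
encodings = {
    'input_ids': input_ids,   # the masked token IDs
    'attention_mask': mask,
    'labels': labels,         # the original, unmasked token IDs
}
```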
From that dictionary, we want to create a dataset object, and from the dataset a data loader, which is what we'll actually use during training. So we write class Dataset — call it whatever you want — and we inherit from torch.utils.data.Dataset, like that. In the init method, we store our encodings internally, so we write self.encodings = encodings.
Then there are two other methods that this object needs. We need a length method, so that we can call len(dataset), which allows the data loader to work out how many samples it has to draw from; and a getitem method, which allows it to extract the tensors — the input IDs, attention mask, and labels — at a given index. For the length, we don't need to pass anything in. We just return the shape of self.encodings['input_ids'] and take the first dimension, which is the number of rows, in other words the length.
So if I take the input IDs — in fact, I can just do that here to show you — the first dimension of the shape is the sample count. For getitem, our data loader is requesting the item at a certain position. Now, what we could do is return self.encodings['input_ids'] at that index, take each tensor like that, and just list them all out by name. But a nicer way is a dictionary comprehension, so that we don't care about the exact structure of the dataset: for every key and tensor in the encodings, we return the specific index of that tensor under that key. See, that way we get essentially everything in our dataset, whatever keys it holds; we're just looping through the items, returning them, and specifying which index we're returning here.
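Assembled, the class looks something like this:

```python
import torch

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # store the dict of tensors built above
        self.encodings = encodings

    def __len__(self):
        # number of samples = number of rows in input_ids
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return row i of every tensor, whatever keys the dict happens to hold
        return {key: tensor[i] for key, tensor in self.encodings.items()}
```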
Once we have written that, we can initialize our dataset, and we just pass in our encodings there.
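That is simply:

```python
dataset = Dataset(encodings)
print(len(dataset))  # should match the total number of samples
```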
And that is pretty much it for our input pipeline; the last piece is the data loader. The DataLoader class comes from the same place as our Dataset base class, torch.utils.data. The batch size will depend on how much your computer can handle, so just play around with that and see what works. And we also want to shuffle our dataset as well.
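As a sketch, with a batch size of 16 as a stand-in (tune it to your hardware):

```python
# shuffle so each epoch sees the samples in a different order
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```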
After that, obviously, we want to feed it into the model and train, and we're going to cover that in the next video. So thank you for watching, and I will see you in the next one.