Building MLM Training Input Pipeline - Transformers From Scratch #3


Chapters

0:00
2:16 Create Our Masked Language Modeling Function
21:59 Initialize Our Data Set
22:20 Data Loader

Whisper Transcript

00:00:00.000 | Hi, welcome to this video.
00:00:02.240 | So this is the third video in our Transformers
00:00:06.800 | from Scratch miniseries.
00:00:10.400 | And in the last two videos, we basically
00:00:14.480 | got a load of data and trained our tokenizer, which
00:00:18.480 | is what you can see here.
00:00:20.880 | So I'm going to have to rerun these.
00:00:29.080 | And we just tokenize some Italian.
00:00:33.560 | So this is an Italian BERT-style model, a RoBERTa model.
00:00:39.040 | So we can write something like this,
00:00:41.520 | which means, hello, how are you?
00:00:43.780 | And it will tokenize our text here.
00:00:48.520 | Now, this is where we got to.
00:00:51.000 | What we now want to do is build out an input pipeline
00:00:55.400 | and train a model.
00:00:56.960 | Now, this is reasonably, I would say, involved,
00:01:03.200 | because we need to do a few things.
00:01:07.520 | First thing is we need three different tensors
00:01:13.640 | in our model here.
00:01:14.400 | We need the input IDs and attention mask.
00:01:17.920 | We also need the labels tensor as well.
00:01:21.200 | So the labels tensor is actually just
00:01:24.000 | going to be our input IDs as they are at the moment.
00:01:28.320 | But our input IDs are not going to be that.
00:01:30.640 | Our input IDs, they need to be passed
00:01:34.760 | through a masked language modeling script, which
00:01:38.400 | will mask around 15% of the tokens within that tensor.
00:01:44.280 | And then whilst we're training, our model
00:01:47.080 | is essentially going to try and guess
00:01:49.560 | what those masked tokens are.
00:01:52.280 | And we'll optimize the model using
00:01:54.280 | the loss between the guesses that the model outputs
00:01:58.200 | from the input IDs and the real values that are our labels.
00:02:05.160 | So that's essentially how it's going to work.
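
As a rough illustration of that relationship (the token values below are made up; 0 and 2 are the start/end special tokens and 4 is the mask token, as we'll see later):

    import torch

    # labels keep the original token IDs
    labels = torch.tensor([[0, 321, 67, 845, 2]])
    # input_ids have ~15% of the non-special tokens swapped for the <mask> ID (4)
    input_ids = torch.tensor([[0, 321, 4, 845, 2]])  # the 67 has been masked
    # during training the model predicts what sits behind each mask, and the loss
    # compares those predictions against the matching positions in labels
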
00:02:11.720 | I suppose it's a lot easier said than done.
00:02:13.840 | So first thing we're going to do is create our masked language
00:02:18.120 | modeling function.
00:02:20.600 | If you have watched some of my videos before,
00:02:24.760 | we quite recently did a couple of videos
00:02:32.960 | on masked language modeling.
00:02:32.960 | I'll leave a link to those in the description
00:02:35.440 | because the code is pretty much the same as what
00:02:38.080 | we cover there.
00:02:39.200 | And I will cover it very quickly here, but not too in-depth.
00:02:46.080 | So if you're interested, those links
00:02:50.320 | will be there in the description.
00:02:53.920 | So the very first thing we need to do, which I haven't done
00:02:59.360 | yet, is import torch.
00:03:03.360 | So we're using PyTorch here.
00:03:07.120 | And we need to create a random array.
00:03:11.760 | So we write torch rand.
00:03:14.800 | And this random array needs to be in the same shape
00:03:17.560 | as our tensor that we've input up here.
00:03:20.400 | So let's write tensor dot shape.
00:03:25.120 | And then we want to mask around 15% of those.
00:03:32.200 | So to do that, we can use rand, where rand is less than 0.15.
00:03:40.040 | Because what we've created here is
00:03:43.560 | an array in the shape of our input tensor, which
00:03:46.720 | is going to be our input IDs, where every value is
00:03:50.480 | within the range of 0 to 1.
00:03:55.160 | And that's completely random.
00:03:56.400 | So that means there should be--
00:04:00.280 | or for each token, there's a roughly 15% chance
00:04:03.640 | of that value being under 0.15.
00:04:06.920 | So that's our first criteria for our random masking array.
00:04:14.280 | That's our roughly 15% - we can call this our mask array as well.
00:04:19.680 | But there's a few other criteria as well.
00:04:21.440 | And again, I covered that in the other videos.
00:04:25.720 | But in short, we don't want to mask special tokens.
00:04:30.680 | So we can see up here we have two special tokens.
00:04:33.200 | If we add padding--
00:04:35.160 | so I'm just going to add a little bit of padding,
00:04:37.920 | not loads.
00:04:38.680 | Let's go max length 10.
00:04:40.960 | And we write padding equals max length.
00:04:44.840 | If we do that, we get these extra ones here.
00:04:48.000 | They're our padding tokens.
00:04:49.440 | So we basically want to say we don't
00:04:51.480 | want to mask our 0's, 2's, or 1's,
00:04:55.600 | because they're special tokens that we don't want to mask.
00:04:58.520 | So we also just put where tensor is not equal to 0.
00:05:06.000 | And let's just copy that.
00:05:08.760 | It's a little bit easier.
00:05:11.160 | And also not equal to 1.
00:05:12.640 | And it will also not be equal to 2.
00:05:14.880 | And I do wonder if--
00:05:23.320 | yeah, we could make that a bit nicer.
00:05:25.480 | So if we just do--
00:05:28.680 | so you can either do this, where you specify each token.
00:05:32.160 | And you will want to do that sometimes, maybe,
00:05:34.760 | like with BERT, because your special tokens are
00:05:37.280 | in the range of, like, 0, 100, 101, and so on.
00:05:42.280 | There's a few different ones.
00:05:43.520 | But because we've got everything,
00:05:45.040 | it's either 2 or below, we could just write this.
00:05:48.760 | So we could just say where it is greater than 2.
00:05:54.680 | So this is like an AND statement saying,
00:05:57.480 | we're going to mask tokens that have a randomly generated
00:06:02.840 | value of less than 0.15.
00:06:04.640 | That's our 15% criteria.
00:06:08.120 | And they're not a special token, e.g.,
00:06:10.440 | they are greater than the value 2,
00:06:12.640 | because our special tokens are 0, 1, and 2.
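
So, roughly, the masking criteria we've just described look like this (a sketch, assuming tensor holds our batch of input IDs):

    rand = torch.rand(tensor.shape)            # random values in [0, 1), same shape as input_ids
    mask_arr = (rand < 0.15) * (tensor != 0) * (tensor != 1) * (tensor != 2)
    # or, since all of our special tokens are 2 or below, simply:
    mask_arr = (rand < 0.15) * (tensor > 2)    # the * acts as a logical AND here
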
00:06:17.120 | So that's cool.
00:06:18.320 | And now what we want to do is loop through each row
00:06:25.600 | in our tensor.
00:06:26.760 | So we want to do for i in range tensor dot shape 0.
00:06:35.520 | So this is how many rows we have in our tensor.
00:06:38.040 | And we can't do this in parallel,
00:06:40.440 | because each row is going to have a different number
00:06:44.840 | of tokens that will be masked.
00:06:47.320 | So if we did this in parallel, we
00:06:49.600 | end up trying to fit different size rows
00:06:52.640 | into an equally sized tensor.
00:06:55.200 | So we can't do that.
00:06:57.560 | Again, if this is confusing, I have those videos.
00:07:02.680 | But I mean, you don't need to specifically know everything
00:07:08.040 | that's going on here.
00:07:09.560 | This is just how we mask those roughly 15% of tokens.
00:07:14.600 | So we want torch flatten.
00:07:20.720 | And this bit is a bit confusing.
00:07:23.800 | But we want to take the mask array at the current position.
00:07:28.800 | I want to say where it's not 0.
00:07:30.960 | So when we create this mask array,
00:07:34.240 | we essentially get a load of true or false values
00:07:37.880 | in the size of our tensor shape.
00:07:41.760 | Where we have 1s, that is a mask.
00:07:46.760 | And what we're doing here is we're saying,
00:07:48.800 | get me a list of all the values that are not 0,
00:07:52.520 | which are 1s.
00:07:54.160 | And that gives us a list within a list.
00:07:58.520 | So we get something like this.
00:07:59.760 | And it will say, for example, indices 2, 5, 18.
00:08:11.280 | They are where your mask tokens will be.
00:08:14.440 | And then we use torch flatten here to remove that outer list.
00:08:19.960 | And at the end here, we're going to convert to a list
00:08:23.440 | so that we can do some fancy indexing in a moment.
00:08:28.240 | And that fancy indexing looks like this.
00:08:30.480 | So we have our tensor.
00:08:31.800 | We're specifying the current row.
00:08:34.800 | So we're going row at a time.
00:08:37.400 | And then we want to specify that selected number of indices,
00:08:42.480 | which are where we're going to place our mask.
00:08:46.880 | Now, what does the mask token look like?
00:08:51.360 | Well, we can actually find it over here in our vocab.json.
00:09:04.680 | Yeah.
00:09:06.000 | So scroll to the top, and we see our mappings here.
00:09:11.120 | So the mask token is number 4.
00:09:15.000 | So that's what we're going to use.
00:09:19.280 | Switch back over.
00:09:21.960 | So we're going to make those values equal to 4.
00:09:26.560 | That's our mask.
00:09:27.800 | Then at that point, we have successfully
00:09:32.200 | masked our input IDs, and we want to return the tensor.
00:09:38.120 | So that's our masking function.
00:09:41.360 | That's a big part of this video.
00:09:43.680 | That's one of the harder parts.
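
Putting all of that together, the masking function we've just built looks roughly like this (a sketch of what's written in the video, assuming the mask token ID is 4 and the special tokens are 0, 1, and 2):

    def mlm(tensor):
        # random values in [0, 1) with the same shape as the input IDs
        rand = torch.rand(tensor.shape)
        # ~15% of tokens, but never the special tokens (IDs 0, 1, and 2)
        mask_arr = (rand < 0.15) * (tensor > 2)
        for i in range(tensor.shape[0]):
            # indices of the tokens to mask in this row
            selection = torch.flatten(mask_arr[i].nonzero()).tolist()
            # replace them with the <mask> token ID
            tensor[i, selection] = 4
        return tensor
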
00:09:46.280 | So now what we're going to do is I'm
00:09:48.240 | going to scroll up a little bit to here.
00:09:52.520 | So we have-- I'm just going to take this.
00:09:57.880 | So this will give us a list of all of our training files.
00:10:03.120 | So here.
00:10:05.720 | And we just need to do from pathlib import Path.
00:10:11.160 | OK, let's have a look at what we have.
00:10:15.640 | So this is just a list of everything
00:10:20.680 | that we have over here.
00:10:21.640 | So these are text files containing our Italian samples.
00:10:26.720 | Each sample is separated by a newline character.
00:10:30.720 | And each file also contains about 10,000 samples.
00:10:35.000 | So we have quite a bit of data.
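
In code, that file listing looks something like this (the directory name is just a placeholder for wherever you saved the text files in the earlier videos):

    from pathlib import Path

    # hypothetical path - point this at your own folder of newline-separated .txt files
    paths = [str(x) for x in Path('oscar_it').glob('*.txt')]
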
00:10:38.040 | And what we're going to do here is
00:10:41.120 | we're going to create our three tensors that I mentioned
00:10:46.400 | before.
00:10:47.040 | We have-- actually, before that, I need to make some lists; I haven't made them yet.
00:10:52.480 | So we have the labels and input IDs,
00:10:54.880 | and then we also have the attention mask as well.
00:10:57.760 | So let's first initialize a list.
00:11:02.800 | So input IDs, attention mask, or I'm just going to call it mask,
00:11:13.240 | and labels.
00:11:16.280 | And what we're going to do is I also--
00:11:19.840 | so we're going to use a progress bar here.
00:11:21.600 | So I'm just going to import.
00:11:23.800 | So from tqdm.auto import tqdm.
00:11:29.640 | So I'm just going to import that as well.
00:11:33.840 | And what I'm going to do is loop through each path in our--
00:11:38.720 | I'm going to wrap it in TQDM.
00:11:40.360 | This creates our progress bar in our paths.
00:11:42.560 | For each path, we're going to load it, extract our data,
00:11:50.480 | convert it into the correct format that we need here,
00:11:53.720 | and append each one of those to these lists,
00:11:58.080 | and then create a big tensor out of that.
00:12:01.400 | So we want to write with open.
00:12:05.120 | And then here we have our path, we're reading it,
00:12:08.560 | and the encoding is UTF-8, as f. We
00:12:15.040 | want to write text equals f.read().split('\n'), like that.
00:12:22.920 | So I'm going to rename it lines.
00:12:27.320 | So this is just a big list of 10,000 samples
00:12:32.800 | that are all Italian, OK?
00:12:35.680 | So then we want to encode that.
00:12:37.360 | So we write sample equals tokenizer.
00:12:40.920 | Lines, we want our max length, which is going to be 512.
00:12:46.440 | We want padding up to that max length.
00:12:50.200 | And we also want to truncate anything
00:12:51.800 | that is further than that.
00:12:53.840 | So truncation equals True.
00:12:58.840 | OK, that's our tokenization done.
00:13:03.360 | Then we want to extract.
00:13:05.640 | We want to extract all of those and add them to our list.
00:13:11.560 | So we get our labels first.
00:13:14.180 | Now, the labels are just the input IDs produced
00:13:16.880 | by our sample.
00:13:18.280 | So sample, input IDs.
00:13:21.360 | And I'm thinking here we can do return_tensors equals 'pt', to use PyTorch.
00:13:28.000 | So append our input IDs to labels.
00:13:33.680 | And then we have our mask.
00:13:34.760 | We want to append the sample attention mask.
00:13:38.680 | And then we can also see that up here, by the way.
00:13:45.360 | Here, this is what we're doing.
00:13:46.600 | We're taking those out, putting them into our list.
00:13:50.040 | And then-- so we have labels, mask,
00:13:51.840 | and we want to create our input IDs.
00:13:53.440 | Now, input IDs, that's what we built this masked language
00:13:58.360 | modeling function for.
00:14:00.520 | And in there, we need to pass our tensor.
00:14:03.720 | So to do that, we just want to write sample input IDs.
00:14:10.720 | And before I forget, that needs to go within MLM, like that.
00:14:16.720 | Like that.
00:14:18.160 | Now, I don't want to modify that tensor,
00:14:21.280 | because it's being appended to labels.
00:14:23.640 | So I'm going to create a clone of that.
00:14:27.160 | And that will be done using .detach() and .clone(), like that.
00:14:34.680 | So it's pretty good.
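
So the whole loop we've just built looks roughly like this (a sketch; tokenizer is the one trained in the previous video, and mlm is the masking function from above):

    from tqdm.auto import tqdm

    input_ids = []
    mask = []
    labels = []

    for path in tqdm(paths):
        with open(path, 'r', encoding='utf-8') as f:
            lines = f.read().split('\n')
        sample = tokenizer(lines, max_length=512, padding='max_length',
                           truncation=True, return_tensors='pt')
        # labels are the untouched input IDs
        labels.append(sample.input_ids)
        mask.append(sample.attention_mask)
        # clone before masking so the labels tensor isn't modified in place
        input_ids.append(mlm(sample.input_ids.detach().clone()))
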
00:14:39.800 | Let's run that.
00:14:44.960 | OK, and it's going to take a long time.
00:14:47.960 | So, yeah, I'm not going to use all of them.
00:14:53.800 | The time estimate was going up as well, so I have no idea
00:14:55.880 | how long that would take.
00:14:58.080 | Let's leave that for a little bit.
00:15:00.360 | Let's go with the first 50 for now.
00:15:04.160 | Still got to wait a little while,
00:15:05.560 | but at least not as long.
00:15:08.960 | So I'll leave that to run.
00:15:14.800 | Hopefully it shouldn't take too long.
00:15:16.360 | And, yeah, I'll see you when it's done.
00:15:20.040 | OK, so that's done.
00:15:23.920 | It wasn't too long.
00:15:26.200 | And if we just have a look--
00:15:28.280 | so input IDs at the moment is just a big list.
00:15:32.680 | I don't know if it's a good idea, but here we go.
00:15:35.520 | So we just have a list of tensors.
00:15:39.280 | What we can do is, rather than having a list of tensors,
00:15:43.640 | we can use something called torch.cat.
00:15:46.440 | And torch.cat expects a list of tensors
00:15:50.720 | to be passed to it, which is why I've done this,
00:15:53.600 | where we have lists and we just append tensors to it.
00:15:57.920 | And we can do that, and it will concatenate our tensors, which
00:16:02.440 | is pretty cool.
00:16:03.720 | So what we want to do now is we write input IDs,
00:16:10.040 | and we're just going to concatenate all of our tensors.
00:16:15.260 | So then they're ready for formatting into a data set.
00:16:22.920 | So we have mask here and labels here.
00:16:31.640 | We can also see, just worth pointing out,
00:16:36.920 | we have that mask token there, so we
00:16:38.600 | know that we have mask tokens in our input IDs now.
00:16:42.160 | If we-- let's run that, and let's just compare.
00:16:47.800 | So let's go input IDs, 0.
00:16:51.640 | That's quite a lot, so can I--
00:16:57.720 | I'll just do the first 10.
00:17:00.200 | And then let's do the same for labels.
00:17:02.240 | We'll see that we don't have these 4s,
00:17:03.920 | or we hopefully shouldn't have those 4s.
00:17:08.360 | So that's essentially a masking operation.
00:17:11.040 | So cover this with a mask here, and then
00:17:14.040 | same here and here, here and here.
00:17:17.320 | OK, cool.
00:17:19.080 | Now the format that our data set needs and our model needs
00:17:25.280 | is a dictionary where we have input IDs, which
00:17:30.280 | maps to input IDs, obviously.
00:17:32.120 | And you can guess the other two as well.
00:17:34.520 | So input IDs, this one, attention mask to mask,
00:17:41.600 | and the final one is labels.
00:17:46.320 | So those are our encodings.
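
As a sketch, the concatenation and the encodings dictionary look like this:

    input_ids = torch.cat(input_ids)
    mask = torch.cat(mask)
    labels = torch.cat(labels)

    encodings = {
        'input_ids': input_ids,
        'attention_mask': mask,
        'labels': labels
    }
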
00:17:49.640 | Now we create a data set object.
00:17:52.880 | To create a data set object-- in fact,
00:17:56.120 | actually, we create a data set object
00:17:58.880 | to create a data loader object, which is what we
00:18:01.920 | use to load data into our model.
00:18:03.840 | And that's essentially our input pipeline.
00:18:09.480 | But to create that data loader, we
00:18:11.160 | need to create a data set object.
00:18:14.640 | Now the data set object, we create that by--
00:18:18.400 | well, like this.
00:18:19.560 | So we do class Dataset - call it whatever you want.
00:18:23.240 | And we want torch.utils.data.Dataset, like that.
00:18:30.440 | We need an initialization function,
00:18:36.920 | which is going to store our encodings internally.
00:18:41.480 | Don't forget the def there.
00:18:44.680 | So we want to write self encodings equals encoding.
00:18:48.240 | So this is initializing our data set object.
00:18:53.280 | And then there's two other methods that this object needs.
00:18:56.760 | We need a length method so that we can say length data set.
00:19:01.320 | And it will return the number of samples
00:19:03.280 | that are in the data set.
00:19:04.800 | And we also need a get item method,
00:19:06.840 | which will allow the data loader to extract a certain--
00:19:13.680 | so say if it says, give me number one,
00:19:17.640 | it's going to go into this data set object
00:19:19.880 | and extract the tensors, the input IDs, attention
00:19:23.240 | marks, and labels at position one.
00:19:27.360 | So that's what we need to do there.
00:19:31.280 | So we'll do length first.
00:19:35.000 | And length, we don't need to pass anything in there.
00:19:37.440 | We're just calling length.
00:19:38.560 | So from that, we just want to return the self encodings,
00:19:44.120 | do input IDs.
00:19:45.400 | And remember before, we did this shape.
00:19:47.040 | And we took the first one, which is the length.
00:19:49.480 | So if I take input IDs, in fact, I can just do it here.
00:19:55.560 | So I'll copy that.
00:19:56.440 | If I go here, we get that 500K, which
00:20:02.800 | is the number of samples we have.
00:20:04.280 | That's what we want to return.
00:20:05.520 | So that's our length.
00:20:09.720 | And then we also have the get item.
00:20:11.720 | So here, we do want to pass an index value.
00:20:17.880 | So this is going to be--
00:20:19.280 | our data load is requesting a certain position.
00:20:23.600 | And for that, we want to return.
00:20:25.200 | So we're going to return dictionary.
00:20:26.700 | It needs to be in this format here.
00:20:28.960 | But we need to specify the correct index.
00:20:33.200 | Now, what we could do is we could do self encodings
00:20:40.520 | and then access our input IDs like that.
00:20:44.480 | We also-- I need to change that here.
00:20:47.400 | So it'll give us an error, dot shape.
00:20:53.920 | And we could do that.
00:20:55.640 | So we could take that like so, and then just say,
00:21:05.240 | index position.
00:21:06.640 | That's fine.
00:21:07.560 | You can do that if you want.
00:21:09.280 | But an easier way of doing it, where
00:21:11.000 | we don't need to specify the--
00:21:13.040 | we don't care about the structure of the data set.
00:21:16.200 | We just want to get it out.
00:21:19.400 | We don't need to specify it.
00:21:21.760 | We can just do this.
00:21:25.000 | You write key: tensor.
00:21:28.760 | So the specific index of that tensor, for key, tensor
00:21:36.200 | in self.encodings.items().
00:21:40.840 | So if we were to go encodings.items()--
00:21:45.480 | so we can do that here.
00:21:47.760 | See, we get essentially everything in our data set.
00:21:51.960 | So we're just looping through that, returning it,
00:21:54.160 | and specifying which index we're returning here.
00:21:58.240 | So once we have written that, we can initialize our data set.
00:22:02.680 | So we write dataset equals Dataset.
00:22:08.880 | And then we just pass in our encodings there.
00:22:11.220 | So just remove that, and encodings.
00:22:14.760 | That's it.
00:22:16.000 | So that's our data set.
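
Put together, the dataset class we've just walked through looks roughly like this:

    class Dataset(torch.utils.data.Dataset):
        def __init__(self, encodings):
            # store the encodings dictionary internally
            self.encodings = encodings

        def __len__(self):
            # number of samples is the number of rows in input_ids
            return self.encodings['input_ids'].shape[0]

        def __getitem__(self, i):
            # return row i of every tensor in the encodings dictionary
            return {key: tensor[i] for key, tensor in self.encodings.items()}

    dataset = Dataset(encodings)
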
00:22:18.400 | And now we initialize our data loader.
00:22:22.280 | So this is pretty much it for our input pipeline.
00:22:25.880 | So dataloader equals torch.utils.
00:22:29.240 | This is coming from the same area as our data set.
00:22:33.240 | DataLoader.
00:22:34.920 | Now we pass in our data set object.
00:22:37.680 | We want to specify batch size.
00:22:40.360 | So I typically go with 16.
00:22:43.680 | This will depend on how much your computer can
00:22:47.320 | handle at once as well.
00:22:48.820 | So just play around with that, see what works.
00:22:52.040 | And we also want to shuffle our data set as well.
00:22:54.400 | So yeah, that's our input pipeline.
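
As a sketch, that final step is just:

    dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
    # batch_size=16 is what's used here - tune it to whatever your hardware can handle
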
00:23:00.400 | After that, obviously, we want to feed it in and train
00:23:03.120 | our model with it.
00:23:03.880 | So we're going to cover that in the next video.
00:23:08.160 | So thank you for watching, and I will see you in the next one.