Training and Testing an Italian BERT - Transformers From Scratch #4
Chapters
0:00 Intro
0:35 Review of Code
2:02 Config Object
6:28 Setup For Training
10:30 Training Loop
14:57 Dealing With CUDA Errors
16:17 Training Results
19:52 Loss
21:18 Fill-mask Pipeline For Testing
21:54 Testing With Laura
Hi, welcome to the video. This is the fourth video in the Transformers From Scratch series. If you haven't been following along, we've essentially covered what you can see on the screen: we got some data, we built a tokenizer with it, and then we set up our input pipeline, ready to begin actually training our model, which is what we're going to cover in this video.
So let's move over to the code. We can see here that we have our input data and our input pipeline, and we're now at a point where we have a PyTorch DataLoader ready, and we can begin training a model with it.
Let's have a quick look at the structure of our data. When we're training a model for masked language modeling (and this applies to training RoBERTa as well), we need three tensors: our input IDs, our attention mask, and our labels. In the input IDs, roughly 15% of the tokens have been replaced with mask tokens, which are those number fours, while the labels tensor keeps the real token IDs. That's the structure of our input data. We've created a torch Dataset from it and used that to create a torch DataLoader.
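A minimal sketch of that Dataset/DataLoader step, assuming the three tensors described above are already built and named input_ids, attention_mask and labels (class and variable names here are illustrative, not necessarily the exact ones in the notebook):

```python
import torch

class MlmDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings  # dict of tensors, all the same length

    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return a single sample as a dict of 1-D tensors
        return {key: tensor[i] for key, tensor in self.encodings.items()}

encodings = {'input_ids': input_ids,
             'attention_mask': attention_mask,
             'labels': labels}
dataset = MlmDataset(encodings)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```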
Now we can begin setting up our model for training. There are a few things to do here; we can't just begin training straight away. The first thing we need to do is create a config object. The config object is something we use when we're initializing a transformer from scratch, in order to initialize it with a certain set of parameters.
So we'll do that first: from transformers we want to import the RoBERTa config class. One of the main parameters is the vocab size, and this needs to match whichever vocab size we already used when building our tokenizer. For me, if I go all the way up to where I created the tokenizer, I can see that it's this number here. But if you don't have that to hand, you can just ask the tokenizer for its vocab size and it will return it, so let's replace that value with it.
As well as that, we also want to set max_position_embeddings. This is our max length plus 2 in this case; max length is set up here, and we add the 2 because of the added special tokens. If we don't do that, we'll end up with an index error, because we'd be going beyond the size of the position embedding.
Then we want our hidden size. This is the size of the vectors that the embedding layers within RoBERTa will create, so each token will be assigned a vector of that size.
Then there are the deeper internals of the model: we want the number of attention heads and the number of hidden layers. The default number of hidden layers for RoBERTa is higher than what I'm using; I've reduced it for the sake of keeping training times a little shorter. Finally there's the type vocab size, which is the number of different token types we have; we just have one, so we don't need to worry about that.
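A rough sketch of the config described above; the exact numbers depend on your tokenizer and how much compute you have, so treat them as placeholders rather than the video's exact values:

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=30_522,            # must match the vocab size of the tokenizer we built
    max_position_embeddings=514,  # max_length (512 here) + 2 for the added special tokens
    hidden_size=768,              # size of the vector assigned to each token
    num_attention_heads=12,
    num_hidden_layers=6,          # fewer than RoBERTa's default, to keep training times shorter
    type_vocab_size=1             # we only have one token type
)
```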
That's our configuration object ready, and we can import and initialize a RoBERTa model with it. So from transformers, and this is kind of similar to what we usually do, we import RobertaForMaskedLM, because we're training using MLM, masked language modeling. We initialize our model using that RobertaForMaskedLM class and just pass in our config, and that right there initializes our RoBERTa model: a plain, freshly initialized RoBERTa model.
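In code, that step looks something like this (assuming the config object from the sketch above):

```python
from transformers import RobertaForMaskedLM

# initialize a fresh, randomly weighted RoBERTa model with an MLM head from our config
model = RobertaForMaskedLM(config)
```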
Now we're setting everything up for training. We have our model, and the first thing we need to do is decide which device we're going to be training on, whether that's the CPU or a CUDA-enabled GPU. The typical line of code that decides this for you checks whether CUDA is available and uses it if so; otherwise we fall back to the CPU through torch.
Training on CPU just takes a really long time, so if you are using a CPU you'll have to leave it overnight for sure, maybe even longer, even if it's just a little bit of data. Hopefully you have a GPU; if not, you're just going to have to be patient, that's all. Or you could maybe try Google Colab, but you'd probably need the premium version, because otherwise it's just going to shut off after an hour or two. I don't really use it, so I don't know how long it would train for before deciding it's done, and the free GPU is also not that good anyway.
However you do it, after that we want to move our model to our device, whether that's a GPU or the CPU. We're going to get a really big output now.
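A sketch of that device setup:

```python
import torch

# use a CUDA-enabled GPU if one is available, otherwise fall back to CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# move the model's parameters onto that device; in a notebook, the returned
# value displays the full model structure (the big output mentioned above)
model.to(device)
```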
That output is just our model; it shows the structure of the model, and we can see a few interesting things. We've got RobertaForMaskedLM, inside that the RoBERTa model itself, and inside that we have our embeddings and then the encoder layers, which go from 0 to 5, so six of them. Then we have the outputs here, and the final bit, which is the language modeling head, the MLM head.
Next we import AdamW, which is Adam with weight decay, and we're going to activate the training mode of our model, which gives us loads of output again, so let's just remove that; there we go, easier. Then the learning rate: from what I remember you can go from somewhere around 1e-4 down to around 1e-5, but obviously it's going to depend on how much data you have and loads of different things.
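A sketch of that setup; the 1e-4 learning rate is the value settled on later in the video, so adjust it for your own data:

```python
from transformers import AdamW  # torch.optim.AdamW works equally well

# put the model into training mode
model.train()

# AdamW is Adam with decoupled weight decay
optim = AdamW(model.parameters(), lr=1e-4)
```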
That's all set up. Now we're just going to create our training loop. For the training loop we import tqdm so we can see how far through we are, and we initialize our loop object by wrapping tqdm around our data loader (what was the name of that data loader again? Let me check; and sorry, I need that inside the same cell). Then, for each batch, we run through each of the steps that we perform on every single training step. First we zero the gradients in our optimizer with zero_grad. The reason we do this is that after the first step our optimizer will have a set of gradients assigned to it, and on the next step we don't want those residual gradients to still be there; we essentially want to reset them for the next step. That's what we're doing here. Then we want our tensors.
Once we've extracted our tensors, we just need to feed them into our model. We capture the outputs from the model by calling the model with our input IDs, an attention mask equal to mask, and our labels. Now everything has been fed into the model and we have our outputs, and we need to extract a few things from them, mainly the loss. To get gradients of that loss for all the different parameters in our model, we call loss.backward(), which backpropagates through all of those values; the optimizer step then updates the weights.
That's everything we need to train the model. There are just a few extra things for the progress bar: I want a little bit of information there so I know what's going on. I use loop.set_description to print out the epoch, and then I set the postfix as well with loop.set_postfix, where I just want to see the loss, so that's loss.item().
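Putting it all together, the loop described above looks roughly like this, with variable names following the earlier sketches:

```python
from tqdm import tqdm

epochs = 1  # the video trains roughly one epoch at a time

for epoch in range(epochs):
    # wrap the dataloader in tqdm so we get a progress bar
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # reset any gradients left over from the previous step
        optim.zero_grad()
        # move the batch tensors onto the training device
        input_ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass; because labels are supplied, the output includes the MLM loss
        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss
        # backpropagate and update the weights
        loss.backward()
        optim.step()
        # show the epoch and current loss on the progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
```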
I probably just need to refresh everything. I hate CUDA errors; one moment.
Okay, so I finally figured it out; it took so long. A few tips: if you hit a CUDA error, change the device to CPU, then rerun everything and you should get a more understandable error. So if we come down here, where I've changed it to CPU, you can see that we get an index error, and if you scroll down, it's "index out of range in self". The usual reason for this would be not adding the extra two tokens onto the max position embeddings, but, you know, we do add them. It took me a really long time to realize that the argument name itself is wrong: there should be an 's' on the end. That was the error. So yeah, that was literally it, and it took me so long to figure out. But now we have it, which is good; I just need to run everything again.
So I'm just going to run through everything again, and remove this cell here where I changed the device to CPU.
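That temporary debugging step looked roughly like this (a throwaway cell, removed once the error was found):

```python
# CUDA error messages are often cryptic, so temporarily run on CPU
# to get a readable traceback, then rerun the failing cell
device = torch.device('cpu')
model.to(device)

# The actual bug here: the config argument was missing its trailing 's'.
# It must be `max_position_embeddings`; the misspelled name doesn't override
# the real setting, so the position embedding stays too small and the
# forward pass fails with "index out of range in self".
```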
So we're back, and we've finished training our model. It has taken a long time; this is a few days later, and I made a few changes during training as well, so this definitely wasn't the cleanest training process, because I was experimenting as I went. We've trained for three and a bit epochs, and I've trained on the full dataset as well. If I come up here, I think I print out how much data it was. So if we come down here, yeah, there's a lot more data here: we had 2 million samples in that final run. Initially, when we started training, we started with a learning rate of 1e-5; I looked into it a little and the loss just was not really moving, as I'll show you in a minute, so I switched to 1e-4, and that started moving things along a lot quicker. So that was good, and in total, like I said, it was three and a bit epochs. The only other thing I did was train one epoch at a time, because I wanted to see how the results were looking after each epoch, and that was quite interesting, so let me show you that.
Okay, so this is after the first epoch. What I'm doing here is I've got this fill object, which is a fill-mask pipeline, and I'm entering "ciao", then our mask token, then "va". I wanted it to say "ciao come va", so in the middle it would have to predict "come".
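A rough sketch of that test, assuming model is the trained RobertaForMaskedLM and tokenizer is the tokenizer built earlier (they could equally be loaded from a saved model directory instead):

```python
from transformers import pipeline

# build a fill-mask pipeline around the freshly trained model
fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# ask the model to fill the gap in "ciao <mask> va", hoping for "come"
fill('ciao <mask> va')
```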
This is after the first epoch, and you can see it's just putting in essentially random tokens. Yeah, not the best. Then we move on to the second epoch and it's getting better: "ciao chi va", "ciao cosa fa" (I don't know if I'm saying the "ch" in Italian right, I always get it messed up; if there are any Italians watching, I'm sorry). At least we're getting words now, but none of these make any sense. The rest of them are nonsense too, so ignore those four. However, at the top we get this score of 0.33, and we get "ciao come va" back. That's what we wanted, so that's good; it means it's working. This was after the third and a bit epochs.
So here we have our loss. I don't know why this last one is so short, actually; it doesn't look like I finished training for the full epoch, though I thought I did, so maybe something happened. I'm not sure, but fine, it is what it is. The first set of training I did was here, and you can see in the middle my computer went to sleep for a bit overnight, because it was just so loud. Then the loss continues going down. This first epoch is when we were at 1e-5, and then here I was testing 1e-4, and you can see that straight away it goes down way quicker, so I thought, okay, we're going to go with that, it's clearly a lot better. Then it continued over here for the next epoch, and then the final one here, which didn't seem to change much anyway. But there was a pretty clear difference. So that's the loss over time.
We've seen the results from that, so now let's move on to actually testing the model. I'm going to bring in Laura, and I'm going to just open up the notebook. So this is the testing we're going to do: we're using the fill-mask pipeline we've got here. What I'm going to do is get Laura to come in, write some Italian sentences, and put the mask token in place of one of the words. So I will see you in a minute. This is Laura; she can speak Italian, so she's going to go through this and hopefully say it's good. Let's see. Ciao.
Okay, so all you need to do is this: we have a sentence here, you just write some Italian, and then for one of the words in there we replace it with this mask text here. The model is then going to try to predict what should be there, and hopefully it will predict the right word. Let's see. So just write some Italian phrase, not too difficult yet, and we'll see.
So do I have to write... No, no, you just write a sentence, and okay. Okay, and then which word should we cover? "Come va", okay. Yeah, but the second one, "chi va". "Chi va", ah, it's almost... does "chi va" mean anything, like "who"? Like, is there someone... I understand it because I'm Italian, but I don't think we usually say that. I'm going to take that as fine; I'm going to take that as it's good. Wait, actually, what about these ones? Okay, but it just might be that after "buongiorno" I wouldn't expect that.
So you can just put another one: put the fill mask again, right in the sentence, so we're here. Mm-hmm, yeah. And then what do you want to replace? "Incontriamo" maybe, or "dove". Yeah, that's good though: reading the predictions back, we get variants like "...siamo" and "...ritroviamo". Yeah, that's quite good.
Okay, should we try "dove", using the same phrase? "Dove", and in the second one "come ... incontriamo". That one was good, uh-huh. See, "cosa ...", you know, that's good; "cosa ...", "cosa ...". So it didn't find exactly what we said before, but yeah, what it said makes sense. So now try something hard, like something grammatically difficult. Hmm.
No, I'm thinking... I don't know when it's like that, you know. Okay, yeah, and then what should we replace? "Avessimo". That's very good. No, but it's good, because "avessero" is third person plural, it's like "they had", and "avesse" is third person singular, so "if he had" or "if she had" chosen something. So the sentence is "what would have happened if we had chosen another day". The first one, "se avesse scelto", would be the third person. "Se avessi scelto" would be the first person, so "if I had chosen another day". "Se aveste scelto" is second person plural, so "if you all had chosen another day". Let's see this one now, see how it is. Yes, yeah, this is good: "se avesse scelto". No, maybe not that one, no, but the first three are very good. Yeah.
"Se loro": so if we put "se loro", if we specify the person, maybe it will pick the correct form. And then the other one is "loro hanno", right? Incorrect, but yeah, it's in the wrong place; it's saying the right thing, like the meaning is correct, but the grammar is not correct. Okay, yeah. At least "ciao come va" came back, but that was all I had tested it with before, so I was a little bit worried. Okay, anything else? Thank you. You're welcome.
So that's pretty much everything we needed for building our model. We are going to do one more video after this, though, where we're going to upload our model to the Hugging Face model hub, and then we'll be able to download it directly from Hugging Face, which I think will be super cool to do and to figure out how to actually pull that together.
Thank you for watching and I will see you again in the next one.