
Training and Testing an Italian BERT - Transformers From Scratch #4


Chapters

0:00 Intro
0:35 Review of Code
2:02 Config Object
6:28 Setup For Training
10:30 Training Loop
14:57 Dealing With CUDA Errors
16:17 Training Results
19:52 Loss
21:18 Fill-mask Pipeline For Testing
21:54 Testing With Laura

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi, welcome to the video, so this is the fourth video in a
00:00:07.820 | Transformers from Scratch mini-series, so
00:00:11.720 | If you haven't been following along we've essentially covered what you can see on the screen
00:00:18.160 | So we got some data we built a tokenizer with it
00:00:22.520 | And then we've set up our input pipeline ready to begin actually training our model
00:00:27.840 | Which is what we're going to cover in this video
00:00:30.600 | So let's move over to the code, and we see here that we have
00:00:37.580 | Essentially everything we've done so far so
00:00:40.640 | we've
00:00:42.800 | built our
00:00:44.800 | Input data, our input pipeline, and we're now at a point where we have a PyTorch DataLoader
00:00:52.400 | Ready, and we can begin training a model with it, so
00:00:57.480 | There are a
00:00:59.480 | Few things to be aware of so I mean first
00:01:03.600 | Let's just have a quick look at the structure of our data. So when we're training a model for masked language modeling
00:01:10.980 | We need a few tensors — we need three tensors — and this is for training RoBERTa, by the way
00:01:18.520 | Same thing with BERT as well
00:01:21.960 | we have our
00:01:24.560 | input IDs
00:01:26.480 | Attention mask, and our labels. Our input IDs have roughly 15% of
00:01:31.440 | Their values masked, so we can see that here
00:01:35.280 | We have these two tensors — these are the labels, and we have the real tokens in here, the token IDs — and
00:01:41.520 | Then in our input IDs tensor these have been replaced with mask tokens, that's the number fours
00:01:51.840 | That's the structure of our input data. We've
00:01:54.520 | Created a torch data set from it and use that to create a torch data loader
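A minimal sketch of that input structure, for anyone following along in code. The 15% masking probability and the mask token ID of 4 come from the description above; the function and class names (`mask_tokens`, `MLMDataset`) and the `encodings` dict are assumptions rather than the exact code from the video.

```python
import torch

def mask_tokens(input_ids, mask_token_id=4, mask_prob=0.15, num_special_tokens=5):
    # labels keep the original token IDs; ~15% of positions in input_ids are
    # replaced with the mask token, skipping the low special-token IDs (0-4 assumed)
    labels = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    mask = (rand < mask_prob) & (input_ids >= num_special_tokens)
    input_ids[mask] = mask_token_id
    return input_ids, labels

class MLMDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # encodings: dict holding 'input_ids', 'attention_mask', 'labels' tensors
        self.encodings = encodings
    def __len__(self):
        return self.encodings['input_ids'].shape[0]
    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}

# dataset = MLMDataset(encodings)
# dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```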
00:02:01.200 | And with that we can actually
00:02:04.080 | Begin setting up our model for training. So there are a few things
00:02:10.120 | To be aware of — we can't just begin training straight away
00:02:13.320 | so the first thing that we need to do is create a
00:02:17.200 | Roberta config object and
00:02:27.240 | This config object is something that we use when we're initializing a transformer from scratch
00:02:27.240 | In order to initialize it with a certain set of parameters
00:02:31.360 | So we'll do that first so we want from transformers
00:02:36.160 | import Roberta config
00:02:40.080 | okay, and
00:02:43.280 | Create that config object. We do this so
00:02:46.840 | We do Roberta config
00:02:51.200 | And then in here we need
00:02:53.200 | to specify different parameters now the
00:02:57.000 | One of the main ones is the vocab size. Now this needs to match whichever vocab size we have already
00:03:05.640 | created in our
00:03:08.520 | Tokenizer when building our tokenizer, so I mean for me if I go all the way up here
00:03:19.480 | Up to here — this is where I created the tokenizer, and I can see, okay, it's this number here, so the
00:03:26.600 | 30,522 so I'm gonna set that
00:03:31.040 | But if you don't have that you can just write tokenizer
00:03:36.800 | .vocab_size, so yeah, and
00:03:41.200 | That will return your vocab size. So, I mean, let's replace that — we'll do this
00:03:50.520 | As well as that, we want to also set this, so max position
00:03:57.600 | embeddings (max_position_embeddings), and
00:04:00.040 | this needs to be set to your
00:04:02.160 | Max length plus 2 in this case so max length is is set up here
00:04:12.520 | Where is it max length here 512?
00:04:15.640 | Plus 2 because we have these added special tokens if we don't do that
00:04:21.080 | We'll end up with an index error because we're going beyond the embedding
00:04:25.560 | limits
00:04:28.200 | Then we want our hidden size so this is the size of the vectors
00:04:33.840 | that our embedding layers within Roberta will create so each token, so we have
00:04:39.800 | 514 — or rather 512 — tokens, and
00:04:44.160 | Each one of those will be assigned a vector of size
00:04:47.200 | 768. This is a typical number that
00:04:51.480 | originally came from the BERT base model
00:04:54.920 | then we set the
00:04:57.680 | architecture of the
00:04:59.920 | Deep internals of the model so we want the number of attention heads
00:05:03.720 | which I'm going to set to 12 and
00:05:06.520 | also the number of
00:05:11.080 | Hidden layers — so the default for this for RoBERTa is
00:05:17.040 | 12, but I'm going to go with 6
00:05:20.000 | For the sake of keeping train times a little shorter
00:05:25.680 | Now we also need to add type
00:05:28.560 | vocab size (type_vocab_size), which is just 1
00:05:32.880 | So that's the different token types that we have — we just have one, so we don't need to worry about that
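Putting those parameters together, the config described here would look roughly like this. The vocab size of 30,522 and max length of 512 come from the earlier videos; treat the exact numbers as assumptions if your own tokenizer differs.

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=30_522,             # must match the tokenizer's vocab size
    max_position_embeddings=514,   # max_length (512) + 2 for the added special tokens
    hidden_size=768,               # size of the vector assigned to each token
    num_attention_heads=12,
    num_hidden_layers=6,           # RoBERTa's default is 12; 6 keeps training shorter
    type_vocab_size=1,             # only one token type
)
```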
00:05:41.080 | Okay, so
00:05:43.080 | That's our configuration object ready and we can import and initialize a Roberta model with that
00:05:50.560 | So we write from transformers — this is kind of similar to what we usually do — import
00:05:57.120 | Roberta, and we're doing this for masked LM, so MLM, right, so we're training using MLM,
00:06:06.040 | so we want RobertaForMaskedLM, and
00:06:08.400 | we initialize our model using that RobertaForMaskedLM object, and we just pass in our config, and
00:06:16.600 | this will, that's right, initialize our
00:06:21.400 | RoBERTa model. So that's a plain RoBERTa model,
00:06:25.560 | randomly initialized weights and so on.
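In code, that step is likely just this, assuming the config object above is called `config`:

```python
from transformers import RobertaForMaskedLM

# A plain RoBERTa MLM model initialized from scratch, with random weights
model = RobertaForMaskedLM(config)
```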
00:06:28.640 | now we can move on to
00:06:31.880 | Setting up everything for for training. So we have our model now need to
00:06:36.360 | Prepare a few things before we train it
00:06:39.080 | first thing is we need to decide which device we're going to be training on so whether that's CPU or a CUDA enabled GPU and
00:06:47.000 | To figure out if we have that we write
00:06:50.200 | well, we can write torch CUDA is
00:06:54.040 | available
00:06:56.520 | so write this, and for me it is.
00:06:59.480 | So the typical way that you would decide whether you're using
00:07:03.280 | CUDA or CPU — or the typical line of code that will decide it for you — is you write device, and
00:07:09.920 | You do torch.cuda — or torch.device, sorry,
00:07:16.000 | And then you write 'cuda' inside here
00:07:19.400 | If it's available, otherwise we are going to use torch
00:07:25.440 | device
00:07:27.760 | CPU now
00:07:32.120 | It takes — yeah, it just takes a really long time. So if you are using CPU,
00:07:38.880 | Know you have to leave it overnight for sure, maybe even longer
00:07:45.440 | Even if it's just like a little bit of data. It takes so long
00:07:51.920 | But hopefully hopefully you have a GPU if not, just you're gonna have to be patient. That's all
00:07:58.160 | Or if you could maybe try and use Google Colab
00:08:01.760 | but you have to use a premium version because otherwise it's just gonna
00:08:06.120 | Shut off after like an hour or two. I don't know. I don't really use it
00:08:10.740 | So I don't know how long it would train for before just deciding
00:08:15.400 | So it's done and the GPU is also not that good anyway, so yeah
00:08:22.920 | However you can do it — and then after that we want to move our model to our
00:08:29.680 | Device so whether it's GPU or CPU we move over there. We're gonna get really big output now
00:08:38.360 | So it's just our model. So this is like the structure of our model. So we can see a few interesting things. We've got
00:08:45.040 | Roberta for MLM. We have the Roberta model and then inside that we have our embeddings and then we have our
00:08:53.200 | 12 did I say 12? I think it was six
00:08:55.960 | Six encoders should be yeah
00:08:59.080 | so it goes up it goes from 0 to 5 so 6 and then we have the the outputs here and then our
00:09:05.280 | Final bit which is a language modeling head the MLM head
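The device selection and model move from a moment ago are roughly:

```python
import torch

# Use a CUDA-enabled GPU if one is available, otherwise fall back to CPU,
# then move the model's parameters onto that device.
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
```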
00:09:08.840 | So that's cool. Now. We need our
00:09:11.800 | optimizer so from transformers
00:09:17.480 | Import AdamW, which is Adam with weight decay,
00:09:21.880 | And what we're going to do is just activate the training mode of our model — gives us loads of output again
00:09:29.920 | So just
00:09:34.200 | You know, I'm maybe I can just
00:09:36.200 | Let's just remove that. There we go easier and
00:09:39.680 | Then my optimizer is going to be AdamW
00:09:43.600 | We need to pass in our model parameters
00:09:47.680 | And we need a learning rate so
00:09:50.560 | From I mean, I don't usually use Roberta but
00:09:55.160 | Looking online
00:09:58.040 | This looks like a reasonable
00:10:00.280 | Learning rate. I think you can go from sort of here to I think from what I remember down to like here
00:10:07.640 | That's the sort of typical range
00:10:10.000 | But obviously it's going to depend on how much data you have — and, don't do that —
00:10:14.400 | How much data you have and loads of different things, right? So
00:10:17.840 | That's what I'm gonna go with and
00:10:22.600 | That should be pretty much it so
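A sketch of that optimizer setup. The exact learning rate here is an assumption: as discussed later in the video, training starts at 1e-5 and is later raised to 1e-4.

```python
from transformers import AdamW  # torch.optim.AdamW works just as well

# Activate training mode and set up AdamW (Adam with weight decay)
model.train()
optim = AdamW(model.parameters(), lr=1e-4)
```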
00:10:28.000 | That's our setup. Now we're just going to create our training loop. Now, for the training loop
00:10:34.800 | want to
00:10:37.360 | Import TQDM so we can see how far through we are
00:10:41.440 | We're
00:10:43.440 | Going to train for two epochs and
00:10:47.120 | We're going to initialize our loop object using TQDM. So
00:10:55.640 | We have our data loader. What is the name of that data loader? I'm not sure let's
00:11:01.680 | Data loader cool
00:11:07.520 | Data loader and we set leave equals true
00:11:12.040 | But I need that — sorry, I need that in the same cell — so for
00:11:18.560 | Batch in loop
00:11:22.720 | And then here we
00:11:28.160 | You know run through each of the steps that we're going to perform for every single training loop
00:11:34.480 | so the first thing we do is
00:11:36.480 | Initialize the gradients in our optimizer, so zero_grad. So
00:11:42.720 | reason we do this is after the first loop our optimizer is going to be assigned a set of
00:11:49.280 | gradients which is going to use to
00:11:51.800 | optimize our model and
00:11:54.560 | On the next loop. We don't want those residual gradients to still be there in our optimizer
00:12:00.760 | We want to essentially reset it for the next loop. So that's what we're doing here. Then we want our
00:12:07.120 | tensors so we have
00:12:09.840 | input IDs and
00:12:11.840 | That is going to be batch
00:12:13.920 | input IDs
00:12:16.720 | And we also want to move that over to our
00:12:20.160 | GPU — or CPU, if you're on that
00:12:24.320 | And this is pretty much the same for
00:12:28.960 | our three tensors, so
00:12:34.040 | Labels
00:12:36.160 | And this is just attention mask, okay, so
00:12:41.240 | We've extracted our tensors and we just need to feed them into our model now
00:12:46.320 | So we're going to calculate outputs from the model. We just do model
00:12:51.440 | input IDs
00:12:53.920 | attention mask which is going to be equal to mask and
00:12:58.880 | Our labels
00:13:00.880 | Equal to labels
00:13:07.720 | Everything has been fed into our model. We have our outputs now. We need to extract a few things from the output
00:13:13.960 | So we what we need the loss
00:13:15.960 | so we write loss equals outputs dot loss and
00:13:19.560 | from that we want to calculate
00:13:22.600 | all of the
00:13:26.400 | Different parameters in our model — we need to calculate the gradient for each one of those parameters. So we do loss.backward() to
00:13:33.760 | Backpropagate through all of those different values and get those gradients
00:13:38.120 | After we've done that we use our optimizer
00:13:43.360 | take a step and
00:13:46.200 | optimize
00:13:47.560 | All those parameters based on that net loss
00:13:50.200 | Then that's everything we need to train the model and there's just a few things so for the progress bar
00:13:55.760 | I just want a little bit of information there just so I know what's going on and I just write loop
00:14:00.880 | set description
00:14:04.000 | And that's what I just want to print out the epoch so write that
00:14:13.400 | Then I want to set the post fix as well. So loop dot set
00:14:17.760 | post fix and
00:14:20.680 | Here I just want to see the loss, so that's loss, loss.item(), like that
00:14:25.000 | So that should be everything
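For reference, the training loop described above comes out to something like this. It is a sketch: `dataloader`, `model`, `optim` and `device` are the objects set up earlier, and two epochs is the value used in the video.

```python
from tqdm import tqdm

epochs = 2

for epoch in range(epochs):
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optim.zero_grad()                       # clear gradients left over from the previous step
        input_ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss                     # MLM loss computed by the model
        loss.backward()                         # backpropagate to get gradients
        optim.step()                            # update the parameters
        loop.set_description(f'Epoch {epoch}')  # progress bar info
        loop.set_postfix(loss=loss.item())
```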
00:14:28.000 | Yeah, let's let's run that see
00:14:31.800 | See what happens
00:14:34.640 | Hopefully should work. No didn't work
00:14:44.280 | No, no, it's a CUDA error
00:14:50.520 | Probably just need to refresh everything. I hate CUDA errors — one moment
00:14:57.520 | Okay, so I finally figured it out — took so long. So, a few tips
00:15:05.360 | Anyway, when you do get a cuda error
00:15:07.600 | switch your device to CPU and
00:15:10.640 | Then rerun everything and you should get a more understandable error. So if we come down here, I've changed it to CPU
00:15:17.520 | You see that we get an index error scroll down
00:15:20.360 | index out of range in self. So the reason for this is
00:15:24.440 | So you get this error?
00:15:27.480 | If you don't add the extra two tokens onto the end of here, but you know, we add them
00:15:33.800 | So I was pretty confused about that
00:15:36.920 | and then it took me a really long time to realize that this argument is wrong and
00:15:42.320 | There should be an 's' on the end — max_position_embeddings. So that was the error
00:15:47.680 | So, yeah, super super cool that that was literally it and took me so long to figure that out
00:15:54.800 | But now we have it that's good. I just need to run everything again
00:16:01.080 | So I'm just going to run through everything, remove this cell here where I change it to
00:16:06.200 | CPU because I don't need it now and
00:16:08.720 | Just rerun all that
00:16:17.040 | So we're back and we've finished training our model now now it has taken a long time. This is a few days later
00:16:25.000 | And I made a few changes during training as well. So this definitely wasn't the cleanest training process because I was kind of
00:16:33.040 | Updating parameters as it was going along
00:16:37.080 | so initially
00:16:39.520 | well first
00:16:41.080 | We've trained for like three and a bit epochs and I've trained on the full data set as well
00:16:51.160 | If I come up here, I think do I print out how much data it was
00:16:56.960 | Maybe in another file
00:17:04.080 | So if we come down here, so yeah, there's a lot more data here so we have
00:17:10.920 | 200 no 20, let me think 2 million. Okay, so 2 million samples in that final run and
00:17:19.780 | Initially, when we started training, we started with a
00:17:24.800 | learning rate of 1e-5. Now I
00:17:28.920 | Looked into this a little bit and it just was not really moving and I'll show you in a minute
00:17:34.440 | so for the second epoch, I
00:17:36.520 | Moved it down to 1e-
00:17:39.880 | 4 — moved it up,
00:17:41.480 | Sorry, up to 1e-4 — and that, you know, started moving things a lot quicker
00:17:46.720 | So that was good and then in total, like I said, it was 3 and a bit epochs
00:17:51.120 | Well, then I didn't really change anything
00:17:53.200 | the only thing I did was I trained like 1 epoch at a time because I wanted to see how
00:17:58.160 | You know how the results were looking after each epoch
00:18:01.800 | And that was quite interesting. So let me let me show you that
00:18:06.560 | Okay, so this is after the first epoch. So okay we so here what I'm doing is I've got this
00:18:13.200 | Fill, which is a fill-mask pipeline object, and I'm entering 'ciao' and then putting in our mask and then the rest, and I wanted to
00:18:22.200 | Say 'ciao come va', right, and in the middle I wanted it to predict 'come'. Now
00:18:26.200 | This is after the first epoch and you can see it's not — yeah, it's just putting in random
00:18:34.760 | Random characters. So question mark here
00:18:36.960 | Three dots here
00:18:42.480 | 'Ciao' and 'ciao' again here
00:18:42.480 | kind of weird
00:18:45.400 | Yeah, not the best, right. Now we move on to the second epoch and it's getting — well,
00:18:52.640 | Still rubbish, okay, but at least it's got words
00:18:55.320 | So like here we have a word
00:19:00.160 | 'Chi va' — 'kiva' or 'chiva'?
00:19:03.880 | 'Ciao, chi va'. I don't know if that's the way — I always — the CH in Italian
00:19:09.120 | I always get messed up if there's any Italians watching. I'm I'm sorry
00:19:12.640 | 'Ciao, cosa fa' — you know, at least we're getting words, but none of these... so it doesn't make any sense. Okay, so
00:19:21.920 | No, I'm still not good
00:19:25.240 | now if we come across again, so this is
00:19:30.040 | This one. Yeah this one now we get it
00:19:32.960 | So apart from the first one, the rest of these — the rest of them are nonsense. Okay, so the four here,
00:19:39.840 | Ignore them. However, at the top we get this score, 0.33, and we get 'come' coming back
00:19:46.800 | So that's what we wanted. So that's good means it's working. This was this was after the third and a bit epoch
00:19:52.720 | Let me show you loss function as well
00:19:57.800 | So this I know this is really messy
00:20:00.920 | So here we have our I don't know why this one's so short. Actually. Why is that one so short?
00:20:08.240 | Hmm strange, well, maybe I didn't
00:20:14.320 | Yeah, the last one doesn't look like I finished training for the full epoch — I thought I did, maybe something happened
00:20:23.200 | I'm not sure but fine. This is what it is. That's fine
00:20:28.000 | So the first set of training I did was here
00:20:32.160 | And you see in the middle my computer went to sleep for a bit overnight because it was just so loud,
00:20:37.320 | so I turned it off for a bit and
00:20:40.320 | Then continue going down now this first epoch is when we were at
00:20:44.560 | one point
00:20:47.120 | Or 1 e to the minus 5 and then here I was testing the 1 e to the minus 4 and you can see straight
00:20:53.080 | Away, it goes down way quicker. So I was like, okay, we're gonna go with that. It's clearly a lot better and
00:20:58.680 | Then continued over here next epoch and then find the final one here, which it didn't seem to change much anyway
00:21:05.720 | But there was there was so pretty clear difference. So that's the loss over time and
00:21:13.280 | Yeah, I mean we've seen the results from that. So now we have that let's move on to actually
00:21:20.600 | Testing the model. So I'm going to bring Lara and I'm going to just open the
00:21:25.040 | the file
00:21:28.120 | Okay, so this is the testing we're gonna do. So we're using the fill-mask — we've got this pipeline,
00:21:34.120 | sorry, fill-mask, I've got this pipeline, and
00:21:38.120 | what I'm going to do is just get Lara to come in and write some Italian sentences and just add this mask token in and
00:21:48.200 | See if the results are
00:21:50.200 | Bearable or not. So let's see
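The testing setup is roughly the following sketch. `model` and `tokenizer` are assumed to be the trained RobertaForMaskedLM and its tokenizer, and the mask token string is assumed to be `<mask>`.

```python
from transformers import pipeline

# Build a fill-mask pipeline from the freshly trained model and tokenizer
fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Replace one word of an Italian sentence with the mask token and
# see what the model predicts for it
print(fill('ciao <mask> va'))
```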
00:21:52.440 | So I will see you in a minute. This is Lara. She can speak Italian. So she's going to go through this and
00:22:00.320 | test it a few times and
00:22:02.960 | Hopefully say it's good. Let's see. Hopefully. Ciao
00:22:06.600 | okay, so all you need to do is we have like a sentence here and
00:22:14.280 | You just write some Italian and then for one of the words in there we want to replace it with this text here and then
00:22:20.680 | that's going to like
00:22:21.880 | Mask that word and then the model is going to try and predict what is there and hopefully it will predict
00:22:27.800 | Let's let's see. So just write some Italian phrase not not too difficult yet and see
00:22:35.900 | So I don't have to write all bar. No, no, no, no you write just write a sentence and okay
00:22:44.340 | Do that
00:22:46.340 | Buongiorno
00:22:48.460 | Dante no, buongiorno
00:22:51.380 | Maybe a few words
00:22:54.100 | Okay, can I put comma or?
00:23:00.100 | Buongiorno, come va
00:23:00.100 | Okay, and then so which which words should we cover come here, okay
00:23:09.460 | And then
00:23:13.500 | Okay, so just cover it with the
00:23:16.460 | mask and
00:23:19.660 | See what it says. So not this — I
00:23:22.420 | need to rerun these as well
00:23:26.140 | Okay, so let's give it a moment
00:23:31.180 | Keep going — yeah, but the second one, 'come va'. 'Come va' — ah, it's almost... does 'chi va' mean anything, like 'who'?
00:23:42.740 | Yeah, it's like
00:23:44.740 | like is there someone but we would like I understand because I'm Italian, but I don't think that
00:23:51.740 | We don't usually say that, I don't think. I'm gonna take that as fine — I'm gonna take that as it's good
00:24:00.680 | So let's do it again. Maybe yeah
00:24:03.680 | Try another one. Oh
00:24:08.660 | Wait, actually, what about these ones because of that
00:24:10.940 | Okay, but it just might be after Buongiorno, I I wouldn't expect
00:24:25.900 | It's okay
00:24:28.380 | So you can just put another one like where we put fill again right in the sentence. So we're here. Mm-hmm. Yeah
00:24:36.000 | So we can write
00:24:38.580 | Mm-hmm
00:24:47.120 | Yeah, and then what do you want to replace? 'Incontriamo'? 'Incontriamo', maybe, or 'dove'
00:24:54.060 | Yeah, so which one we decide it's fine
00:25:02.320 | Yeah, that's good though she video mojito Marie Joe the wishy country mojito Marie Joe the wishy siamo
00:25:14.840 | I do believe Joe no, no, the wishy Troia mojito Marie Joe the wishy ritroviamo. Yeah, that's that's quite good
00:25:21.540 | Okay, should we try with 'dove', like using the same phrase?
00:25:29.720 | Okay, you can control Z, right, yeah
00:25:33.440 | Okay, let's run it
00:25:44.080 | Dove in the second one call me she country I'm
00:25:52.500 | Not one that she country I was good. Uh-huh. See
00:25:58.320 | It's cool. Yeah cause something say
00:26:01.620 | Let's try another one. Yeah
00:26:04.480 | Okay, let's remove the body
00:26:23.160 | Yeah, yeah go run it
00:26:26.640 | Cause I fire you but you know that's it. That's good. Cause I serve you but she knows the zero
00:26:34.000 | Yeah, cause I sped up it. She knows the zero me
00:26:38.480 | Cause I said it but she knows the zero me
00:26:42.960 | Cause I believe it but she knows the zero me. So I didn't find what we said before which was cause I'll be party
00:26:54.600 | Yeah, it makes sense. Cause I fired with you know, so I said I was gonna serve you but she knows the zero
00:26:59.000 | So you try something hard like grammatically difficult. Hmm
00:27:04.000 | No, I'm thinking I don't know when it's like that, you know, I don't know like
00:27:18.460 | It doesn't come to my mind
00:27:20.460 | Okay, yeah, and then what should we replace? 'Avessimo'
00:27:31.660 | 'Avessimo'. What does that mean? 'If we had'?
00:27:45.260 | That's very good — no, but it's good, because 'avessero' is for
00:27:50.880 | Third person plural. 'Avessimo' is like 'we had'; 'avesse' is third person singular
00:27:57.980 | So if he had or she had yeah choosing something
00:28:02.120 | So what would have happened if we had chosen another day
00:28:11.460 | The first one says 'avesse scelto' — that would be the third person
00:28:16.560 | 'Avessi scelto' — that would be the first person, so 'if I had chosen another day'
00:28:23.500 | 'Aveste scelto' — oh, it's second person plural, so it would be 'if you had chosen another day'
00:28:31.240 | Say I really should this one now. See how it is. Yes. Yeah, this is good. Seven is a shelter
00:28:41.460 | No, maybe no, no, but the first three are very good. Yeah, I
00:28:46.260 | Have an idea. So now if we change to
00:28:50.820 | 'Se'. So if we put 'se loro' — so if we specify the person, maybe it will pick the correct one
00:28:59.960 | So if we put 'se loro'
00:29:02.180 | And then we expect it to say
00:29:06.500 | 'Avessero'. 'Avessero'. So let's run it
00:29:09.540 | You see? That's cool. That's very good
00:29:15.020 | And then the other one is Laura anno. It's right?
00:29:18.420 | I mean the I'm saying well the the verb it's
00:29:23.180 | Incorrect, but yeah, it's in the wrong place, but it's saying the right
00:29:27.020 | Like the meaning is correct. Yeah, but the grammar, it's not correct. Okay. Yeah
00:29:36.100 | It's cool
00:29:38.100 | Because I wasn't sure how far you could push it —
00:29:41.140 | Beyond 'ciao come va' coming back; that was all I tested it with, so I was a little bit worried
00:29:47.220 | Okay anything else? Thank you. You're welcome
00:29:54.420 | Okay, so I think that's a pretty good result
00:29:59.240 | So I mean that that's pretty much everything we needed for building our model
00:30:08.020 | our transformer model
00:30:12.900 | Although — so, we're going to do one more video after this where we're going to
00:30:12.900 | Upload our model to the Hugging Face model hub
00:30:18.420 | And then what we'll be able to do is actually download it directly from Hugging Face
00:30:22.900 | Which I think will be super cool — to do that and figure out how we actually pull that together
00:30:28.260 | So yeah, I think
00:30:30.500 | good result
00:30:32.020 | pretty happy with that and
00:30:34.500 | Thank you for watching and I will see you again in the next one.