
Training and Testing an Italian BERT - Transformers From Scratch #4


Chapters

0:00 Intro
0:35 Review of Code
2:02 Config Object
6:28 Setup For Training
10:30 Training Loop
14:57 Dealing With CUDA Errors
16:17 Training Results
19:52 Loss
21:18 Fill-mask Pipeline For Testing
21:54 Testing With Laura

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi, welcome to the video, so this is the fourth video in a
00:00:07.820 | Transformers from Scratch mini-series, so
00:00:11.720 | If you haven't been following along we've essentially covered what you can see on the screen
00:00:18.160 | So we got some data we built a tokenizer with it
00:00:22.520 | And then we've set up our input pipeline ready to begin actually training our model
00:00:27.840 | Which is what we're going to cover in this video
00:00:30.600 | So let's move over to the code, and we see here that we have
00:00:37.580 | Essentially everything we've done so far so
00:00:40.640 | we've
00:00:42.800 | built our
00:00:44.800 | Input data, our input pipeline, and we're now at a point where we have a PyTorch DataLoader
00:00:52.400 | Ready, and we can begin training a model with it, so
00:00:57.480 | There are a
00:00:59.480 | Few things to be aware of so I mean first
00:01:03.600 | Let's just have a quick look at the structure of our data. So when we're training a model for masked language modeling
00:01:10.980 | We need a few tensors — we need three tensors — and this is for training RoBERTa, by the way
00:01:18.520 | Same thing with BERT as well
00:01:21.960 | we have our
00:01:24.560 | input IDs
00:01:26.480 | Attention mask, and our labels. Our input IDs have roughly 15% of
00:01:31.440 | Their values masked, so we can see that here
00:01:35.280 | We have these two tensors — these are the labels, and we have the real tokens in here, the token IDs — and
00:01:41.520 | Then in our input IDs tensor these have been replaced with mask tokens, that's the number fours
00:01:51.840 | That's the structure of our input data. We've
00:01:54.520 | Created a torch data set from it and use that to create a torch data loader
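A minimal sketch of that input structure, for anyone following along in code. The 15% masking probability and the mask token ID of 4 come from the description above; the function and class names (`mask_tokens`, `MLMDataset`) and the `encodings` dict are assumptions rather than the exact code from the video.

```python
import torch

def mask_tokens(input_ids, mask_token_id=4, mask_prob=0.15, num_special_tokens=5):
    # labels keep the original token IDs; ~15% of positions in input_ids are
    # replaced with the mask token, skipping the low special-token IDs (0-4 assumed)
    labels = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    mask = (rand < mask_prob) & (input_ids >= num_special_tokens)
    input_ids[mask] = mask_token_id
    return input_ids, labels

class MLMDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # encodings: dict holding 'input_ids', 'attention_mask', 'labels' tensors
        self.encodings = encodings
    def __len__(self):
        return self.encodings['input_ids'].shape[0]
    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}

# dataset = MLMDataset(encodings)
# dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```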
00:02:01.200 | And with that we can actually
00:02:04.080 | Begin setting up our model for training. So there are a few things
00:02:10.120 | To be aware of — we can't just begin training straight away
00:02:13.320 | so the first thing that we need to do is create a
00:02:17.200 | Roberta config object and
00:02:27.240 | This config object is something that we use when we're initializing a transformer from scratch
00:02:27.240 | In order to initialize it with a certain set of parameters
00:02:31.360 | So we'll do that first so we want from transformers
00:02:36.160 | import Roberta config
00:02:40.080 | okay, and
00:02:43.280 | Create that config object. We do this so
00:02:46.840 | We do Roberta config
00:02:51.200 | And then in here we need
00:02:53.200 | to specify different parameters now the
00:02:57.000 | One of the main ones is the vocab size. Now this needs to match whichever vocab size we have already
00:03:05.640 | created in our
00:03:08.520 | Tokenizer when building our tokenizer, so I mean for me if I go all the way up here
00:03:19.480 | Up to here — this is where I created the tokenizer, and I can see, okay, it's this number here, so the
00:03:26.600 | 30,522 so I'm gonna set that
00:03:31.040 | But if you don't have that you can just write tokenizer
00:03:36.800 | .vocab_size, so yeah, and
00:03:41.200 | That will return your vocab size. So, I mean, let's replace that — we'll do this
00:03:50.520 | As well as that, we want to also set this, so max position
00:03:57.600 | embeddings (max_position_embeddings), and
00:04:00.040 | this needs to be set to your
00:04:02.160 | Max length plus 2 in this case so max length is is set up here
00:04:12.520 | Where is it max length here 512?
00:04:15.640 | Plus 2 because we have these added special tokens if we don't do that
00:04:21.080 | We'll end up with an index error because we're going beyond the embedding
00:04:25.560 | limits
00:04:28.200 | Then we want our hidden size so this is the size of the vectors
00:04:33.840 | that our embedding layers within Roberta will create so each token, so we have
00:04:39.800 | 514 — or rather 512 — tokens, and
00:04:44.160 | Each one of those will be assigned a vector of size
00:04:47.200 | 768. This is a typical number that
00:04:51.480 | originally came from the BERT base model
00:04:54.920 | then we set the
00:04:57.680 | architecture of the
00:04:59.920 | Deep internals of the model so we want the number of attention heads
00:05:03.720 | which I'm going to set to 12 and
00:05:06.520 | also the number of
00:05:11.080 | Hidden layers — so the default for this for RoBERTa is
00:05:17.040 | 12, but I'm going to go with 6
00:05:20.000 | For the sake of keeping train times a little shorter
00:05:25.680 | Now we also need to add type
00:05:28.560 | vocab size (type_vocab_size), which is just 1
00:05:32.880 | So that's the different token types that we have — we just have one, so we don't need to worry about that
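Putting those parameters together, the config described here would look roughly like this. The vocab size of 30,522 and max length of 512 come from the earlier videos; treat the exact numbers as assumptions if your own tokenizer differs.

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=30_522,             # must match the tokenizer's vocab size
    max_position_embeddings=514,   # max_length (512) + 2 for the added special tokens
    hidden_size=768,               # size of the vector assigned to each token
    num_attention_heads=12,
    num_hidden_layers=6,           # RoBERTa's default is 12; 6 keeps training shorter
    type_vocab_size=1,             # only one token type
)
```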
00:05:41.080 | Okay, so
00:05:43.080 | That's our configuration object ready and we can import and initialize a Roberta model with that
00:05:50.560 | So we write from transformers — this is kind of similar to what we usually do — import
00:05:57.120 | Roberta, and we're doing this for masked LM, so MLM, right, so we're training using MLM,
00:06:06.040 | so we want RobertaForMaskedLM, and
00:06:08.400 | we initialize our model using that RobertaForMaskedLM object, and we just pass in our config, and
00:06:16.600 | this will, that's right, initialize our
00:06:21.400 | RoBERTa model. So that's a plain RoBERTa model,
00:06:25.560 | randomly initialized weights and so on.
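In code, that step is likely just this, assuming the config object above is called `config`:

```python
from transformers import RobertaForMaskedLM

# A plain RoBERTa MLM model initialized from scratch, with random weights
model = RobertaForMaskedLM(config)
```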
00:06:28.640 | now we can move on to
00:06:31.880 | Setting up everything for for training. So we have our model now need to
00:06:36.360 | Prepare a few things before we train it
00:06:39.080 | first thing is we need to decide which device we're going to be training on so whether that's CPU or a CUDA enabled GPU and
00:06:47.000 | To figure out if we have that we write
00:06:50.200 | well, we can write torch CUDA is
00:06:54.040 | available
00:06:56.520 | so write this, and for me it is.
00:06:59.480 | So the typical way that you would decide whether you're using
00:07:03.280 | CUDA or CPU — or the typical line of code that will decide it for you — is you write device, and
00:07:09.920 | You do torch.cuda — or torch.device, sorry,
00:07:16.000 | And then you write 'cuda' inside here
00:07:19.400 | If it's available, otherwise we are going to use torch
00:07:25.440 | device
00:07:27.760 | CPU now
00:07:32.120 | It takes — yeah, it just takes a really long time. So if you are using CPU,
00:07:38.880 | Know you have to leave it overnight for sure, maybe even longer
00:07:45.440 | Even if it's just like a little bit of data. It takes so long
00:07:51.920 | But hopefully hopefully you have a GPU if not, just you're gonna have to be patient. That's all
00:07:58.160 | Or if you could maybe try and use Google Colab
00:08:01.760 | but you have to use a premium version because otherwise it's just gonna
00:08:06.120 | Shut off after like an hour or two. I don't know. I don't really use it
00:08:10.740 | So I don't know how long it would train for before just deciding
00:08:15.400 | So it's done and the GPU is also not that good anyway, so yeah
00:08:22.920 | However you can do it — and then after that we want to move our model to our
00:08:29.680 | Device so whether it's GPU or CPU we move over there. We're gonna get really big output now
00:08:38.360 | So it's just our model. So this is like the structure of our model. So we can see a few interesting things. We've got
00:08:45.040 | Roberta for MLM. We have the Roberta model and then inside that we have our embeddings and then we have our
00:08:53.200 | 12 did I say 12? I think it was six
00:08:55.960 | Six encoders should be yeah
00:08:59.080 | so it goes up it goes from 0 to 5 so 6 and then we have the the outputs here and then our
00:09:05.280 | Final bit which is a language modeling head the MLM head
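The device selection and model move from a moment ago are roughly:

```python
import torch

# Use a CUDA-enabled GPU if one is available, otherwise fall back to CPU,
# then move the model's parameters onto that device.
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
```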
00:09:08.840 | So that's cool. Now. We need our
00:09:11.800 | optimizer so from transformers
00:09:17.480 | Import AdamW, which is Adam with weight decay,
00:09:21.880 | And what we're going to do is just activate the training mode of our model — gives us loads of output again
00:09:29.920 | So just
00:09:34.200 | You know, I'm maybe I can just
00:09:36.200 | Let's just remove that. There we go easier and
00:09:39.680 | Then my optimizer is going to be AdamW
00:09:43.600 | We need to pass in our model parameters
00:09:47.680 | And we need a learning rate so
00:09:50.560 | From I mean, I don't usually use Roberta but
00:09:55.160 | Looking online
00:09:58.040 | This looks like a reasonable
00:10:00.280 | Learning rate. I think you can go from sort of here to I think from what I remember down to like here
00:10:07.640 | That's the sort of typical range
00:10:10.000 | But obviously it's going to depend on how much data you have — and, don't do that —
00:10:14.400 | How much data you have and loads of different things, right? So
00:10:17.840 | That's what I'm gonna go with and
00:10:22.600 | That should be pretty much it so
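A sketch of that optimizer setup. The exact learning rate here is an assumption: as discussed later in the video, training starts at 1e-5 and is later raised to 1e-4.

```python
from transformers import AdamW  # torch.optim.AdamW works just as well

# Activate training mode and set up AdamW (Adam with weight decay)
model.train()
optim = AdamW(model.parameters(), lr=1e-4)
```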
00:10:28.000 | That's our setup. Now we're just going to create our training loop. Now, for the training loop
00:10:34.800 | want to
00:10:37.360 | Import TQDM so we can see how far through we are
00:10:41.440 | We're
00:10:43.440 | Going to train for two epochs and
00:10:47.120 | We're going to initialize our loop object using TQDM. So
00:10:55.640 | We have our data loader. What is the name of that data loader? I'm not sure let's
00:11:01.680 | Data loader cool
00:11:07.520 | Data loader and we set leave equals true
00:11:12.040 | But I need that — sorry, I need that in the same cell — so for
00:11:18.560 | Batch in loop
00:11:22.720 | And then here we
00:11:28.160 | You know run through each of the steps that we're going to perform for every single training loop
00:11:34.480 | so the first thing we do is
00:11:36.480 | Initialize the gradients in our optimizer, so zero_grad. So
00:11:42.720 | reason we do this is after the first loop our optimizer is going to be assigned a set of
00:11:49.280 | gradients which is going to use to
00:11:51.800 | optimize our model and
00:11:54.560 | On the next loop. We don't want those residual gradients to still be there in our optimizer
00:12:00.760 | We want to essentially reset it for the next loop. So that's what we're doing here. Then we want our
00:12:07.120 | tensors so we have
00:12:09.840 | input IDs and
00:12:11.840 | That is going to be batch
00:12:13.920 | input IDs
00:12:16.720 | And we also want to move that over to our
00:12:20.160 | GPU — or CPU, if you're on that
00:12:24.320 | And this is pretty much the same for
00:12:28.960 | our three tensors, so
00:12:34.040 | Labels
00:12:36.160 | And this is just attention mask, okay, so
00:12:41.240 | We've extracted our tensors and we just need to feed them into our model now
00:12:46.320 | So we're going to calculate outputs from the model. We just do model
00:12:51.440 | input IDs
00:12:53.920 | attention mask which is going to be equal to mask and
00:12:58.880 | Our labels
00:13:00.880 | Equal to labels
00:13:07.720 | Everything has been fed into our model. We have our outputs now. We need to extract a few things from the output
00:13:13.960 | So we what we need the loss
00:13:15.960 | so we write loss equals outputs dot loss and
00:13:19.560 | from that we want to calculate
00:13:22.600 | all of the
00:13:26.400 | Different parameters in our model — we need to calculate the gradient for each one of those parameters. So we do loss.backward() to
00:13:33.760 | Backpropagate through all of those different values and get those gradients
00:13:38.120 | After we've done that we use our optimizer
00:13:43.360 | take a step and
00:13:46.200 | optimize
00:13:47.560 | All those parameters based on that net loss
00:13:50.200 | Then that's everything we need to train the model and there's just a few things so for the progress bar
00:13:55.760 | I just want a little bit of information there just so I know what's going on and I just write loop
00:14:00.880 | set description
00:14:04.000 | And that's what I just want to print out the epoch so write that
00:14:13.400 | Then I want to set the post fix as well. So loop dot set
00:14:17.760 | post fix and
00:14:20.680 | Here I just want to see the loss, so that's loss, loss.item(), like that
00:14:25.000 | So that should be everything
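For reference, the training loop described above comes out to something like this. It is a sketch: `dataloader`, `model`, `optim` and `device` are the objects set up earlier, and two epochs is the value used in the video.

```python
from tqdm import tqdm

epochs = 2

for epoch in range(epochs):
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optim.zero_grad()                       # clear gradients left over from the previous step
        input_ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss                     # MLM loss computed by the model
        loss.backward()                         # backpropagate to get gradients
        optim.step()                            # update the parameters
        loop.set_description(f'Epoch {epoch}')  # progress bar info
        loop.set_postfix(loss=loss.item())
```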
00:14:28.000 | Yeah, let's let's run that see
00:14:31.800 | See what happens
00:14:34.640 | Hopefully should work. No didn't work
00:14:44.280 | No, no, it's a CUDA error
00:14:50.520 | Probably just need to refresh everything. I hate CUDA errors — one moment
00:14:57.520 | Okay, so I finally figured it out — took so long. So, a few tips
00:15:05.360 | Anyway, when you do get a cuda error
00:15:07.600 | switch your device to CPU and
00:15:10.640 | Then rerun everything and you should get a more understandable error. So if we come down here, I've changed it to CPU
00:15:17.520 | You see that we get an index error scroll down
00:15:20.360 | index out of range in self. So the reason for this is
00:15:24.440 | So you get this error?
00:15:27.480 | If you don't add the extra two tokens onto the end of here, but you know, we add them
00:15:33.800 | So I was pretty confused about that
00:15:36.920 | and then it took me a really long time to realize that this argument is wrong and
00:15:42.320 | There should be an 's' on the end — max_position_embeddings. So that was the error
00:15:47.680 | So, yeah, super super cool that that was literally it and took me so long to figure that out
00:15:54.800 | But now we have it that's good. I just need to run everything again
00:16:01.080 | So I'm just going to run through everything, remove this cell here where I change it to
00:16:06.200 | CPU because I don't need it now and
00:16:08.720 | Just rerun all that
00:16:17.040 | So we're back and we've finished training our model now now it has taken a long time. This is a few days later
00:16:25.000 | And I made a few changes during training as well. So this definitely wasn't the cleanest training process because I was kind of
00:16:33.040 | Updating parameters as it was going along
00:16:37.080 | so initially
00:16:39.520 | well first
00:16:41.080 | We've trained for like three and a bit epochs and I've trained on the full data set as well
00:16:51.160 | If I come up here, I think do I print out how much data it was
00:16:56.960 | Maybe in another file
00:17:04.080 | So if we come down here, so yeah, there's a lot more data here so we have
00:17:10.920 | 200 no 20, let me think 2 million. Okay, so 2 million samples in that final run and
00:17:19.780 | Initially, when we started training, we started with a
00:17:24.800 | learning rate of 1e-5. Now I
00:17:28.920 | Looked into this a little bit and it just was not really moving and I'll show you in a minute
00:17:34.440 | so for the second epoch, I
00:17:36.520 | Moved it down to 1e-
00:17:39.880 | 4 — moved it up,
00:17:41.480 | Sorry, up to 1e-4 — and that, you know, started moving things a lot quicker
00:17:46.720 | So that was good and then in total, like I said, it was 3 and a bit epochs
00:17:51.120 | Well, then I didn't really change anything
00:17:53.200 | the only thing I did was I trained like 1 epoch at a time because I wanted to see how
00:17:58.160 | You know how the results were looking after each epoch
00:18:01.800 | And that was quite interesting. So let me let me show you that
00:18:06.560 | Okay, so this is after the first epoch. So okay we so here what I'm doing is I've got this
00:18:13.200 | Fill, which is a fill-mask pipeline object, and I'm entering 'ciao' and then putting in our mask and then the rest, and I wanted to
00:18:22.200 | Say 'ciao come va', right, and in the middle I wanted it to predict 'come'. Now
00:18:26.200 | This is after the first epoch and you can see it's not — yeah, it's just putting in random
00:18:34.760 | Random characters. So question mark here
00:18:36.960 | Three dots here
00:18:42.480 | 'Ciao' and 'ciao' again here
00:18:42.480 | kind of weird
00:18:45.400 | Yeah, not the best, right. Now we move on to the second epoch and it's getting — well,
00:18:52.640 | Still rubbish, okay, but at least it's got words
00:18:55.320 | So like here we have a word
00:19:00.160 | 'Chi va' — 'kiva' or 'chiva'?
00:19:03.880 | 'Ciao, chi va'. I don't know if that's the way — I always — the CH in Italian
00:19:09.120 | I always get messed up if there's any Italians watching. I'm I'm sorry
00:19:12.640 | 'Ciao, cosa fa' — you know, at least we're getting words, but none of these... so it doesn't make any sense. Okay, so
00:19:21.920 | No, I'm still not good
00:19:25.240 | now if we come across again, so this is
00:19:30.040 | This one. Yeah this one now we get it
00:19:32.960 | So apart from the first one, the rest of these — the rest of them are nonsense. Okay, so the four here,
00:19:39.840 | Ignore them. However, at the top we get this score, 0.33, and we get 'come' coming back
00:19:46.800 | So that's what we wanted. So that's good means it's working. This was this was after the third and a bit epoch
00:19:52.720 | Let me show you loss function as well
00:19:57.800 | So this I know this is really messy
00:20:00.920 | So here we have our I don't know why this one's so short. Actually. Why is that one so short?
00:20:08.240 | Hmm strange, well, maybe I didn't
00:20:14.320 | Yeah, the last one doesn't look like I finished training for the full epoch — I thought I did, maybe something happened
00:20:23.200 | I'm not sure but fine. This is what it is. That's fine
00:20:28.000 | So the first set of training I did was here
00:20:32.160 | And you see in the middle my computer went to sleep for a bit overnight because it was just so loud,
00:20:37.320 | so I turned it off for a bit and
00:20:40.320 | Then continue going down now this first epoch is when we were at
00:20:44.560 | one point
00:20:47.120 | Or 1 e to the minus 5 and then here I was testing the 1 e to the minus 4 and you can see straight
00:20:53.080 | Away, it goes down way quicker. So I was like, okay, we're gonna go with that. It's clearly a lot better and
00:20:58.680 | Then continued over here next epoch and then find the final one here, which it didn't seem to change much anyway
00:21:05.720 | But there was there was so pretty clear difference. So that's the loss over time and
00:21:13.280 | Yeah, I mean we've seen the results from that. So now we have that let's move on to actually
00:21:20.600 | Testing the model. So I'm going to bring Lara and I'm going to just open the
00:21:25.040 | the file
00:21:28.120 | Okay, so this is the testing we're gonna do. So we're using the fill-mask — we've got this pipeline,
00:21:34.120 | sorry, fill-mask, I've got this pipeline, and
00:21:38.120 | what I'm going to do is just get Lara to come in and write some Italian sentences and just add this mask token in and
00:21:48.200 | See if the results are
00:21:50.200 | Bearable or not. So let's see
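The testing setup is roughly the following sketch. `model` and `tokenizer` are assumed to be the trained RobertaForMaskedLM and its tokenizer, and the mask token string is assumed to be `<mask>`.

```python
from transformers import pipeline

# Build a fill-mask pipeline from the freshly trained model and tokenizer
fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Replace one word of an Italian sentence with the mask token and
# see what the model predicts for it
print(fill('ciao <mask> va'))
```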
00:21:52.440 | So I will see you in a minute. This is Lara. She can speak Italian. So she's going to go through this and
00:22:00.320 | test it a few times and
00:22:02.960 | Hopefully say it's good. Let's see. Hopefully. Ciao
00:22:06.600 | okay, so all you need to do is we have like a sentence here and
00:22:14.280 | You just write some Italian and then for one of the words in there we want to replace it with this text here and then
00:22:20.680 | that's going to like
00:22:21.880 | Mask that word and then the model is going to try and predict what is there and hopefully it will predict
00:22:27.800 | Let's let's see. So just write some Italian phrase not not too difficult yet and see
00:22:35.900 | So I don't have to write all bar. No, no, no, no you write just write a sentence and okay
00:22:44.340 | Do that
00:22:46.340 | Buongiorno
00:22:48.460 | Dante no, buongiorno
00:22:51.380 | Maybe a few words
00:22:54.100 | Okay, can I put comma or?
00:23:00.100 | Buongiorno, come va
00:23:00.100 | Okay, and then so which which words should we cover come here, okay
00:23:09.460 | And then
00:23:13.500 | Okay, so just cover it with the
00:23:16.460 | mask and
00:23:19.660 | See what it says. So not this — I
00:23:22.420 | need to rerun these as well
00:23:26.140 | Okay, so let's give it a moment
00:23:31.180 | Keep going — yeah, but the second one, 'come va'. 'Come va' — ah, it's almost... does 'chi va' mean anything, like 'who'?
00:23:42.740 | Yeah, it's like
00:23:44.740 | like is there someone but we would like I understand because I'm Italian, but I don't think that
00:23:51.740 | We don't usually say that, I don't think. I'm gonna take that as fine — I'm gonna take that as it's good
00:24:00.680 | So let's do it again. Maybe yeah
00:24:03.680 | Try another one. Oh
00:24:08.660 | Wait, actually, what about these ones because of that
00:24:10.940 | Okay, but it just might be after Buongiorno, I I wouldn't expect
00:24:25.900 | It's okay
00:24:28.380 | So you can just put another one like where we put fill again right in the sentence. So we're here. Mm-hmm. Yeah
00:24:36.000 | So we can write
00:24:38.580 | Mm-hmm
00:24:47.120 | Yeah, and then what do you want to replace? 'Incontriamo'? 'Incontriamo', maybe, or 'dove'
00:24:54.060 | Yeah, so which one we decide it's fine
00:25:02.320 | Yeah, that's good though she video mojito Marie Joe the wishy country mojito Marie Joe the wishy siamo
00:25:14.840 | I do believe Joe no, no, the wishy Troia mojito Marie Joe the wishy ritroviamo. Yeah, that's that's quite good
00:25:21.540 | Okay, should we try with 'dove', like using the same phrase?
00:25:29.720 | Okay, you can control Z, right, yeah
00:25:33.440 | Okay, let's run it
00:25:44.080 | Dove in the second one call me she country I'm
00:25:52.500 | Not one that she country I was good. Uh-huh. See
00:25:58.320 | It's cool. Yeah cause something say
00:26:01.620 | Let's try another one. Yeah
00:26:04.480 | Okay, let's remove the body
00:26:23.160 | Yeah, yeah go run it
00:26:26.640 | Cause I fire you but you know that's it. That's good. Cause I serve you but she knows the zero
00:26:34.000 | Yeah, cause I sped up it. She knows the zero me
00:26:38.480 | Cause I said it but she knows the zero me
00:26:42.960 | Cause I believe it but she knows the zero me. So I didn't find what we said before which was cause I'll be party
00:26:54.600 | Yeah, it makes sense. Cause I fired with you know, so I said I was gonna serve you but she knows the zero
00:26:59.000 | So you try something hard like grammatically difficult. Hmm
00:27:04.000 | No, I'm thinking I don't know when it's like that, you know, I don't know like
00:27:18.460 | It doesn't come to my mind
00:27:20.460 | Okay, yeah, and then what should we replace? 'Avessimo'
00:27:31.660 | 'Avessimo'. What does that mean? 'If we had'?
00:27:45.260 | That's very good — no, but it's good, because 'avessero' is for
00:27:50.880 | Third person plural. 'Avessimo' is like 'we had'; 'avesse' is third person singular
00:27:57.980 | So if he had or she had yeah choosing something
00:28:02.120 | So what would have happened if we had chosen another day
00:28:11.460 | The first one says 'avesse scelto' — that would be the third person
00:28:16.560 | 'Avessi scelto' — that would be the first person, so 'if I had chosen another day'
00:28:23.500 | 'Aveste scelto' — oh, it's second person plural, so it would be 'if you had chosen another day'
00:28:31.240 | Say I really should this one now. See how it is. Yes. Yeah, this is good. Seven is a shelter
00:28:41.460 | No, maybe no, no, but the first three are very good. Yeah, I
00:28:46.260 | Have an idea. So now if we change to
00:28:50.820 | 'Se'. So if we put 'se loro' — so if we specify the person, maybe it will pick the correct one
00:28:59.960 | So if we put 'se loro'
00:29:02.180 | And then we expect it to say
00:29:06.500 | 'Avessero'. 'Avessero'. So let's run it
00:29:09.540 | You see? That's cool. That's very good
00:29:15.020 | And then the other one is Laura anno. It's right?
00:29:18.420 | I mean the I'm saying well the the verb it's
00:29:23.180 | Incorrect, but yeah, it's in the wrong place, but it's saying the right
00:29:27.020 | Like the meaning is correct. Yeah, but the grammar, it's not correct. Okay. Yeah
00:29:36.100 | It's cool
00:29:38.100 | Because I wasn't sure how far you could push it —
00:29:41.140 | Beyond 'ciao come va' coming back; that was all I tested it with, so I was a little bit worried
00:29:47.220 | Okay anything else? Thank you. You're welcome
00:29:54.420 | Okay, so I think that's a pretty good result
00:29:59.240 | So I mean that that's pretty much everything we needed for building our model
00:30:08.020 | our transformer model
00:30:12.900 | Although — so, we're going to do one more video after this where we're going to
00:30:12.900 | Upload our model to the Hugging Face model hub
00:30:18.420 | And then what we'll be able to do is actually download it directly from Hugging Face
00:30:22.900 | Which I think will be super cool — to do that and figure out how we actually pull that together
00:30:28.260 | So yeah, I think
00:30:30.500 | good result
00:30:32.020 | pretty happy with that and
00:30:34.500 | Thank you for watching and I will see you again in the next one.