
Training and Testing an Italian BERT - Transformers From Scratch #4


Chapters

0:00 Intro
0:35 Review of Code
2:02 Config Object
6:28 Setup For Training
10:30 Training Loop
14:57 Dealing With CUDA Errors
16:17 Training Results
19:52 Loss
21:18 Fill-mask Pipeline For Testing
21:54 Testing With Laura

Transcript

Hi, welcome to the video. This is the fourth video in the Transformers from Scratch mini-series. If you haven't been following along, we've essentially covered what you can see on the screen: we got some data, we built a tokenizer with it, and then we set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. We see here that we have essentially everything we've done so far: we've built our input data and our input pipeline, and we're now at a point where we have a PyTorch DataLoader ready, and we can begin training a model with it. There are a few things to be aware of, so first let's have a quick look at the structure of our data. When we're training a model for masked language modeling, we need a few tensors.

We need three tensors, and this is the same when training RoBERTa, by the way. We have our input IDs, our attention mask, and our labels. Our input IDs have roughly 15% of their values masked, and we can see that here: these two tensors are the labels, which contain the real token IDs, and in our input IDs tensor those positions have been replaced with mask tokens, the number fours. So that's the structure of our input data.
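If it helps to see that written out, here's a rough sketch of what the masking and the dataset/dataloader step look like. This isn't lifted verbatim from the earlier videos; the toy tensors, the MlmDataset name, and the batch size are just placeholders, but the mask token ID of 4 and the roughly 15% masking are the values used in the series.

```python
import torch
from torch.utils.data import Dataset, DataLoader

# toy stand-ins for the real encodings built in the previous video
# (there, input_ids come from the custom Italian tokenizer, shape [N, 512])
input_ids = torch.randint(5, 30_522, (8, 512))
attention_mask = torch.ones(8, 512, dtype=torch.long)

# labels keep the real token IDs; in input_ids, roughly 15% of positions
# are replaced with the mask token (ID 4 in the tokenizer built earlier),
# skipping the low-ID special tokens
labels = input_ids.clone()
mask_arr = (torch.rand(input_ids.shape) < 0.15) & (input_ids > 2)
input_ids[mask_arr] = 4  # <mask>

class MlmDataset(Dataset):
    """Minimal dataset wrapping the three MLM tensors."""
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}

dataset = MlmDataset({'input_ids': input_ids,
                      'attention_mask': attention_mask,
                      'labels': labels})
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
```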

We've created a torch Dataset from it and used that to create a torch DataLoader, and with that we can actually begin setting up our model for training. There are a few things to do first; we can't just begin training straight away. The first thing we need to do is create a RobertaConfig object. The config object is something we use when initializing a transformer from scratch, in order to initialize it with a certain set of parameters. So we'll do that first: we want from transformers import RobertaConfig, and then we create that config object.

We do RobertaConfig, and in here we need to specify different parameters. One of the main ones is the vocab size, and this needs to match whichever vocab size we already used when building our tokenizer. For me, if I go all the way up here to where I created the tokenizer, I can see it, okay.

It's this number here, the 30,522, so I'm going to set that. But if you don't have that, you can just write tokenizer.vocab_size, and that will return your vocab size, so let's replace it with that. As well as that, we also want to set max_position_embeddings, and this needs to be set to your max length plus 2. In this case, max length is set up here; where is it, max length, here, 512.

Plus 2 because we have those added special tokens; if we don't do that, we'll end up with an index error because we're going beyond the embedding limits. Then we want our hidden size. This is the size of the vectors that the embedding layers within RoBERTa will create, so each token, and we have up to 512 of them, will be assigned a vector of size 768. That's a typical number; it originally came from the BERT base model. Then we set the architecture of the deep internals of the model: we want the number of attention heads, which I'm going to set to 12, and also the number of hidden layers. The default for RoBERTa is 12, but I'm going to go with 6 for the sake of keeping training times a little shorter. We also need to add type_vocab_size, which is just 1; that's the number of different token types we have, and we just have one, so we don't need to worry about that. Okay, so that's our configuration object ready, and we can import and initialize a RoBERTa model with that.

So we write from transformers import RobertaForMaskedLM; this is similar to what we usually do. We're training using MLM, masked language modeling, so we want RobertaForMaskedLM, and we initialize our model using that RobertaForMaskedLM class, just passing in our config. That right there initializes our RoBERTa model: a plain RoBERTa model, randomly initialized weights and so on. And now we can move on to setting up everything for training.
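Written out, the config and model initialization described above look roughly like this; the parameter values are the ones mentioned in the video, and the variable names are just my own:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=30_522,            # must match the vocab size of our tokenizer
    max_position_embeddings=514,  # max length (512) + 2 for the special tokens
    hidden_size=768,              # size of the token embedding vectors
    num_attention_heads=12,
    num_hidden_layers=6,          # RoBERTa's default is 12; 6 keeps training shorter
    type_vocab_size=1,            # we only have one token type
)

# a plain RoBERTa MLM model with randomly initialized weights
model = RobertaForMaskedLM(config)
```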

So we have our model, and now we need to prepare a few things before we train it. The first thing is we need to decide which device we're going to be training on, so whether that's the CPU or a CUDA-enabled GPU. To figure out if we have one, we can write torch.cuda.is_available(). The typical way you decide whether you're using CUDA or CPU, the typical line of code that will decide it for you, is you write device and then torch.device.

Then you write 'cuda' inside there if it's available; otherwise we're going to use torch.device('cpu'). Now, CPU just takes a really long time, so if you are using CPU, know that you'll have to leave it overnight for sure, maybe even longer, even if it's just a little bit of data.

It takes so long. But hopefully you have a GPU; if not, you're just going to have to be patient, that's all. Or you could maybe try to use Google Colab, but you'd want the premium version, because otherwise it's just going to shut off after an hour or two.

I don't know, I don't really use it, so I don't know how long it would train for before just deciding it's done, and the GPU is also not that good anyway. However you do it, after that we want to move our model to our device, so whether it's GPU or CPU, we move it over there.
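That device logic is the standard PyTorch pattern; as a sketch, carrying on from the model above:

```python
import torch

# use a CUDA GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# move the model's parameters onto that device
model.to(device)
```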

We're going to get a really big output now; it's just our model, the structure of our model, and we can see a few interesting things. We've got RobertaForMaskedLM, we have the RoBERTa model, and inside that we have our embeddings, and then we have our 12... did I say 12?

I think it was six. Six encoder layers it should be, yeah, it goes from 0 to 5, so 6, and then we have the outputs here, and then our final bit, which is the language modeling head, the MLM head. So that's cool. Now we need our optimizer, so from transformers import AdamW, which is Adam with weight decay. And we're also going to activate the training mode of our model, which gives us loads of output again, so maybe I can just remove that.

There we go, easier. Then our optimizer is going to be AdamW; we need to pass in our model parameters and we need a learning rate. I don't usually use RoBERTa, but looking online this looks like a reasonable learning rate; from what I remember the typical range goes from around here down to around here. Obviously it's going to depend on how much data you have and loads of different things, right?
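As a sketch, the optimizer setup looks something like this. The 1e-5 learning rate is the one used for the first epoch in the video (it gets raised to 1e-4 later on, as you'll see); note that newer versions of transformers have removed their AdamW, in which case torch.optim.AdamW is the drop-in replacement.

```python
from transformers import AdamW  # Adam with decoupled weight decay

# switch the model into training mode (enables dropout etc.)
model.train()

# 1e-5 is used for the first epoch; it gets bumped up to 1e-4 later on
optim = AdamW(model.parameters(), lr=1e-5)
```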

So that's what I'm going to go with, and that should be pretty much it for the setup. Now we're just going to create our training loop. For the training loop we want to import tqdm so we can see how far through we are. We're going to train for two epochs, and we're going to initialize our loop object using tqdm.

So tqdm, and we pass in our data loader; what is the name of that data loader? Let me check — dataloader, cool. So tqdm of dataloader, and we set leave=True. But sorry, I need that in here. Then for batch in loop, and in here we run through each of the steps that we're going to perform for every single training step. The first thing we do is zero the gradients in our optimizer.

So optim.zero_grad(). The reason we do this is that after the first loop our optimizer will have been assigned a set of gradients, which it uses to optimize our model, and on the next loop we don't want those residual gradients to still be there in our optimizer. We want to essentially reset it for the next loop.

So that's what we're doing here. Then we want our tensors, so we have input_ids, and that is going to be batch['input_ids'], and we also want to move that over to our GPU, or CPU if that's what you're on. And it's pretty much the same for all three, so mask and labels, where mask is just the attention mask. Okay, so we've extracted our tensors and we just need to feed them into our model now, so we're going to get the outputs from the model.

We just do model with input_ids, attention_mask equal to mask, and labels equal to labels. So everything has been fed into our model and we have our outputs. Now we need to extract a few things from the output. We need the loss, so we write loss = outputs.loss, and from that we want to calculate the gradients for all of the different parameters in our model.

We need the gradient of that loss with respect to each of those parameters, so we do loss.backward() to backpropagate through all of those values. After we've done that, we use our optimizer to take a step and optimize all of those parameters based on those gradients. That's everything we need to train the model, and then there are just a few extra things for the progress bar. I just want a little bit of information there so I know what's going on, so I write loop.set_description, and there I just want to print out the epoch, so write that. Then I want to set the postfix as well.

So loop dot set post fix and Here I just want to see the loss. So it was the last loss item like that So that should be everything Yeah, let's let's run that see See what happens Hopefully should work. No didn't work Okay See No, no, it's a cute error So Probably just need to refresh everything.

I hate CUDA errors, one moment. Okay, so I finally figured it out; it took so long. So, a few tips anyway: when you do get a CUDA error, switch your device to CPU and then rerun everything, and you should get a more understandable error. If we come down here, I've changed it to CPU, and you see that we get an index error; scroll down, index out of range in self. So the reason for this is...
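That tip is worth keeping: CUDA errors are often raised asynchronously with unhelpful messages, and rerunning on CPU usually surfaces the real Python traceback. It can be as simple as:

```python
# temporarily force everything onto the CPU so the real error surfaces
device = torch.device('cpu')
model.to(device)
# ...then rerun the training cell and read the traceback
```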

You get this error if you don't add the extra two tokens onto the maximum position embeddings here, but, you know, we did add them. So I was pretty confused about that, and then it took me a really long time to realize that this argument name is wrong: there should be an s on the end, max_position_embeddings.

So that was the error. Yeah, that was literally it, and it took me so long to figure that out, but now we have it, that's good. I just need to run everything again, so I'm going to run through everything, remove this cell here where I changed it to CPU because I don't need it now, and just rerun all of that. Okay, so we're back and we've finished training our model now, and it has taken a long time.

This is a few days later, and I made a few changes during training as well, so this definitely wasn't the cleanest training process, because I was kind of updating parameters as it was going along. First, we trained for three and a bit epochs, and I trained on the full dataset as well. If I come up here, do I print out how much data it was? Maybe in another file. So if we come down here, yeah, there's a lot more data here; we have, let me think, 2 million.

Okay, so 2 million samples in that final run. Initially when we started training, we started with a learning rate of 1e-5. I looked into this a little bit and it just was not really moving, and I'll show you in a minute, so for the second epoch I moved it up to 1e-4, and that started moving things a lot quicker, so that was good. In total, like I said, it was 3 and a bit epochs. Beyond that I didn't really change anything; the only thing I did was train one epoch at a time, because I wanted to see how the results were looking after each epoch, and that was quite interesting.
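Changing the learning rate between epochs was just a case of re-creating the optimizer before kicking off the next epoch, along these lines:

```python
# the loss barely moved at 1e-5, so re-create the optimizer with a larger rate
optim = AdamW(model.parameters(), lr=1e-4)
```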

So let me show you that. Okay, this is after the first epoch. What I'm doing here is I've got this fill object, which is a fill-mask pipeline, and I'm entering 'ciao', then putting in our mask token, and then the rest; I wanted it to say 'ciao come va', right, so in the middle it would have to predict 'come'. This is after the first epoch, and you can see it's just putting in random characters.

So a question mark here, three dots here, 'ciao' and 'ciao' again here; kind of weird. So yeah, not the best right now. We move on to the second epoch and it's getting better, but it's still rubbish. Okay, at least it's got words, so here we have a word: 'ciao chi va', or 'chiva'?

Okay, 'ciao chi va'. I don't know if that's the right way to say it; the 'ch' in Italian always gets me messed up, so if there are any Italians watching, I'm sorry. 'Ciao', and so on; you know, at least we're getting words, but none of these make any sense. Okay, so no, it's still not good. Now if we come across again, so this is this one.

Yeah, this one, now we get it. The rest of these are nonsense, okay, so the four here, ignore them. However, at the top we get this score of 0.33, and we get the word we wanted coming back. So that's what we wanted; that's good, it means it's working.

This was after the third and a bit epoch. Let me show you the loss as well. I know this is really messy. So here we have our... I don't know why this one's so short, actually. Why is that one so short? Hmm, strange. Well, maybe I didn't... yeah, the last one doesn't look like I finished training for the full epoch. I thought I did; maybe something happened, I'm not sure, but fine.

This is what it is, that's fine. So the first set of training I did was here, and you see in the middle my computer went to sleep for a bit overnight, because it was just so loud, so I turned it off for a bit, and then it continues going down. Now, this first epoch is when we were at 1e-5, and then here I was testing the 1e-4, and you can see straight away it goes down way quicker.

So I was like, okay, we're going to go with that, it's clearly a lot better, and then it continued over here for the next epoch, and then finally the final one here, which didn't seem to change much anyway. But there was a pretty clear difference. So that's the loss over time, and yeah, we've seen the results from that.

So now we have that, let's move on to actually testing the model. I'm going to bring in Laura, and I'm going to just open the file. Okay, so this is the testing we're going to do: we're using the fill-mask pipeline. What I'm going to do is get Laura to come in, write some Italian sentences, add this mask token in, and see if the results are reasonable or not.
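The test setup itself is just the Hugging Face fill-mask pipeline pointed at the freshly trained model and the tokenizer from earlier in the series; a sketch (the tokenizer path here is a placeholder, not the exact path used in the video):

```python
from transformers import pipeline

# pipelines run on CPU by default, so move the model back off the GPU
model.to('cpu')

# fill-mask pipeline using our trained model and the tokenizer built earlier
fill = pipeline('fill-mask', model=model, tokenizer='./italian-tokenizer')

# mask one word in an Italian sentence and look at the top predictions
print(fill(f'ciao {fill.tokenizer.mask_token} va?'))
```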

So let's see; I will see you in a minute. This is Laura, she can speak Italian, so she's going to go through this and test it a few times and hopefully say it's good. Let's see, hopefully. Ciao. Okay, so all you need to do is: we have a sentence here, you just write some Italian, and then for one of the words in there we replace it with this mask text here. That's going to mask that word, and then the model is going to try to predict what's there, and hopefully it will predict correctly. Let's see.

So just write some Italian phrase, not too difficult yet, and we'll see. "So I don't have to write it all?" No, no, you just write a sentence, okay, do that. Buongiorno... no, buongiorno and then maybe a few more words. "Okay, can I put a comma?" Buongiorno, come... okay, and then which word should we cover? 'Come', here, okay. So just cover it with the mask and see what it says.

So not this; I seem to need to rerun these as well. Okay, let's give it a moment. Yeah, but the second one, 'come', here. Ah, it's almost there. Does 'chi va' mean anything, like 'who'? Yeah, it's like 'is there someone'; I understand it because I'm Italian, but I don't think we usually say that. I'm going to take that as fine, I'm going to take that as good. So let's do it again.

Maybe, yeah, try another one. Oh wait, actually, what about these ones? Okay, but it's just that after 'buongiorno' I wouldn't expect 'chi va'. It's okay. So you can just put another one, like where we put the fill mask again, right in the sentence. So we're here, mm-hmm.

Yeah, so we can write... mm-hmm, yeah. And then what do you want to replace? 'Incontriamo' maybe, or 'dove'? Yeah, whichever one we decide is fine. Yeah, that's good though: reading back through the suggestions we get 'incontriamo', 'siamo', 'ritroviamo'. Yeah, that's quite good.

Yeah, that's quite good. Okay, should we try with 'dove', using the same phrase? Okay, you can ctrl-Z, right, yeah. Okay, let's run it. 'Dove' is there in the second one; that was good. Uh-huh, see, it's cool. Yeah, something like that. Let's try another one.

Yeah. Okay, let's remove this bit. Yeah, yeah, go run it. Reading through the suggestions: the first one, yeah, you know, that's it, that's good, and then the others are all similar variations on the same sentence.

So it didn't find the exact word we wrote before, but yeah, it makes sense; reading the sentence back with the suggestions filled in, it still works. So now try something hard, like something grammatically difficult. Hmm, I'm thinking... I don't know, with a request like that nothing comes to mind straight away. Okay, yeah, and then what should we replace? 'Avessimo'. 'Avessimo'?

What does that mean? 'If we had'? This? That's very good. No, but it is good, because 'avessimo', oh, that's first person plural, it's like 'we had'; 'avesse' is third person singular, so 'if he had' or 'if she had'. Yeah, choosing something, so: what would have happened if we had chosen another day. So the first one says 'avesse scelto'.

It will be the third person Say I miss a shelter. It will be the first person. So if I had chosen another day Say I miss it. Oh, it's a second person plural. So it will be if they had chosen another day Say I really should this one now.

See how it is. Yes. Yeah, this is good. Seven is a shelter No No, maybe no, no, but the first three are very good. Yeah, I Have an idea. So now if we change to Set. So if we put set Laura, so if we specify the person maybe you will take the correct one So if we put set Laura And then we expect it to say a mess Avescero.

'Avessero.' So let's run it. You see? That's cool, that's very good. And then the other one is 'loro hanno', right? I mean, the verb is incorrect, it's in the wrong place, but it's saying the right thing; the meaning is correct, but the grammar is not correct.

Okay. Yeah, it's cool, because I wasn't sure how far you could push it; just getting 'ciao come va' coming back was all I had tested it with, so I was a little bit worried. Okay, anything else? Thank you. You're welcome. Bye. Okay, so I think that's a pretty good result. That's pretty much everything we needed for building our transformer model, although we are going to do one more video after this, where we're going to upload our model to the Hugging Face model hub, and then what we'll be able to do is actually download it directly from Hugging Face, which I think will be super cool, to do that and figure out how we actually pull it all together. So yeah, I think a good result, pretty happy with that, and thank you for watching, and I will see you again in the next one.