
Training BERT #4 - Train With Next Sentence Prediction (NSP)


Chapters

0:00
21:09 Set Up the Input Pipeline for Training
21:57 Initialization Method
27:12 Initialize Our Data Loader
27:58 Model Training Parameters
29:30 Initialize Our Optimizer
30:21 Training Loop

Transcript

Hi, and welcome to the video. Here we're going to have a look at how we can use NSP, or next sentence prediction, to train a BERT model. Now, in a previous video I covered how NSP works, but I didn't really cover how you actually train a model using it.

So that's what we're going to do here. We're going to jump straight into it: we have this notebook, and here is the data that we're going to be using. I will load that in a moment, but the first thing I want to do is import and initialize everything we need.

Obviously, when we're downloading that data I'm going to be using requests, so I'll import requests. For the actual training of the model we're going to be using both Hugging Face's Transformers and PyTorch, so I need to import from Transformers the BertTokenizer class and the BertForNextSentencePrediction class.

As well as that, we need to import torch. Once we've imported all of those, we can initialize our tokenizer and model. So tokenizer equals BertTokenizer.from_pretrained, and I'm using bert-base-uncased for this example. Obviously you can use another BERT model if you'd like.
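As a reference, here's a minimal sketch of those imports and initializations, assuming the bert-base-uncased checkpoint mentioned above:

```python
import requests
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

# load the pretrained tokenizer and the model with the NSP head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
```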

So copy that and initialize our model as well. Okay, we can run that, and now let's extract the data. This warning here is nothing to worry about; it's just saying that you shouldn't use this model for inference as-is, because it still needs to be trained a little.

That's fine for us, because we are going to be training it. Now what we do need to do is get the data. So data equals requests.get, and we just pass in this link. I will keep the link in the description so you can copy it across if you want to follow along, and we should see that we get a 200 response there, which is good.

All we need to do then is extract the text from the response, and we'll store it in another variable, text. If we have a quick look at what's in there, we see that we have all of these paragraphs. This is from Meditations by Marcus Aurelius.

You can get that book online, which is one reason I'm using it, and its language is fairly distinctive as well, which is the other reason: when we use next sentence prediction, we're training the BERT model to better comprehend the style of language we train it on.

So in here we have our paragraphs, and they're all separated by a newline character. I'm just going to add another little bit of code here to split on the newline character, and if we have a look, we now have a list containing paragraphs. That's our training data; that's what we want to be using.
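Put together, the download and split step looks something like this; note the URL below is only a placeholder, since the actual link lives in the video description:

```python
# placeholder URL -- substitute the link from the video description
url = 'https://example.com/meditations/clean.txt'

data = requests.get(url)
print(data.status_code)  # expect a 200 response

# one paragraph per list element
text = data.text.split('\n')
```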

Now, when we're using NSP we want to create a 50/50 split between sentence pairs that are genuine and pairs that are random. So we're going to take each sentence A, and 50% of the time we'll add the genuine sentence B, i.e. the sentence that follows it in the text; the other 50% of the time we'll just choose a random sentence, pull that in, and use it instead.

To do that, we first want a bag of sentences to pull random text from. The reason we can't just use text directly is that, if we look at this example, we have multiple sentences in this single paragraph. If I just split by the period character we get one, two, three, four sentences, plus an empty string at the end, which we need to remove.
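You can see where that empty string comes from with a quick standalone example:

```python
# a paragraph ending in a period leaves an empty string after the final split
para = 'First sentence. Second sentence.'
print(para.split('.'))  # ['First sentence', ' Second sentence', '']
```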

So what I'm going to do is loop through our text variable, split every paragraph into sentences on the period character, and append all of those to a new flat list containing just sentences: no paragraphs, only sentences. At the same time we need to make sure we don't include those empty strings, which we get with, I think, almost every paragraph in there.

So we need to make sure we don't include those. To create this bag, we write a comprehension: we want each sentence from each paragraph, so sentence for paragraph in the text, then for sentence in paragraph.split on the period character, and as well as that we add the condition that we don't want any empty sentences, so if sentence is not equal to the empty string. That should be okay, so let's check the length: we get 1372 sentences from that, and we'll actually want to save that length to a variable, bag_size, because we're using it later.
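That comprehension looks roughly like this:

```python
# flatten every paragraph into one bag of individual sentences,
# dropping the empty strings left over from the split
bag = [sentence for paragraph in text
       for sentence in paragraph.split('.') if sentence != '']
bag_size = len(bag)
print(bag_size)  # 1372 for this dataset
```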

So we now have roughly 1,400 sentences to sample from. What we want to do next is loop through each paragraph in text, choose a sentence A from each paragraph (multi-sentence paragraphs only), and then 50% of the time select a random sentence from the bag and append it as sentence B, and the other 50% of the time append the genuine following sentence. We also create labels recording whether or not each pair was randomized.

To create that random 50%, we import the random library, and we initialize a sentence A list, a sentence B list, and a label list, where each label will be 0 or 1.

Now we loop through each paragraph in our text, so for paragraph in text, and inside the loop we extract our sentences like we did for the bag before: sentences equals sentence for sentence in paragraph.split on the period character, and remember we have those empty strings which we don't want to include, so we add if sentence is not equal to the empty string.

So we're now looping through each paragraph, and we've split each paragraph into its sentences. Next we check whether that paragraph, i.e. our new sentences variable, has more than one sentence: we set num_sentences equal to the length of sentences, and then we say if num_sentences is greater than one, and in that case we apply our 50/50 logic and append to our training data. If it's just a single sentence, we don't add it to the training data at all. Ideally we might want to handle that case too, but for this use case I don't want to make things too complicated. The reason is that we can't guarantee each consecutive paragraph is talking about the same subject; the subject might switch. So for the sake of simplicity I'm just going to ignore single-sentence paragraphs, although they are still in our bag, so they can be pulled in as potential sentence Bs when we randomize the selection.

Next I want to set the sentence that we'll start from, so start equals random.randint; remember, this only happens when we have more than one sentence. This will be the position of sentence A, and in the case where we don't randomize sentence B, we need to make sure there's enough space at the end of the paragraph to take both sentence A and sentence B. Let's use an example: say our paragraph has five sentences, indexed zero, one, two, three, four. If we select index four as our start sentence, our sentence A, then there's no sentence B left to select. So we want to choose a random integer between zero and, in this case, three as the maximum. How do we get that? We have num_sentences, which here is five, and the maximum value we want is three, so it's num_sentences minus two: random.randint(0, num_sentences - 2).

Now we do our 50/50 randomized-or-not selection for sentence B. random.random() selects a random float between zero and one, and if that's greater than 0.5, let's say we make the random selection. For the random selection, we do sentence B append, and we append a random sentence from our bag, so bag indexed by a random integer between zero and the length of the bag minus one, using that same randint function: bag[random.randint(0, bag_size - 1)]; that's why we saved bag_size. As well as that, we set the label, which in this case is 1: a 0 means sentence B is the genuine next sentence, and a 1 means it is not. Our sentence A gets selected the same way whether sentence B is random or not, so we can actually write the sentence A append above the condition, and it's just sentences indexed at start, our value from the randint above. That's the random option; for the not-random option, we append sentences at start plus one, the sentence that follows our sentence A, and our label here is 0, meaning it is the next sentence.
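Here's what that whole block looks like as code, following the logic just described:

```python
import random

sentence_a = []
sentence_b = []
label = []

for paragraph in text:
    # split the paragraph into its sentences, dropping empty strings
    sentences = [s for s in paragraph.split('.') if s != '']
    num_sentences = len(sentences)
    if num_sentences > 1:
        # pick sentence A so that a following sentence B always exists
        start = random.randint(0, num_sentences - 2)
        sentence_a.append(sentences[start])
        if random.random() > 0.5:
            # 50%: random sentence B from the bag, label 1 (not the next sentence)
            sentence_b.append(bag[random.randint(0, bag_size - 1)])
            label.append(1)
        else:
            # 50%: the genuine next sentence, label 0 (is the next sentence)
            sentence_b.append(sentences[start + 1])
            label.append(0)
```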
There's quite a lot of code there, so let's run it and see what we get. First I want to look at the first few labels, to check we have a mix of different values. Okay, we just have one, one, one, so I'm going to rerun it, because I want to show you the difference between the zeros and ones.

Now let me print out what we have in our training data: for i in range three, I'll print the label at that index, then the sentence A at that index, followed by a newline and a few dashes so we can distinguish between the end of sentence A and the start of sentence B, then print sentence B, and then a final newline to separate it from the next set of answers.

You can see here that where we have a 0, sentence B is a continuation of sentence A; because we have that label 0, we know that. And down here we have a 1, which is where we've selected a random sentence B, and if we read it, although it's not the easiest text to read, the difference in context is reasonably clear.

Now, this won't always work; in some cases we might even select the same sentence for both A and B. But for what we're doing here, I think this is a completely reasonable way of going about it, because we don't want to overcomplicate things. If we wanted to be very strict about it, we could add some extra logic confirming that sentence B isn't drawn from around the same area as sentence A, for example, but for now this is, I think, fine.

So we've now prepared our data, and what we need to do next is tokenize it. To tokenize our data we just use the tokenizer we've already initialized, and we can actually pass our sentence A and sentence B lists straight in, and the tokenizer will deal with fitting the two together for us, which is pretty useful. We're going to be using PyTorch, so we want return_tensors set to 'pt', and as well as that we need to truncate or pad each of those sequences to a maximum length of 512, so we set truncation to True and padding to 'max_length'.
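As code, that tokenization step is a single call, something like:

```python
# tokenize all sentence pairs in one go; the tokenizer joins A and B,
# truncating and padding every sequence to 512 tokens
inputs = tokenizer(sentence_a, sentence_b,
                   return_tensors='pt',
                   max_length=512,
                   truncation=True,
                   padding='max_length')
```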
Okay, that should be fine. Let's have a look at what we have: we see input_ids, token_type_ids, and attention_mask. If we look at what they contain, you see all these different vectors; each one is a single pair of sentence A and sentence B, and we have quite a few of those.

For the token_type_ids, what we'd expect is that sentence A has a token type ID of 0 and sentence B has a token type ID of 1. We don't immediately see those ones in there, so let's expand the first one out, token_type_ids at index zero. Now we see them; the reason is that they're in the middle. What we're looking at is sentence A followed by sentence B, and the remaining zero tokens are our padding tokens. We can also see that if we switch across to input_ids: we have all these padding tokens, and another thing the tokenizer does for us automatically is add a separator token between sentence A and sentence B. So sentence A is this part, and sentence B is this part.

So we have our input tensors; we also need to build our labels tensor, and we'll add it to the same inputs variable, as inputs['labels'], set equal to a torch LongTensor. This one is a little different, so let me expand it out. If we just pass label in directly, we get one big one-dimensional tensor, which is not really the format we need: each label needs to match up to one row of our input_ids, token_type_ids, and attention_mask. What I mean is, if you look at input_ids, it's a list within a list, and we need that for our labels as well; at the moment they're in a different format. We could try transposing the tensor, but you see that doesn't actually do anything, because with a single dimension there's nothing to switch around. So instead we remove the transpose and wrap label inside a list first: now we're getting somewhere, because we have a list within a list, and when we transpose that, we get what we need, almost a column vector where each value matches up to one row of input_ids. That's what we want.
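So the labels line ends up looking like this:

```python
# wrap label in a list to get shape (1, N), then transpose to (N, 1)
# so each label row lines up with one row of input_ids
inputs['labels'] = torch.LongTensor([label]).T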
Let's copy that across, and now we have all of the tensors we need for training our model. What we need to do next is set up the input pipeline for training. When we're training, we're going to need a PyTorch DataLoader object, and to create that DataLoader we first need to create a PyTorch Dataset object from our data.

To do that, we write a dataset class, which I'm going to call MeditationsDataset, and it extends torch.utils.data.Dataset, which makes sure we're using the correct format for a dataset class. Now we need to define a few methods inside it. First, the initialization method, which needs to be able to receive our data; we'll pass it in through an encodings argument, and all we need to do is assign it to an internal attribute of the class, so self.encodings equals encodings. That allows us to create the dataset, but our DataLoader needs two more methods from this class: a __getitem__ method and a __len__ method.

Let's do __len__ first, as it's easier. We don't need to pass anything to it; it's exactly the same as when you write len around something, say a list [0, 1], and get its length back. Defining __len__ enables that same call on our class, and inside it all we need to do is return the right length. So which length should we return? Well, if we just do len(inputs) we get 4, because there are only four items in there (input_ids, token_type_ids, attention_mask, and labels), and we don't want that; we want the number of samples within our inputs. What we do instead is return inputs['input_ids'].shape[0]. If you look at the shape, you see 512, which is the max length we set, and 317, which is the number of sentence pairs we have; it's the 317 we want to return. Obviously, inside the class we don't have inputs, we have self.encodings, so we swap that in.

Then we add the __getitem__ method. Given a certain index, it returns your input_ids, token_type_ids, attention_mask, and labels for that specific index, as a dictionary. So it needs to take an index argument, and what we return looks just like the dictionary we have already, restricted to that specific index: for each key and tensor pair, the key stays as it is (input_ids, attention_mask, and so on), and from the full tensor containing all 317 items we pull out the row at that index. You can see how this works by looping over inputs.items(): if we do for key, tensor in inputs.items() and print the key, we're looping through each of those four names, and we get the tensor out for each one as well, so indexing a tensor at, say, zero gives us the zeroth row and nothing more. So the return value is a dictionary comprehension, key mapped to tensor at the index, for key, tensor in self.encodings.items().

That's our class, and with it we can initialize our dataset object: dataset equals MeditationsDataset, and all we need to do is pass our data, which is just inputs. That's our dataset ready, and now we can initialize our DataLoader: loader equals torch.utils.data.DataLoader, we pass our dataset object, we specify the batch size, and I'm going to use batches of 16, and we also want to shuffle our dataset, so we write shuffle equals True. That's our data loader.
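Assembled, the dataset class and loader look something like this:

```python
class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # one row from each tensor (input_ids, token_type_ids,
        # attention_mask, labels), returned as a dictionary
        return {key: tensor[idx] for key, tensor in self.encodings.items()}

    def __len__(self):
        # the number of sentence pairs, not the number of keys
        return self.encodings['input_ids'].shape[0]

dataset = MeditationsDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```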
Now we just need to set up a few model training parameters. The first thing we want to do is move our model to the GPU, if we have one. To figure that out, we write torch.device('cuda'), saying we want to use a CUDA-enabled GPU, if torch.cuda.is_available(), which checks our environment for a CUDA-enabled GPU; if one isn't available, we fall back to torch.device('cpu'). Let's run that and see: for me, I have a CUDA-enabled GPU, so it comes up with cuda. We store that in device, and then we can move our model, and later our tensors, to that device for training. We just write model.to(device); we'll get a lot of output from that, which we can ignore. We can also activate the model's training mode with model.train().

So we've moved our model over to the GPU and activated training mode; now we need to initialize our optimizer. We're going to be using AdamW, Adam with weight decay, and to use it we need to import it from Transformers. We initialize the optimizer as AdamW, passing our model parameters, and we also want to pass the learning rate, which is going to be 5e-5. That's a pretty common value for training transformers, and it looks good to me.

So now we can begin our training loop. First I want to import something called tqdm. This is purely for aesthetics; we don't need it for training, but it gives us a little progress bar during training, and otherwise we'd see nothing, so I want to include it so we can actually see what's going on: from tqdm import tqdm. It's optional, you don't need to include it, but I would recommend it.

We'll train for, let's go with, two epochs. We don't want to train transformer models too much, because they overfit easily, and to be honest the model will probably overfit on this dataset anyway because it's very small, but that's fine; we just want to use this as an example. Because we're using tqdm, we set up the training loop by wrapping our DataLoader in a tqdm instance, passing leave equals True so that we can see the progress bar, and then we loop through each batch generated by that loop object, so for batch in loop.

Now we're inside the training loop. The very first thing we do is set our optimizer's gradients to zero. In the very first iteration it doesn't matter, but in every iteration after that, the optimizer holds gradients calculated from the previous step, and we need to reset them, so we write optim.zero_grad(). After that we can load in the tensors from our batch: input_ids equals batch, accessed like a dictionary, at 'input_ids'. One other thing: our model is on the GPU, so we need to move the data we're training on to the GPU as well, which we do with .to(device). We copy that for the rest, since we have all the tensors we created earlier: token_type_ids, attention_mask, and labels.

So, having initialized our gradients and pulled in our tensors, we can process them through the model: model with input_ids, token_type_ids, the attention mask, and our labels. That will create two tensors in the outputs: a logits tensor, which is our prediction, and a loss tensor, which is the difference between our prediction and our labels.
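For reference, here's the full setup and loop as one sketch; the loss and optimizer steps at the bottom are explained just below. Note this uses the AdamW import from Transformers as in the video; in newer versions of the library you'd use torch.optim.AdamW instead:

```python
from transformers import AdamW
from tqdm import tqdm

# use a CUDA-enabled GPU if one is available, otherwise fall back to CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)   # move the model to the chosen device
model.train()      # activate training mode

optim = AdamW(model.parameters(), lr=5e-5)

epochs = 2
for epoch in range(epochs):
    loop = tqdm(loader, leave=True)  # progress bar around the data loader
    for batch in loop:
        optim.zero_grad()  # reset gradients from the previous step
        # move every tensor in the batch to the same device as the model
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass; passing labels means the outputs include a loss
        outputs = model(input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss
        loss.backward()  # backpropagate to get gradients for every parameter
        optim.step()     # update the weights using those gradients
        # cosmetic: show the current epoch and loss on the progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
```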
So let's extract that loss: loss equals outputs.loss. That's the overall loss, and after extracting it we need gradients for every parameter within our model so we can optimize on them, which we get with loss.backward(). Then we do optim.step(), which uses our optimizer to take a step and update the parameters based on the loss we've calculated. And that's all we actually need for our training loop.

We do also have tqdm up here, so I just want to use it: we're going to set the description of the loop at the current step to the epoch number. This is purely aesthetic, we don't need it for training, but it's just so we can see what's going on. We also want loop.set_postfix, and here I'm going to add in our loss, which is just loss equals loss.item().

Now, that should be everything, so let's give it a go and see what happens. Okay, that looks pretty good: you can see that our model is training and the loss is reducing. There isn't that much training data, so we're not going to see anything crazy here, but we can see it's moving in the right direction, and that's pretty good.

So that's everything for this video. It's a pretty long one; I've been recording for 41 minutes, and it'll probably come out a little shorter for you, but that's still long. I hope it's been useful, and I will see you in the next one.