Training BERT #4 - Train With Next Sentence Prediction (NSP)
Chapters
0:00 Intro
21:09 Set Up the Input Pipeline for Training
21:57 Initialization Method
27:12 Initialize Our Data Loader
27:58 Model Training Parameters
29:30 Initialize Our Optimizer
30:21 Training Loop
Hi, and welcome to the video. Here we're going to have a look at how we can use NSP, or next sentence prediction, to train a BERT model. Now, in a previous video I covered how NSP works, but I didn't really cover how you actually train a model using it. So that's what we're going to do here.
So we're going to jump straight into it. We have this notebook, and here is the data that we're going to be using. I will load that in a moment, but the first thing I want to do before that is import and initialize everything we need. When we download the data I'm going to be using requests, so I'm going to import requests. For the actual training of the model we're going to be using both Hugging Face's Transformers and PyTorch, so from Transformers I'm going to import the BertTokenizer class and also the BertForNextSentencePrediction class, and as well as that we need to import torch. Once we've imported all of those, we can initialize our tokenizer and model. So tokenizer equals BertTokenizer.from_pretrained, and I'm using bert-base-uncased for this example; obviously you can use another BERT model if you'd like. Copy that and initialize our model as well. Okay, we can run that.
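As a quick sketch, the setup so far looks like this (I'm assuming the bert-base-uncased checkpoint; any BERT checkpoint works):

import requests
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

# tokenizer and NSP model loaded from the same checkpoint (assumed bert-base-uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')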
Now let's extract this data. This warning here, we don't need to worry about it: it's just saying that if you are using this model for inference you shouldn't, because you need to train it a little first. That's fine, because we are going to be training it. What we do need to do is get the data. So data equals requests.get, and we just pass this link; I will keep the link in the description so you can copy it across if you want to follow along. We should see that we get a 200 response there, which is good. All we need to do now is extract the text from the response, and we're going to store it in another variable here, a text variable. If we have a quick look at what's in there, we see that we have all of these paragraphs. This is from Meditations by Marcus Aurelius. You can get the book online, which is why I'm using it, and the language is a bit unique as well, which is another reason to use it: when we're using next sentence prediction, we're training the BERT model to better comprehend the style of language that we train it on. So in here we have our paragraphs, and they're all separated by a newline character. I'm just going to add another little bit of code here, a split by the newline character, and if we have a look we now have a list containing paragraphs. That's our training data; that's what we want to be using.
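Roughly, that download step is the following (the real URL is in the video description; the one below is just a placeholder):

# placeholder URL: use the link from the video description
data = requests.get('https://example.com/meditations.txt')
print(data.status_code)  # expect 200

# one paragraph per line in the source file
text = data.text.split('\n')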
When we're using NSP we want to create a 50/50 split of sentence pairs that are random and pairs that are not random. So we're going to take sentence As, and 50% of the time we're going to add the genuine sentence B for each sentence A, i.e. the sentence that follows it in the text; the other 50% of the time we're just going to choose a random sentence, pull that in, and use that instead. To do that, we first want a bag of sentences to pull that random text from. The reason we can't just use text directly is that, if we look at this example, we have multiple sentences in this single paragraph. If I just split by the period we get one, two, three, four, so four sentences, plus this empty string at the end, which we need to remove. So what I'm going to do is loop through our text variable, split every paragraph by the period character, and append all of those sentences to a new list: a flat list containing just sentences, no paragraphs. At the same time, we need to make sure we don't include those empty strings, which I think we get with almost every paragraph in there.
Now, to create this bag we write something like this. We want each sentence from each paragraph, so: sentence, for paragraph in text, for sentence in paragraph.split on the period. As well as that, we add the condition that we don't want any of those empty sentences, so: if sentence is not equal to the empty string. That should be okay, so let's check the length. Okay, we get 1372 sentences from that, and we'll want to save this length to a variable because we're using it later.
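Sketched in code (the names bag and bag_size follow the narration):

# flatten all paragraphs into one list of sentences, dropping
# the empty strings left behind by splitting on '.'
bag = [sentence for paragraph in text
       for sentence in paragraph.split('.') if sentence != '']
bag_size = len(bag)  # saved here because we sample random sentence Bs from the bag later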
So we now have the 1372 sentences to sample from. What we want to do next is loop through each paragraph within text, choose a sentence A from each paragraph (only from paragraphs with multiple sentences), and then 50% of the time select a random sentence from the bag and append that as sentence B, and the other 50% of the time append the actual genuine next sentence. Then we create labels recording whether or not we randomized. To create that random 50/50 we import the random library, and we also want to initialize our sentence A list, sentence B list, and a label list, which will hold zeros and ones.
Now we loop through each paragraph in our text, so for paragraph in text, and here we extract our sentences like we did with the bag before: sentences equals sentence for sentence in paragraph.split on the period, and remember we have those empty sentences we don't want to include, so if sentence is not equal to the empty string. So we're now looping through each paragraph, and we've split each paragraph into sentences. Next we check whether that paragraph, i.e. our new sentences variable, has more than one sentence. We set the number of sentences equal to the length of sentences, and then say: if the number of sentences is greater than one (don't execute it just yet), we apply our 50/50 logic and append to our training data; otherwise, if it's just a single sentence, we don't add it to the training data at all. Ideally we might want to handle that case too, but for this use case I don't want to make things too complicated. The reason I'm skipping them is that, for example, this sentence here is the only sentence in its paragraph, and we can't guarantee that consecutive paragraphs are talking about the same subject; the topic might switch. So for the sake of simplicity I'm just going to ignore the single-sentence paragraphs, although we do still have them in our bag, so they can be pulled in as potential sentence Bs when we randomize the selection.

Now I want to set the sentence that we will start from, so we write start equals random.randint; this is only when we have more than one sentence, remember. This is going to be the start sentence in the case where we use sentences A and B consecutively, i.e. where we don't randomize sentence B. We want to make sure that we have enough space at the end of our sentences to take both sentence A and sentence B. Let's use an example: say we have sentences zero, one, two, three, four, so five sentences in this paragraph. If we select four as the start sentence, the sentence A, then we don't have a sentence B left to select. So we choose a random integer between zero and, in this case, a maximum of three. How do we get that? We've got the number of sentences, which is five in this case, and the maximum value we want to select is three, so it's the number of sentences minus two.
Now we do our 50/50, randomized or not randomized, for sentence B. So if random.random(), which selects a random float between zero and one, is greater than 0.5, let's make this our random selection. For the random selection, what we do is sentence_b.append, and here we append a random sentence from our bag up above. To do that we just write bag, and inside we need to select a random integer like we did before, using that same function, random.randint, between zero and the length of our bag minus one; so we use bag_size (that's why we have it), bag_size minus one. That will select a random sentence B from the bag for us. As well as that, we also want to set the label. Our label in this case would be a one: a zero means it is the next sentence, a one means it is not the next sentence, so we set one. Now, our sentence A gets selected the same way no matter whether we have the random sentence B or the genuine sentence B, so we can write our sentence_a.append above the condition, and this is just going to be sentences indexed by start, our value from before. So we have the random option; now let's do the not-random option. In here we write sentence_b.append with sentences indexed by start plus one, the sentence following our sentence A, and our label here would be zero, meaning it is the next sentence. That's quite a lot of code, so let's run it and see what we get.
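Putting that all together, the pair-building loop sketched under the assumptions above (random.random() greater than 0.5 triggers the randomized branch):

import random

sentence_a = []
sentence_b = []
label = []

for paragraph in text:
    # split the paragraph into sentences, dropping empty strings
    sentences = [s for s in paragraph.split('.') if s != '']
    num_sentences = len(sentences)
    if num_sentences > 1:
        # leave room at the end so sentences[start + 1] always exists
        start = random.randint(0, num_sentences - 2)
        sentence_a.append(sentences[start])
        if random.random() > 0.5:
            # randomized sentence B pulled from the bag: label 1 (not the next sentence)
            sentence_b.append(bag[random.randint(0, bag_size - 1)])
            label.append(1)
        else:
            # genuine next sentence: label 0 (is the next sentence)
            sentence_b.append(sentences[start + 1])
            label.append(0)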
Okay. Now I want to have a look at the first few labels and see whether we have a mix of different values in there. Okay, we just have one, one, one, so I'm going to rerun this because I want to show you the difference between the zeros and ones. Okay, so we have these; let me print out what we have. So, for i in range three (I'm just doing this so we can print and see what we actually have in our training data): I want to print the label at that index, then print sentence A at that index, followed by a newline character and a few dashes so we can distinguish between the end of sentence A and the start of sentence B. Then we print sentence B, and I'll add a newline there to separate it from the next set of outputs.
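That inspection loop might look something like this (the exact separator characters are my guess):

for i in range(3):
    print(label[i])
    print(sentence_a[i] + '\n---')
    print(sentence_b[i] + '\n')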
Looking at the output, the first pair has a zero, and its sentence B is a continuation of sentence A; because we have that label zero, we know it should be. Here's another sentence A, and again this one is a continuation of that sentence A. Then down here we have a one, which is where we've selected a random sentence B, and if we read it (I know it's not the easiest thing to read), there's a reasonably clear difference in the context. Now, this won't always work; in some cases we might even select the same sentence for both sentence A and sentence B. But for what we're doing here I think this is a completely reasonable way of going about it, because we don't want to overcomplicate things. If we wanted to be really strict about it, we could add some extra logic that confirms we're not pulling a sentence B from around the same area as sentence A, for example, but for now this is, I think, fine. Okay, so we've now prepared our data.
What we need to do now is tokenize it. To tokenize our data we just use the tokenizer we've already initialized, and we can actually pass our sentence A and sentence B lists directly; the tokenizer will deal with fitting both of those together for us, which is pretty useful. We're going to be using PyTorch, so we want to return tensors with 'pt', and as well as that we need to truncate or pad each of those sequences to a maximum length of 512: we truncate with the truncation flag, and we set padding equal to 'max_length'. That should be okay; let's have a look at what we have. We see that we have input_ids, token_type_ids, and attention_mask. Let's have a look at what they look like. You see here we have all these vectors, and each row is a single pair of sentence A and sentence B; we have quite a few of those. Now, for our token_type_ids, what we would expect is that sentence A has a token type ID of zero and sentence B has a token type ID of one. We don't see those ones here, so let's expand it out a little: we'll go with token_type_ids, row zero. Okay, now we see the reason: the ones are in the middle. What we're looking at is sentence A, followed by sentence B, and then the remaining zero tokens are our padding tokens. We can also see that if we switch across to input_ids: we see all those padding tokens, and another thing the tokenizer does for us automatically is add a separator token between sentence A and sentence B. So sentence A is this part, sentence B is this part.
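As code, the tokenization step described here would be:

inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt',
                   max_length=512, truncation=True,
                   padding='max_length')
# inputs now holds input_ids, token_type_ids, and attention_mask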
So we have our input tensors; we also need to build our labels tensor, and to do that we just add it to this inputs variable: inputs['labels'], set equal to a torch.LongTensor. This one is a little different, so let me expand on it. Say we just pass labels in here (sorry, it's label): we get one big flat tensor, which is not really the format we need. We need each label to match up with our input_ids, token_type_ids, and attention_mask rows. What I mean is, if we look at input_ids, it's like a list within a list; we need that for our labels as well, and they're in a different format at the moment, as you can see. We could try transposing, but that doesn't actually do anything, because it's just a single dimension, so it just switches everything around. So let's remove that transpose and add a list around it. Now we're getting somewhere, though not quite there yet: we have a list within a list, and now when we transpose it we get what we need. We have almost a vector for each of these, and each one of these values matches up to a row: this one matches up to this one, and so on, which is what we want. So let's copy that and put it here. Now we have all of the tensors we need for training our model.
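In code, that labels step comes out as the following (wrapping label in an outer list gives shape [1, num_pairs]; the transpose flips it to [num_pairs, 1], so each row lines up with a row of input_ids):

inputs['labels'] = torch.LongTensor([label]).T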
What we need to do next is set up the input pipeline for training. When we're training, we're going to use a PyTorch DataLoader object, and to create that DataLoader we need to create a PyTorch Dataset object from our data. So we're going to use a dataset class, which I'll call MeditationsDataset, and in there we write torch.utils.data.Dataset, which ensures we're using the correct format for a dataset class. Now we need to define a few methods inside it. First, our initialization method: for that we need to be able to pass in our data, so we pass it through this encodings variable, and all we do inside is assign that encodings variable to be an internal attribute of the class, so we write self.encodings equals encodings. That allows us to create our dataset class.
Then our DataLoader needs two more methods from this class: a get-item method and a length method. Let's do the length method first; it's easier. For our length we don't need to pass anything in; it works the same as when you write len() and put something inside it, say a list [0, 1], and get its length back. That's exactly what we're doing here: this enables that same len() call on our class, and inside all we need to do is return the length. So what length should we return? Well, if we just do len of inputs we get four, because there are only four keys in there, and we don't want that; we want the number of samples within our inputs. What we do instead is write inputs['input_ids'].shape[0], which gives us these 317 items. Just to show you: this dimension is our encoding size, the max length of 512 we set earlier, and this, 317, is the number of sentence pairs that we have.
So we take that and return it, but of course inside the class we don't have inputs, we have self.encodings, so swap that in. Then we want this get-item method, and what it does is, given a certain index, return your input_ids, token_type_ids, attention_mask, and labels tensors (which we created down here) for that specific index. So we need to allow it to take an index argument, and then we return; let me show you down here what that looks like. We want to create a dictionary just like the one we have up here, but only for that specific index. What we write is key, and then our value indexed, or maybe it makes more sense to write tensor: our keys are input_ids, attention_mask, and so on, and our tensor is the tensor inside, but it's the full tensor containing all 317 items, so we pull out the single index from that tensor. We need to make sure we do that for each of our items, because we have multiple tensors here, the four of them: input_ids, labels, and so on. So we write: for key and tensor in inputs.items(). Let me take that out so you can see: it gives us our dictionary items, and if we print the key you see we're looping through each one, and we also get the tensor out for each of them. But we're selecting a specific row within each tensor, so if we say we want zero here, we get the zero row and nothing more. We want to specify an index instead, so we copy that, and that's what we're going to return here, except we change inputs to self.encodings. So that's our class.
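Assembled from those three methods, the class sketch:

class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # keep the tokenized tensors (input_ids, token_type_ids,
        # attention_mask, labels) on the instance
        self.encodings = encodings

    def __getitem__(self, idx):
        # one row from every tensor, returned as a dict
        return {key: tensor[idx] for key, tensor in self.encodings.items()}

    def __len__(self):
        # number of sentence pairs, not the number of dictionary keys
        return self.encodings['input_ids'].shape[0]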
With that, we can initialize our dataset object: dataset equals MeditationsDataset, and all we need to do is pass our data, which is just inputs, like that. Okay, so that's our dataset ready; now we can initialize our data loader, and we do that like this: loader equals torch.utils.data.DataLoader, we pass our dataset object, we also want to specify the batch size (I'm going to use batches of 16), and then we also want to shuffle our dataset, so we write shuffle equals True.
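Which looks like:

dataset = MeditationsDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)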
That's our data loader, so now we just need to set up a few model training parameters. The first thing we want to do is move our model to a GPU, if we have one. To figure that out, we write torch.device('cuda'), saying we want to use a CUDA-enabled GPU, if torch.cuda.is_available(); this checks our environment for a CUDA-enabled GPU, and if one isn't available we use torch.device('cpu') instead. Let's run that and see: I have a CUDA-enabled GPU, so it comes up as cuda, and we store that in device. Then we can move our model, and later on our tensors too, over to that device for training. We just write model.to(device); we'll get a lot of output from that, which we can ignore. We can also activate our model's training mode. Okay, so we've moved our model over to the GPU and activated training mode.
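A sketch of that setup:

# pick a CUDA GPU when one is available, otherwise fall back to CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)  # move the model parameters onto the device
model.train()     # switch on training mode (enables dropout etc.)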
Now what we need to do is initialize our optimizer. We're going to be using AdamW, Adam with weight decay, and to use it we need to import it from Transformers: from transformers import AdamW. We initialize the optimizer like this: AdamW, we pass our model parameters, and we also pass the learning rate, which is going to be 5e-5. That's a pretty common one for training transformers, and it looks pretty good to me.
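In code (note that AdamW has since been removed from recent transformers releases; torch.optim.AdamW is the usual drop-in replacement there):

from transformers import AdamW  # on newer versions: from torch.optim import AdamW

optim = AdamW(model.parameters(), lr=5e-5)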
So now we can begin our training loop. First I want to import something called tqdm. This is purely for aesthetics; we don't need it for training, but with it we see a little progress bar during training, and otherwise we see nothing, so I want to include it so we can actually see what's going on. So, from tqdm import tqdm; this is optional, it's up to you, but I'd recommend it. We'll train for, let's go with, two epochs. Again, we don't want to train transformer models too much, because they easily overfit, and to be honest this one will probably overfit on this dataset anyway because it's very small, but that's fine; we just want to use it as an example. So we're going to train for two epochs, and because we're using tqdm we set up our training loop like this: we wrap our data loader, the PyTorch DataLoader we created up here, within a tqdm instance, and we also write leave equals True, so that we can see the progress bar.
Then we loop through each batch generated by that loop, so: for batch in loop. Now we're inside our training loop, and the very first thing we want to do is set our optimizer's gradients to zero. In the very first iteration it doesn't matter, but on every iteration after that our optimizer holds gradients calculated in the previous step, and we need to reset them, so we write optim.zero_grad(). After that we can load our tensors in from the batch. We want input_ids equals batch, accessed like a dictionary, so batch['input_ids'], and one other thing: our model is on the GPU, so we need to move the data we're training on to the GPU as well, which we do with .to(device). Then we copy this for the rest, because we have all of these tensors that we created up here: input_ids, token_type_ids, attention_mask, and labels; we want all of them. So we've reset our gradients and pulled in our tensors, and now we can process them through our model: we call model with input_ids, token_type_ids, attention_mask, and labels.
That will create two tensors for us in the outputs: a logits tensor, which is our prediction, and a loss tensor, which measures the difference between our predictions and our labels. Let's extract that loss with outputs.loss. After extracting that overall loss, we need to calculate a gradient for every parameter within our model so we can optimize on it, so we write loss.backward(). Then we do optim.step(), which uses our optimizer to take an optimization step based on the loss we've just calculated. That's all we actually need for the training loop. We do also have the tqdm loop up here, and I just want to use it: we're going to set the description of our loop at the current step to the epoch. Again, this is purely aesthetics; we don't need it for training, but it lets us see what's going on. We also want loop.set_postfix, where I'm going to add our loss, which is just loss equals loss.item(), like that. Now, that should be okay; let's give it a go and see what happens.
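The full training loop, sketched to match the steps above:

from tqdm import tqdm  # optional: progress bar only

epochs = 2
for epoch in range(epochs):
    # wrap the loader so we get a live progress bar
    loop = tqdm(loader, leave=True)
    for batch in loop:
        optim.zero_grad()  # reset gradients left over from the previous step
        # move every batch tensor onto the same device as the model
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass: returns logits and, since labels are given, a loss
        outputs = model(input_ids, token_type_ids=token_type_ids,
                        attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()  # compute gradients for every parameter
        optim.step()     # take an optimization step
        # cosmetic: show the epoch and current loss on the progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())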
Okay, so that looks pretty good. You can see that our model is training and the loss is decreasing. Now, there isn't that much training data, so we're not going to see anything crazy here, but we can see it is moving in the right direction, and that's pretty good. So, that's everything for this video. It's a pretty long one; I've been recording for 41 minutes, though it'll probably end up a little shorter for you. But yeah, that's long, so that's everything for this video. I hope it's been useful.