
Training BERT #4 - Train With Next Sentence Prediction (NSP)


Chapters

0:00
21:09 Set Up the Input Pipeline for Training
21:57 Initialization Method
27:12 Initialize Our Data Loader
27:58 Model Training Parameters
29:30 Initialize Our Optimizer
30:21 Training Loop

Whisper Transcript

00:00:00.000 | Hi and welcome to the video. Here we're going to have a look at how we can use NSP or next
00:00:07.120 | sentence prediction to train a BERT model. Now in a previous video I covered how NSP works but I
00:00:15.680 | didn't really cover how you actually train a model using it. So that's what we're going to do here.
00:00:21.040 | So we're going to jump straight into it and we have this notebook. Here is the data that we're
00:00:29.760 | going to be using. I will load that in a moment, but the first thing I want to do before that
00:00:35.120 | is import and initialize everything we need. So obviously when we are downloading that data
00:00:42.400 | I'm going to be using requests for that. So I'm going to import requests
00:00:45.760 | and for our actual training of the model we're going to be using both Hugging Face's Transformers
00:00:53.440 | and PyTorch. So I need to import transformers, and I'm going to import a BertTokenizer class
00:01:02.720 | and also a BertForNextSentencePrediction class,
00:01:15.760 | and as well as that we need to import torch. So once we've imported all those we can
00:01:24.800 | initialize our tokenizer and model. So tokenizer equals BertTokenizer.from_pretrained,
00:01:33.120 | and I'm using bert-base-uncased for this example. Obviously you can use another BERT model if you'd
00:01:42.480 | like. So copy that and initialize our model as well. Okay, we can run that, and now let's
00:01:59.280 | extract this data. So this warning here, we don't need to worry about that. It's just saying that if
00:02:04.720 | you want to use this model for inference you shouldn't yet, because it still needs to be trained a little.
00:02:11.440 | So we don't need to worry about that; it's fine, because we are going to be training it.
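The setup described so far looks roughly like this (a sketch; the checkpoint name bert-base-uncased is my reading of the audio, and any BERT checkpoint should work):

```python
import requests
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

# Initialize the tokenizer and the NSP model from the same checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
```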
00:02:15.280 | Now what we do need to do is get this data. So the data equals requests.get
00:02:24.400 | and we just take this link. I will keep this link in the description for you so you can just copy
00:02:32.720 | it across if you want to follow along, and we should see that we get a 200 response there. That's good.
00:02:39.280 | So all we need to do is extract the text from that, and we're going to store it in another
00:02:44.800 | variable here, a text variable, and if we just have a quick look at what we have in there, we see that
00:02:50.320 | we have all of these paragraphs. This is from the Meditations by Marcus Aurelius. You can get that
00:02:56.880 | book online, which is why I'm using it, and the language is a bit unique as well, which also makes it
00:03:01.920 | a good example, because when we're using next sentence prediction we're training the BERT model
00:03:08.320 | to better comprehend the style of language that we train it on. So in here we have our paragraphs,
00:03:16.480 | and they're all separated by a newline character. So I'm just going to add another little bit of
00:03:21.360 | code here, which is a split by the newline character, and if we have a look here we now have a list
00:03:27.920 | containing paragraphs, and that's our training data; that's what we want to be using.
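As a sketch, the download and split step looks something like this; the actual URL isn't shown in the transcript, so the `url` value below is only a placeholder for the link in the video description:

```python
# Download the raw text of Meditations (use the link from the video description)
url = '...'  # placeholder - not specified in the transcript
data = requests.get(url)
print(data.status_code)  # expect a 200 response

# Split the plain text into paragraphs on newline characters
text = data.text.split('\n')
```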
00:03:35.760 | So when we're using NSP we want to create a 50/50 split of sentences that are random and
00:03:45.680 | sentences that are not random. So we're going to be taking sentence As, and 50% of the time we're
00:03:53.920 | going to be adding the genuine sentence B for that sentence A, i.e. the sentence that follows it
00:03:59.680 | in the text, and then the other 50% of the time we're going to be just choosing a random sentence
00:04:05.760 | and pulling that in and using that. So to do that we first want a bag of sentences to actually pull
00:04:17.680 | that text from. So the reason we can't just use text directly is because if we for example look
00:04:24.800 | at this we see that we have multiple sentences in this single paragraph. So if I just split by
00:04:32.400 | period we get this: one, two, three, four, so we get four sentences, and this empty one at the end as
00:04:40.800 | well, which we need to remove. So what I'm going to do is loop through our text here, so the text
00:04:51.440 | variable, split every paragraph by the period character, and append all of those to a new list,
00:05:02.400 | so a flat list containing just sentences, no paragraphs, just sentences. And at the same time
00:05:08.640 | we'll need to make sure we don't include these empty ones, because we get those with, I
00:05:12.800 | think, actually every paragraph in there. So we need to make sure we don't include those.
00:05:19.600 | Now, to create this bag we write something like this. So we want each sentence
00:05:26.320 | from each paragraph, so sentence for each paragraph
00:05:37.200 | in the text. For the sentences, so for sentence in, this is where we're getting sentences from,
00:05:47.280 | so paragraph.split, we split by the period, and as well as that we also need to add the condition
00:05:58.000 | that we don't want any sentences that look like this, so we just add that in: if sentence is not
00:06:04.720 | equal to that, and that should be okay. So let's check the length. Okay, so we get 1372 sentences
00:06:17.120 | from that, and we'll actually want to save this length to a variable because we're using it later.
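The "bag" of sentences can be built with a single list comprehension, roughly as described (`bag_size` being the saved length used later for random sampling):

```python
# Flatten every paragraph into one list of sentences, dropping empty strings
bag = [sentence for paragraph in text
       for sentence in paragraph.split('.') if sentence != '']
bag_size = len(bag)  # 1372 sentences in the video
```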
00:06:23.600 | So we now have the 1,300 or so sentences to sample from, and now what we want to do is loop through
00:06:33.120 | each paragraph within text, choose a sentence A from each paragraph (only if
00:06:39.840 | there are multiple sentences in it), and then 50% of the time select a random
00:06:49.600 | sentence from the bag and append that to the end, and 50% of the time append the actual genuine sentence
00:06:56.000 | onto it, and then we create labels as to whether we have randomized it or not. So to
00:07:03.920 | create that random 50% we import the random library, and we also want to initialize our
00:07:13.440 | sentence A list, sentence B list, and we also need to initialize a label list, okay, so that'll be zero
00:07:26.800 | or one. Now what we want to do is loop through each paragraph in our text, so for paragraph in text,
00:07:35.760 | and then here we extract our sentences like we did with the bag before. So we go sentences
00:07:43.600 | equals, and here we want to write sentence for sentence in paragraph.split,
00:07:56.320 | and remember we have those random empty sentences, we don't want to include those,
00:08:03.920 | so we write if sentence is not equal to that empty sentence. So we're now looping
00:08:16.560 | through each paragraph and we've split each paragraph into sentences. Now what we want to do
00:08:22.560 | is check if that paragraph, i.e. our new sentences variable, has more than one sentence. So we'll do
00:08:32.880 | number of sentences equals the length of sentences, and then we say if the number of sentences
00:08:41.440 | is greater than one (oops, okay, don't execute it right now), and then we apply our 50/50 logic
00:08:52.240 | and append that to our actual training data. Otherwise, if it's just a single sentence, we
00:08:58.640 | don't actually add it to the training data. I mean, ideally we would want to do something like that,
00:09:03.840 | but for this use case I don't want to make things too complicated. So the reason I'm doing
00:09:10.880 | that is, for example, this sentence: it's just a single sentence in that paragraph, and we can't
00:09:15.120 | guarantee that each consecutive paragraph is talking about the same subject, you might switch,
00:09:20.480 | so for the sake of simplicity I'm just going to ignore the single-sentence paragraphs, although
00:09:28.160 | we do have them in our bag, so they can be pulled in as potential sentence Bs when we randomize
00:09:36.240 | the selection. Now what I want to do is set the sentence that we will start from, so we write
00:09:43.360 | start equals random.randint, so this is only if we have more than one sentence, remember,
00:09:54.400 | so this is going to be the start sentence in the case that we use
00:10:01.680 | sentence A and B consecutively, so we don't randomize sentence B.
00:10:06.160 | We want to make sure that we have enough space at the end of our sentences here
00:10:13.840 | to take both sentence A and sentence B. So let's say, for example, I'm going to use
00:10:22.560 | an example here: we have zero, one, two, three, four; let's say this is our paragraph, we have
00:10:28.720 | five sentences in here. For the start sentence, the sentence A, if we select four,
00:10:35.600 | then we don't have a sentence B to select from. So what we're going to do here is say: choose
00:10:43.600 | a random integer between zero and, well, we want three to be the maximum there. Okay, so how do we do that?
00:10:51.680 | Okay, we've got the number of sentences here, so this value will be five in this case,
00:10:57.120 | so we would say the number of sentences, which is five, minus two, because the maximum
00:11:05.600 | value we want to select is three in this case. So it's going to be the number of sentences
00:11:13.280 | minus two. Okay, now what we do is our 50/50, randomized or not randomized, for sentence B.
00:11:22.640 | So if random.random, so this will just select a random float between the values zero and one,
00:11:30.480 | if that is greater than 0.5, then let's say we'll make this our random selection, okay?
00:11:40.880 | So for the random selection, what we do is sentence_b.append, and then here we would
00:11:49.840 | append a random sentence from our bag up here. So to do that we will just write bag,
00:11:57.760 | and then in here we need to select a random integer like we did up here. Okay, so we're going
00:12:04.080 | to use that same function, so random.randint, and that needs to be between zero
00:12:12.080 | and the length of our bag minus one, so we use bag_size, that's why we have it,
00:12:20.240 | so bag_size minus one. Okay, now that will select a random sentence B from the bag for us,
00:12:29.520 | and as well as that we also want to set the label. So our label in this case would be
00:12:36.480 | a one: we have the zero, which means it is the next sentence, and we have the one, which means it is not
00:12:43.920 | the next sentence, so we set one. Now, our sentence A gets selected the same
00:12:53.200 | way no matter whether we have the random sentence B or the not-random sentence B, so
00:13:00.160 | we can actually write our sentence_a.append up here, and this is just going to be
00:13:08.480 | sentences, and in the index we have start, which is our value from here. Okay, so we have the random
00:13:19.680 | option. Now let's do our not-random option. So in here we'd write sentence_b.append,
00:13:28.240 | and this needs to append sentences
00:13:32.400 | at start plus one, so the following sentence after our sentence A,
00:13:42.400 | and our label here would be zero, which means it is the next sentence.
00:13:48.320 | So, that's quite a lot of code; let's run that and see what we get.
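Putting that logic together, the pair-building loop described above looks roughly like this (variable names such as `sentence_a`, `sentence_b`, and `label` follow the ones mentioned in the video):

```python
import random

sentence_a = []
sentence_b = []
label = []

for paragraph in text:
    # Split the paragraph into sentences, skipping empty strings
    sentences = [sentence for sentence in paragraph.split('.') if sentence != '']
    num_sentences = len(sentences)
    if num_sentences > 1:
        # Pick a sentence A that still has a genuine sentence B after it
        start = random.randint(0, num_sentences - 2)
        sentence_a.append(sentences[start])
        if random.random() > 0.5:
            # 50% of the time: random sentence B from the bag -> label 1 (not the next sentence)
            sentence_b.append(bag[random.randint(0, bag_size - 1)])
            label.append(1)
        else:
            # otherwise: the genuine following sentence -> label 0 (is the next sentence)
            sentence_b.append(sentences[start + 1])
            label.append(0)
```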
00:13:53.840 | Okay, now what I want to do is have a look at the first few labels, to see if we have a mix of
00:14:06.720 | different ones in there. Okay, we just have one, one, one, so I'm going to rerun this because I want to
00:14:11.280 | show you the difference between the zeros and ones here. Okay, so we have these, so let me print out
00:14:19.360 | what we have. So for i in range three, I'm just doing this so we can print and see what we
00:14:30.160 | actually have in our training data. So I want to print the label at that index, and then I want to
00:14:40.400 | print the sentence A at that index, and we'll follow that with a newline character and a few
00:14:53.120 | dashes, so we can distinguish between the start and end of sentence A and B, and then we will print
00:14:58.960 | sentence B, and then I'm just going to add a newline there to distinguish it from the next set of results.
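The quick sanity check being described is just a small print loop, something like:

```python
# Print the first few (label, sentence A, sentence B) triples
for i in range(3):
    print(label[i])
    print(sentence_a[i] + '\n---')
    print(sentence_b[i] + '\n')
```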
00:15:05.760 | So see here that we have zero; we have our sentence A, and our sentence B
00:15:18.640 | is a continuation of that first sentence. Because we have that label zero, we know that.
00:15:26.720 | So we have sentence A here, and again this one here is a continuation of this sentence
00:15:34.000 | A. And then down here we have a one, so that's why we've selected a random sentence B,
00:15:40.240 | and if we read this, I know it's not the easiest thing to read,
00:15:52.400 | yeah, there's a reasonably clear difference in the context there. Okay, now this
00:16:00.480 | won't always work; in some cases we might even select the same sentence for sentence A and B,
00:16:06.080 | but for what we're doing here I think this is a completely reasonable way of going about it,
00:16:14.800 | because we don't want to overcomplicate things. If we wanted to be really strict about it we could
00:16:21.040 | add in some extra logic which confirms that we are not getting a sentence B from around the same
00:16:27.360 | area as sentence A, for example. But for now this is, I think, fine. Okay, so we've now prepared our data.
00:16:36.320 | What we need to do now is tokenize it. So to tokenize our data we're just going to use the
00:16:43.760 | tokenizer which we've already initialized, and in here we can actually just pass our sentence A
00:16:50.800 | and sentence B lists like this, and our tokenizer will deal with how to fit both of those together
00:16:56.080 | for us, so that's pretty useful. We're going to be using PyTorch, so we want to return tensors 'pt',
00:17:03.360 | and as well as that we need to truncate or pad each one of those sequences to a maximum length
00:17:15.440 | of 512, so we set truncation equal to True and we set padding equal to 'max_length'.
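That tokenization call looks roughly like this; passing the two lists together lets the tokenizer build the sentence-pair format (the `[SEP]` token and the `token_type_ids`) for us:

```python
inputs = tokenizer(sentence_a, sentence_b,
                   return_tensors='pt',
                   max_length=512,
                   truncation=True,
                   padding='max_length')
```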
00:17:29.280 | That should be okay. So let's have a look at what we have; we see that we have input_ids, token_type_ids,
00:17:36.720 | and attention_mask. Let's have a look at what they look like. So you see here we have
00:17:43.280 | all these different vectors, and that is a single pair of sentence A and sentence B,
00:17:49.040 | and we have quite a few of those. Now, for our token_type_ids, what we would expect is sentence A
00:18:00.400 | would have a token type ID of zero and sentence B would have a token type ID of
00:18:07.280 | one. We don't see those ones in there, so let's expand that out a little bit,
00:18:13.840 | so we'll go with token_type_ids, let's go with number zero. Okay, so now we see. The reason
00:18:25.360 | is because they're in the middle here. So what we're seeing here is sentence A followed by
00:18:31.360 | sentence B, and then these remaining zero tokens are our padding tokens. We can also see that
00:18:39.120 | if we switch across to input_ids, we see that we have all these padding tokens, and as well, another
00:18:47.600 | thing that the tokenizer does for us automatically is add a separator
00:18:52.640 | token in the middle of our sentence A and B, so sentence A is this, sentence B is this. Okay,
00:19:03.520 | so we have our input tensors; we also need to build our labels tensor, and to do that we just
00:19:12.720 | add it to this inputs variable. So we have inputs['labels'] and we set that equal to a torch
00:19:23.120 | LongTensor, and this is a little bit different, so let me just expand that out. So let's say we
00:19:33.360 | just add labels in here, sorry, it's label, and we just get this one big tensor, which is not
00:19:42.640 | really in the correct format that we need. We need each one of these to match to our input_ids,
00:19:53.280 | token_type_ids, and attention_mask. So what I mean by that is, if we just have a look at this input_ids,
00:20:03.760 | you see that we get, it's like a list within a list. We need that for our labels as well;
00:20:10.560 | they're in a different format at the moment, as you can see. So we could try transposing that,
00:20:17.680 | but you see that doesn't actually do anything, because it's just a single dimension, so it's
00:20:22.880 | just switching everything around. So let's remove that transpose and let's add a list inside here.
00:20:30.000 | You see, now we're getting somewhere, not quite there yet. So now we have a list within a list,
00:20:38.240 | and now what we do is we transpose it, and now we get what we need. So we have this
00:20:44.320 | almost vector of each of these, and each one of these here, so this vector matches up to this
00:20:52.080 | value here and this one matches up to this one, and that's what we want. So let's copy that
00:20:59.760 | and we'll put it here. So now we have all of the tensors we need for training our model.
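The labels step described here amounts to wrapping the label list in an extra list and transposing it, so each row lines up with one input sequence:

```python
# Shape (num_pairs, 1), matching the row layout of input_ids, token_type_ids, attention_mask
inputs['labels'] = torch.LongTensor([label]).T
```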
00:21:08.480 | What we now need to do is set up the input pipeline for training. So when we're training,
00:21:14.480 | we're going to need to use a PyTorch DataLoader object, and to create that DataLoader object
00:21:21.920 | we need to create a PyTorch Dataset object from our data. So to do that, we write this:
00:21:30.240 | we're going to be using a dataset class here, so I'm going to call it MeditationsDataset,
00:21:36.320 | and in here we write torch.utils.data.Dataset, so that makes sure that we are using the correct
00:21:48.240 | format for a dataset class. Now we need to define a few methods inside here, so
00:21:57.280 | our initialization method, and for our initialization method we need to be able to
00:22:01.920 | pass in our data. So we'll pass it through this encodings variable, and all we need to do in
00:22:08.960 | here is assign our encodings variable to be an internal attribute of that class,
00:22:15.680 | so we write self.encodings equals encodings. That allows us to create our dataset class,
00:22:24.320 | and then our DataLoader needs two other methods from this class as well: it needs a
00:22:30.960 | __getitem__ method and a __len__ method. So let's do the length method first, it's easier.
00:22:39.920 | So for our length, we don't need to pass anything to this; it's the same as when you
00:22:46.320 | write len and put something inside it, so a list [0, 1], and we get its
00:22:53.280 | length. That's exactly what we're doing here, so this enables you to call that same method
00:23:00.400 | on your class, and inside here all we need to do is return the length. So what length should we
00:23:08.640 | return? Well, if we just do len of inputs we get four, because we only have four keys in there, so
00:23:18.160 | we don't want that; we actually want the number of samples that we have within our inputs. So what we
00:23:26.000 | can do instead is write inputs['input_ids'].shape[0], so we have these 317 items. So I'll just
00:23:37.600 | show you here, see, this is our encoding size, so the max length we set here,
00:23:45.680 | and this is the number of sentences, or sentence pairs, that we have.
00:23:52.400 | So we take that and we return it, but obviously we don't have inputs, we
00:24:03.200 | now have this self.encodings, so swap it like that. And then we want to add this __getitem__
00:24:13.600 | method, and what this does is, given a certain index, it will return a dictionary with your
00:24:22.480 | input_ids, token_type_ids, attention_mask, and labels, which we've created down here,
00:24:28.560 | for that specific index. So we need to allow it to take an index argument; then what we do
00:24:36.720 | is return, let me show you down here what that would look like. So we want to create
00:24:46.480 | a dictionary just like we have up here, but we just want that for that specific index.
00:24:52.240 | So what we write is key, and then we write our value at index, so
00:24:59.200 | maybe it makes more sense for me to write tensor.
00:25:03.680 | So our key is our input_ids, attention_mask, and so on; our tensor is obviously the tensor inside
00:25:12.560 | there, but we have the full tensor containing all 317 items, so then we pull out the index
00:25:21.680 | for that tensor. But we need to make sure we're doing that for each of our items,
00:25:27.680 | because we have multiple tensors here, don't we? We have the four: input_ids,
00:25:32.560 | labels, and so on. So we do for key and tensor in, and in here we would write, let's say we do
00:25:46.160 | inputs.items(). So let me just take that out so you can see; that gives us our dictionary items,
00:25:56.000 | and if we do a for loop, so
00:25:59.200 | for key, tensor in,
00:26:04.240 | we want to print, let's print the key, right? So you see that we're looping through each one of those,
00:26:14.160 | and we also get the tensor out for each one of those as well, but we're specifying a certain
00:26:20.880 | tensor with each one. So let's say we want zero here; we'll get the zeroth tensor and nothing
00:26:30.400 | more. Okay, but we want to specify an index, so we copy that, and that's what we're going to return here,
00:26:40.560 | except here we change it to self.encodings. So that's our class.
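Putting the three methods together, the dataset class sketched in the video looks like this:

```python
class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # Store the tokenized inputs (input_ids, token_type_ids, attention_mask, labels)
        self.encodings = encodings

    def __len__(self):
        # Number of sentence pairs, not the number of dictionary keys
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, idx):
        # Return one row from every tensor, keyed the same way as the inputs dict
        return {key: tensor[idx] for key, tensor in self.encodings.items()}
```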
00:26:47.680 | With that we can initialize our dataset object, so we'll say dataset equals
00:26:57.760 | MeditationsDataset, and then all we need to do is pass our data, which is just
00:27:06.080 | inputs here, like that. Okay, so that's our dataset ready; now we can initialize our DataLoader,
00:27:15.040 | and we do that like this: we do loader equals torch.utils.data.DataLoader,
00:27:30.000 | we pass our dataset object, we also want to specify the batch size, so I'm going to use batches
00:27:37.360 | of 16, and then we also want to shuffle our dataset as well, so we write shuffle
00:27:45.440 | equals True.
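Which comes out to, roughly:

```python
# Wrap the tokenized inputs in the Dataset, then build a shuffled DataLoader
dataset = MeditationsDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```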
00:27:51.360 | That's our DataLoader. So now we just need to set up a few model training parameters.
00:28:01.440 | The first thing we want to do is move our model to GPU, if we have a GPU. So to figure that out,
00:28:12.080 | what we do is write torch.device('cuda'), so we're saying we want to use a
00:28:22.560 | CUDA-enabled GPU if torch.cuda.is_available(). So this will check our environment and check if we
00:28:35.600 | have a CUDA-enabled GPU; if it isn't available, we want to use a torch.device('cpu'). Let's run that and
00:28:47.760 | see. For me, I have a CUDA-enabled GPU, so it comes up with this, and we store that in device. Then
00:28:55.920 | what we can do is move our model, and also move our tensors later on, to that device for training,
00:29:04.160 | so we just write model.to(device), and we'll get a lot of output from that; we just ignore it, we
00:29:12.480 | don't need to worry about it, and we can also activate our model's training mode like
00:29:20.560 | that. Okay, so we've moved our model over to GPU and activated training mode.
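That device setup is just:

```python
# Use a CUDA GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)  # move the model onto that device
model.train()     # switch the model into training mode
```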
00:29:30.000 | Now what we need to do is initialize our optimizer. So we're going to be using Adam with weight decay (AdamW),
00:29:36.240 | and to use that we need to import it from Transformers, so from transformers
00:29:45.440 | import AdamW, and we initialize the optimizer like this: so AdamW, we pass our model
00:29:59.920 | parameters, and we also want to pass the learning rate, which is going to be
00:30:09.200 | 5e-5. Okay, that's a pretty common one for training transformers, and that looks pretty good to me.
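The optimizer setup, using the `AdamW` class shipped with Transformers as in the video (`torch.optim.AdamW` would also work):

```python
from transformers import AdamW

# Adam with weight decay; 5e-5 is a common fine-tuning learning rate
optim = AdamW(model.parameters(), lr=5e-5)
```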
00:30:18.640 | So now we can begin our training loop. First, I want to import something called tqdm. Now,
00:30:28.960 | this is purely for aesthetics, we don't need it for training; it's so that we see a little
00:30:34.880 | progress bar during training, otherwise we don't see anything. So I just want to include that so
00:30:40.160 | we can actually see what is going on, so from tqdm import tqdm. So this is optional, you don't need to
00:30:46.400 | include it, it's up to you, but I would recommend it. We'll train for, let's go with
00:30:55.280 | two epochs. Again, we don't want to train transformer models too much because they will easily overfit,
00:31:02.720 | and to be honest they'll probably overfit on this dataset because it's very small,
00:31:06.240 | but that's fine, we just want to use this as an example. So we're going to train for two epochs,
00:31:15.120 | and because we're using tqdm we want to set up our training loop like this: we wrap it within
00:31:23.040 | a tqdm instance, and all we do here is pass our data loader, so we created that up here, that's
00:31:32.000 | our PyTorch DataLoader, and we also want to write leave equals True, so this is so that we can see
00:31:41.520 | the progress bar, and then we loop through each batch that will be generated by that
00:31:53.040 | loop generator, so for batch in loop. So now we're in our training loop; what we want to do here,
00:32:02.000 | very first thing, is set our optimizer's gradients to zero. So obviously in the very first loop
00:32:13.200 | it's fine, it doesn't matter, but every loop after that our optimizer will have a set of gradients
00:32:19.520 | that have been calculated from the previous loop, and we need to reset those, so we write
00:32:26.320 | optim.zero_grad(). After that we can load in our tensors from our batch here. So we want
00:32:43.600 | input_ids equals batch, and we access it like a dictionary, so batch['input_ids'],
00:32:53.600 | and one other thing that we need to do: our model is on our GPU, so we need to
00:33:00.080 | move the data that we're training on to our GPU as well, so we just write .to(device) there. Okay,
00:33:11.040 | and copy this, so we have a few more; we have all of these that we
00:33:18.960 | created up here, so input_ids, token_type_ids, attention_mask, and labels, we want all of those,
00:33:25.920 | so token_type_ids, attention_mask, and labels.
00:33:35.040 | Okay, so we've zeroed our gradients, we have pulled in our tensors, and now we can
00:33:55.280 | process them through our model. So we do model with input_ids, we have token_type_ids,
00:34:02.160 | we also have the attention_mask,
00:34:11.120 | and we also have our labels. Okay,
00:34:19.840 | so that will create two tensors for us in the outputs: it will create a logits tensor, which is
00:34:26.960 | our prediction, and it will create a loss tensor, which measures the difference between our prediction
00:34:35.120 | and our labels. So let's extract that loss, so we do outputs.loss, and then, after extracting
00:34:46.720 | that loss, which is the overall loss, we need to calculate the gradient of the loss for every
00:34:51.120 | parameter within our model so we can optimize on that, so we just write loss.backward,
00:35:00.000 | yeah, backward, and then we do optim.step, and this will use our optimizer and take a step
00:35:13.680 | to optimize based on the loss that we've calculated here. And that is all we actually
00:35:19.680 | need for our training loop. We do also have the tqdm loop up here as well, so I just want to
00:35:25.920 | use that, and what we're going to do is set the description
00:35:31.120 | of our loop at this current step equal to the epoch.
00:35:38.960 | So this is just purely aesthetics, we don't need this for training, but it's just so we can see
00:35:44.000 | what is going on, and we also want loop.set_postfix, and here I'm going to add in
00:35:50.560 | our loss, which is just going to be loss equals loss.item(), like that.
00:35:59.200 | Now, that should be okay. Let's give it a go and see what happens.
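The full training loop, as described, comes out to something like this:

```python
from tqdm import tqdm  # optional - just gives us a progress bar

epochs = 2

for epoch in range(epochs):
    # Wrap the DataLoader in tqdm so we get a progress bar for each epoch
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # Reset the gradients calculated in the previous step
        optim.zero_grad()
        # Move the batch tensors onto the same device as the model
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Forward pass - returns both logits and the NSP loss
        outputs = model(input_ids, token_type_ids=token_type_ids,
                        attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # Backpropagate and update the model parameters
        loss.backward()
        optim.step()
        # Purely cosmetic: show the epoch and current loss on the progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
```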
00:36:07.440 | Okay, so that looks pretty good. You can see that our model is training, the loss is
00:36:14.080 | reducing. Now, there isn't that much training data, so we're not going to see anything
00:36:19.280 | crazy here, but we can see that it is moving in the right direction, so that's pretty good.
00:36:24.480 | So that's everything for this video. It's a pretty long video,
00:36:28.720 | I've recorded for 41 minutes, it'll probably be a little bit shorter for you,
00:36:34.800 | but yeah, that's long. So that's everything for this video. I hope it's been useful,
00:36:40.880 | and I will see you in the next one.