Sentence Similarity With Transformers and PyTorch (Python)
Chapters
0:00 Intro
0:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity
Today we're going to have a look at how we can use transformers like BERT to create embeddings for sentences, and how we can then take those sentence vectors and use them to calculate the semantic similarity between different sentences.

So, at a high level, what you can see on the screen right now is a BERT base model. Inside BERT base we have multiple encoders, and at the bottom we can see our tokenized text: we have 512 tokens here, and they get passed into our first encoder to create hidden state vectors, which are of size 768 in BERT. These get processed through multiple encoders, and between every one of these encoders (there are 12 in total) there is a vector of size 768 for every single token that we have, so 512 tokens in this case. What we're going to do is take the final tensor out here, the last hidden state tensor, and use mean pooling to compress it into a 768-by-1 vector, and that is our sentence vector. Then, once we've built our sentence vector, we're going to use cosine similarity to compare different sentences and see if we can get something that works.
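As a quick sketch of that idea in code (a minimal illustration using a random placeholder tensor, not the code from the video):

```python
import torch

# Stand-in for a BERT base last_hidden_state for one sequence: 512 tokens x 768 dims.
last_hidden_state = torch.rand(512, 768)

# Mean pooling over the token dimension gives a single 768-d sentence vector.
sentence_vector = last_hidden_state.mean(dim=0)
print(sentence_vector.shape)  # torch.Size([768])
```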
Switching across to Python, these are the sentences we're going to be comparing. There are two of particular interest: this one here, "Three years later, the coffin was still full of Jello," which has the same meaning as this one here. I just rewrote it with completely different words, so I don't think there are really any words here that match: instead of "years" we have "dozens of months", "jelly" for "Jello", "person box" for "coffin". No normal human would even say that second one. Well, no normal human would probably say either of those, but we definitely wouldn't use "person box" for "coffin" and "many dozens of months" for "years". So it's reasonably complicated, but we'll see that this should work for similarity: we'll find that these two share the highest similarity score after we've encoded them with BERT and calculated our cosine similarity.
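A minimal sketch of the data: only the two sentences quoted above are reconstructed from the transcript, and the remaining four are placeholders (not from the video) so that the batch of six used later still lines up.

```python
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The person box was packed with jelly many dozens of months later.",
    # Four more unrelated sentences (placeholders, not from the video) to make six in total.
    "He drove the car to the shop to get it repaired.",
    "The weather today is bright and sunny.",
    "She plays the violin in the local orchestra.",
    "A quick snack before the meeting kept everyone happy.",
]
```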
And down here is the model we'll be using: the sentence-transformers BERT base NLI mean tokens model. Now, there are two approaches we can take here: the easy approach, using something called sentence-transformers, which I'm going to cover in another video, and this approach, which is a little more involved, where we're going to use transformers and PyTorch directly.
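For reference, the "easy" approach mentioned here would look roughly like this, assuming the sentence-transformers package and that the model identifier below matches the one named in the video:

```python
from sentence_transformers import SentenceTransformer

# Model name assumed from "BERT base NLI mean tokens" in the transcript.
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(sentences)  # one 768-d vector per sentence
```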
So the first thing we need to do is actually create our last hidden state tensor. Of course, we need to import the libraries we're going to be using: from transformers we'll use AutoTokenizer and AutoModel. After we've imported these, we need to initialize our tokenizer and model. For both of these we use from_pretrained with the model name we've already defined; these are coming from the Hugging Face library, obviously, and we can see the model here, so it's this one. Our tokenizer is AutoTokenizer.from_pretrained and our model is AutoModel.from_pretrained, both using that same model name.
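A sketch of that setup, assuming the Hugging Face model id below is the one shown on screen:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Model id assumed from the transcript ("BERT base NLI mean tokens").
model_name = 'sentence-transformers/bert-base-nli-mean-tokens'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```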
Now what we want to do is tokenize all of our sentences. To do this, we're going to use a tokens dictionary, and in there we're going to have input_ids, which will contain a list (you'll see why in a moment), and attention_mask, which will also contain a list. We have to go through each sentence one by one, so for each sentence in sentences we're going to use the tokenizer's encode_plus method. In there we need to pass our sentence and the maximum length of our sequence: with BERT we would usually set this to 512, but because we're using this BERT base NLI mean tokens model it should actually be set to 128, so we set max_length to 128. Anything longer than this we want to truncate, so we set truncation equal to True, and anything shorter than this, which they all will be in our case, we pad up to that max length by setting padding equal to 'max_length'. Then we set return_tensors equal to 'pt', because we're using PyTorch. This returns a dictionary containing input IDs and an attention mask for a single sentence, so we assign it to a new_tokens variable, then access our tokens dictionary, input_ids first, and append the input IDs for that single sentence from new_tokens, and then we do the same for the attention mask. That gives us those. There's one other thing as well: these are wrapped in an extra dimension, almost like lists within a list but in tensor format, so we also want to extract just the first element.
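Putting that tokenization loop together, a sketch might look like this (the tokens and new_tokens names follow the description above):

```python
# Build the tokens dictionary one sentence at a time.
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    new_tokens = tokenizer.encode_plus(
        sentence,
        max_length=128,          # 128 rather than 512 for this NLI model
        truncation=True,
        padding='max_length',
        return_tensors='pt',
    )
    # encode_plus returns (1, 128) tensors; [0] drops the outer batch dimension.
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])
```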
Now, that's good, but obviously we're using PyTorch here, so we want PyTorch tensors, not lists. Within these lists we do have PyTorch tensors; in fact, let me just show you. We'll see that we have our PyTorch tensors, but they're contained within a normal Python list. We can even check that: if we call type, we get a list, and inside it we have the torch tensors, which is what we want, for all of them. To convert this list of PyTorch tensors into a single PyTorch tensor, we take torch and use the stack method. What stack does is take a list, and within that list it expects PyTorch tensors, and it stacks all of those on top of each other, essentially adding another dimension, hence the name. We want to do that for both input_ids and attention_mask. Then let's have a look at what we have: if we check input_ids, we now just have a single tensor, and if we check its type, it's a tensor, which is great. Checking its size, we have six sentences that have each been encoded into 128 tokens, ready to go into our model.
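As a sketch, the stacking step described here:

```python
# Stack the per-sentence tensors into single (6, 128) tensors.
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

print(tokens['input_ids'].shape)  # torch.Size([6, 128])
```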
To process these through our model, we'll assign the result to an outputs variable: we take our model and pass our tokens in as keyword arguments. That gives us this output object, and inside this output object we have the last hidden state tensor. We can also see that if we print out the keys: you'll see that we have last_hidden_state, and we also have this pooler_output. Now, we want to take our last hidden state tensor and perform the mean pooling operation to convert it into a sentence vector. To get that last hidden state, we assign it to an embeddings variable by extracting last_hidden_state, like that. Let's just check what we have here by looking at the shape: you see we have the six sentences, the 128 tokens, and then the 768-dimension size, which is just the hidden state dimension within BERT.
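A sketch of the forward pass and the last hidden state extraction described here:

```python
# Unpack the tokens dict into keyword arguments (input_ids=..., attention_mask=...).
outputs = model(**tokens)
print(outputs.keys())  # includes 'last_hidden_state' and 'pooler_output'

embeddings = outputs.last_hidden_state
print(embeddings.shape)  # torch.Size([6, 128, 768])
```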
What we have at the moment is this last hidden state tensor, and what we're going to do now is convert it into a sentence vector using a mean pooling operation. The first thing we need to do is multiply every value within this last hidden state tensor by zero wherever we shouldn't have a real token. If we look up here, we padded all of these, and obviously there are more padding tokens in this sentence than there are in this one. So we need to take each of those attention mask tensors that we built earlier, which just contain ones and zeros (ones where there are real tokens, zeros where there are padding tokens) and multiply that out to remove any activations where there should just be padding tokens, i.e. set them to zero. The only problem is that if we have a look at our attention mask, tokens['attention_mask'], and check its size, we get six by 128. So what we need to do is add that other dimension, the 768, and then we can just multiply those two tensors together, and this will remove the embedding values where there shouldn't be embedding values. To do that (we'll assign it to mask, but we'll do that in a moment) we take the attention mask and use the unsqueeze method. If we look at the shape, we can see what's actually happening: we've added this other dimension. What that allows us to do is expand that dimension out to 768, which will then match the shape we need in order to multiply those two together. So we call expand, and here we want to expand out to the embeddings shape that we already used up here; comparing the two shapes, we can see that we need to expand this one dimension out to 768, and if we execute that we can see it has worked. The final thing we need to do is convert it into a float tensor and then assign it to mask; the float at the end is just converting it from integer to float.
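A sketch of building that mask:

```python
# Expand the (6, 128) attention mask to (6, 128, 768) so it lines up with the embeddings,
# then convert it from integer to float.
attention_mask = tokens['attention_mask']
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
print(mask.shape)  # torch.Size([6, 128, 768])
```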
Now what we can do is apply this mask to our embeddings. We'll call this one masked_embeddings, and it's very simple: we just do embeddings multiplied by mask. If we compare it to embeddings and have a look at what we have (it's quite a lot), you can see that at the top the values are the same, but these values here have been mapped to zero because they are just padding tokens and we don't want to pay attention to them. That's the point of the masking operation.
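In code, that masking step is just an element-wise multiplication:

```python
# Zero out embedding values at padding positions; real-token positions are unchanged.
masked_embeddings = embeddings * mask
```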
With those removed, what we want to do now is take all of those embeddings: we still have 128 token vectors per sentence, and we want to compress them into one. There are two operations we need to do here, because we're doing a mean pooling operation. First we need to calculate the sum, adding all of these values up along the token dimension into a single value. Then we also need to count those values, but only where we were supposed to be paying attention; where we converted them into zeros, we don't want to count them. Finally we divide the sum by the count to get our mean. To get the sum we do torch.sum along dimension one, which is this dimension here. Let's have a look at the shape: we can see that we've removed that dimension. Now we want to create our counts, and for this we use a slightly different approach: we use torch.clamp, and inside it we do mask.sum, again along dimension one. We also add a min argument, which just stops us from creating any divide-by-zero errors; all this needs to be is a very small number. I think by default it's 1e-8, but I usually just use 1e-9, although in reality it shouldn't really make a difference. Sorry, just put counts there. OK, so that's our sum and our counts, and now we get the mean pooled vectors: mean_pooled equals summed divided by counts. We'll just check the size of that again, and that is our sentence vector: we have six of them here, each containing just 768 values.
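A sketch of that mean pooling calculation:

```python
# Sum the masked token embeddings and count the real tokens along the token dimension.
summed = torch.sum(masked_embeddings, 1)       # shape (6, 768)
counts = torch.clamp(mask.sum(1), min=1e-9)    # shape (6, 768); min avoids divide-by-zero

# Divide to get the mean-pooled sentence vectors.
mean_pooled = summed / counts
print(mean_pooled.shape)  # torch.Size([6, 768])
```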
Let's have a look at what they look like: we just get these values here. Now what we can do is compare each of these and see which ones get the highest cosine similarity value. We're going to use the sklearn implementation, which lives in sklearn.metrics.pairwise. This expects NumPy arrays, and obviously we have PyTorch tensors, so we are going to get an error; I'm going to show it to you so you at least see it and know how to fix it. So we call cosine_similarity, and in here we want to pass a single vector that we're going to be comparing: I'm going to take the very first one of those, which is mean_pooled[0]. Because we're extracting it out directly, we get something like a flat list, and we want it in a 2D format, so we wrap it as a list within a list. Then we want to compare against the remaining five sentences, so we go from index one all the way to the end, which gives us those last five there. Now, if we run this we get a runtime error; scrolling down, we see "Can't call numpy() on Tensor that requires grad". This is just PyTorch telling us the tensor is still attached to our model's computation graph, and we need to detach it in order to convert it into something NumPy can read. It actually tells us exactly what we need to do: use tensor.detach().numpy() instead. So we take detach and numpy, write mean_pooled equals that, and rerun it.
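A sketch of the comparison step, including the detach-and-convert fix:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Detach from the computation graph and convert to NumPy before handing to scikit-learn.
mean_pooled = mean_pooled.detach().numpy()

# Compare the first sentence vector against the remaining five.
scores = cosine_similarity([mean_pooled[0]], mean_pooled[1:])
print(scores)
```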
And we get our similarity scores. Straight away we got 0.3317… for the first one, and this one here has the highest similarity at 0.72, by a fair bit as well. That is comparing this sentence to the sentence at index one of our last five, which is this one. So there we've calculated similarity, and it's clearly working. That's it for this video; I hope it's been useful. I think this is really cool, and I'll see you in the next one.