Sentence Similarity With Transformers and PyTorch (Python)
Chapters
0:00 Intro
0:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity
Today we're going to have a look at how we can use transformers like BERT to create embeddings for sentences, and how we can then take those sentence vectors and use them to calculate the semantic similarity between different sentences.

So, at a high level, what you can see on the screen right now is a BERT base model. Inside BERT base we have multiple encoders, and at the bottom we can see our tokenized text: we have 512 tokens here, and they get passed into our first encoder to create hidden state vectors, which are of size 768 in BERT. These get processed through multiple encoders, and between every one of these encoders (there are 12 in total) there is a vector of size 768 for every single token that we have, so 512 tokens in this case. What we're going to do is take the final tensor out here, the last hidden state tensor, and use mean pooling to compress it into a 768-by-1 vector, and that is our sentence vector. Then, once we've built our sentence vector, we're going to use cosine similarity to compare different sentences and see if we can get something that works.
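As a quick sketch of that idea in code (a minimal illustration using a random placeholder tensor, not the code from the video):

```python
import torch

# Stand-in for a BERT base last_hidden_state for one sequence: 512 tokens x 768 dims.
last_hidden_state = torch.rand(512, 768)

# Mean pooling over the token dimension gives a single 768-d sentence vector.
sentence_vector = last_hidden_state.mean(dim=0)
print(sentence_vector.shape)  # torch.Size([768])
```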
Switching across to Python, these are the sentences we're going to be comparing. There are two of particular interest: this one here, "Three years later, the coffin was still full of Jello," which has the same meaning as this one here. I just rewrote it with completely different words, so I don't think there are really any words here that match: instead of "years" we have "dozens of months", "jelly" for "Jello", "person box" for "coffin". No normal human would even say that second one. Well, no normal human would probably say either of those, but we definitely wouldn't use "person box" for "coffin" and "many dozens of months" for "years". So it's reasonably complicated, but we'll see that this should work for similarity: we'll find that these two share the highest similarity score after we've encoded them with BERT and calculated our cosine similarity.
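A minimal sketch of the data: only the two sentences quoted above are reconstructed from the transcript, and the remaining four are placeholders (not from the video) so that the batch of six used later still lines up.

```python
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The person box was packed with jelly many dozens of months later.",
    # Four more unrelated sentences (placeholders, not from the video) to make six in total.
    "He drove the car to the shop to get it repaired.",
    "The weather today is bright and sunny.",
    "She plays the violin in the local orchestra.",
    "A quick snack before the meeting kept everyone happy.",
]
```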
And down here is the model we'll be using: the sentence-transformers BERT base NLI mean tokens model. Now, there are two approaches we can take here: the easy approach, using something called sentence-transformers, which I'm going to cover in another video, and this approach, which is a little more involved, where we're going to use transformers and PyTorch directly.
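For reference, the "easy" approach mentioned here would look roughly like this, assuming the sentence-transformers package and that the model identifier below matches the one named in the video:

```python
from sentence_transformers import SentenceTransformer

# Model name assumed from "BERT base NLI mean tokens" in the transcript.
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(sentences)  # one 768-d vector per sentence
```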
So the first thing we need to do is actually create our last hidden state tensor. Of course, we need to import the libraries we're going to be using: from transformers we'll use AutoTokenizer and AutoModel. After we've imported these, we need to initialize our tokenizer and model. For both of these we use from_pretrained with the model name we've already defined; these are coming from the Hugging Face library, obviously, and we can see the model here, so it's this one. Our tokenizer is AutoTokenizer.from_pretrained and our model is AutoModel.from_pretrained, both using that same model name.
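A sketch of that setup, assuming the Hugging Face model id below is the one shown on screen:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Model id assumed from the transcript ("BERT base NLI mean tokens").
model_name = 'sentence-transformers/bert-base-nli-mean-tokens'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```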
Now what we want to do is tokenize all of our sentences. To do this, we're going to use a tokens dictionary, and in there we're going to have input_ids, which will contain a list (you'll see why in a moment), and attention_mask, which will also contain a list. We have to go through each sentence one by one, so for each sentence in sentences we're going to use the tokenizer's encode_plus method. In there we need to pass our sentence and the maximum length of our sequence: with BERT we would usually set this to 512, but because we're using this BERT base NLI mean tokens model it should actually be set to 128, so we set max_length to 128. Anything longer than this we want to truncate, so we set truncation equal to True, and anything shorter than this, which they all will be in our case, we pad up to that max length by setting padding equal to 'max_length'. Then we set return_tensors equal to 'pt', because we're using PyTorch. This returns a dictionary containing input IDs and an attention mask for a single sentence, so we assign it to a new_tokens variable, then access our tokens dictionary, input_ids first, and append the input IDs for that single sentence from new_tokens, and then we do the same for the attention mask. That gives us those. There's one other thing as well: these are wrapped in an extra dimension, almost like lists within a list but in tensor format, so we also want to extract just the first element.
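Putting that tokenization loop together, a sketch might look like this (the tokens and new_tokens names follow the description above):

```python
# Build the tokens dictionary one sentence at a time.
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    new_tokens = tokenizer.encode_plus(
        sentence,
        max_length=128,          # 128 rather than 512 for this NLI model
        truncation=True,
        padding='max_length',
        return_tensors='pt',
    )
    # encode_plus returns (1, 128) tensors; [0] drops the outer batch dimension.
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])
```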
Now, that's good, but obviously we're using PyTorch here, so we want PyTorch tensors, not lists. Within these lists we do have PyTorch tensors; in fact, let me just show you. We'll see that we have our PyTorch tensors, but they're contained within a normal Python list. We can even check that: if we call type, we get a list, and inside it we have the torch tensors, which is what we want, for all of them. To convert this list of PyTorch tensors into a single PyTorch tensor, we take torch and use the stack method. What stack does is take a list, and within that list it expects PyTorch tensors, and it stacks all of those on top of each other, essentially adding another dimension, hence the name. We want to do that for both input_ids and attention_mask. Then let's have a look at what we have: if we check input_ids, we now just have a single tensor, and if we check its type, it's a tensor, which is great. Checking its size, we have six sentences that have each been encoded into 128 tokens, ready to go into our model.
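As a sketch, the stacking step described here:

```python
# Stack the per-sentence tensors into single (6, 128) tensors.
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

print(tokens['input_ids'].shape)  # torch.Size([6, 128])
```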
To process these through our model, we'll assign the result to an outputs variable: we take our model and pass our tokens in as keyword arguments. That gives us this output object, and inside this output object we have the last hidden state tensor. We can also see that if we print out the keys: you'll see that we have last_hidden_state, and we also have this pooler_output. Now, we want to take our last hidden state tensor and perform the mean pooling operation to convert it into a sentence vector. To get that last hidden state, we assign it to an embeddings variable by extracting last_hidden_state, like that. Let's just check what we have here by looking at the shape: you see we have the six sentences, the 128 tokens, and then the 768-dimension size, which is just the hidden state dimension within BERT.
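A sketch of the forward pass and the last hidden state extraction described here:

```python
# Unpack the tokens dict into keyword arguments (input_ids=..., attention_mask=...).
outputs = model(**tokens)
print(outputs.keys())  # includes 'last_hidden_state' and 'pooler_output'

embeddings = outputs.last_hidden_state
print(embeddings.shape)  # torch.Size([6, 128, 768])
```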
What we have at the moment is this last hidden state tensor, and what we're going to do now is convert it into a sentence vector using a mean pooling operation. The first thing we need to do is multiply every value within this last hidden state tensor by zero wherever we shouldn't have a real token. If we look up here, we padded all of these, and obviously there are more padding tokens in this sentence than there are in this one. So we need to take each of those attention mask tensors that we built earlier, which just contain ones and zeros (ones where there are real tokens, zeros where there are padding tokens) and multiply that out to remove any activations where there should just be padding tokens, i.e. set them to zero. The only problem is that if we have a look at our attention mask, tokens['attention_mask'], and check its size, we get six by 128. So what we need to do is add that other dimension, the 768, and then we can just multiply those two tensors together, and this will remove the embedding values where there shouldn't be embedding values. To do that (we'll assign it to mask, but we'll do that in a moment) we take the attention mask and use the unsqueeze method. If we look at the shape, we can see what's actually happening: we've added this other dimension. What that allows us to do is expand that dimension out to 768, which will then match the shape we need in order to multiply those two together. So we call expand, and here we want to expand out to the embeddings shape that we already used up here; comparing the two shapes, we can see that we need to expand this one dimension out to 768, and if we execute that we can see it has worked. The final thing we need to do is convert it into a float tensor and then assign it to mask; the float at the end is just converting it from integer to float.
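A sketch of building that mask:

```python
# Expand the (6, 128) attention mask to (6, 128, 768) so it lines up with the embeddings,
# then convert it from integer to float.
attention_mask = tokens['attention_mask']
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
print(mask.shape)  # torch.Size([6, 128, 768])
```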
Now what we can do is apply this mask to our embeddings. We'll call this one masked_embeddings, and it's very simple: we just do embeddings multiplied by mask. If we compare it to embeddings and have a look at what we have (it's quite a lot), you can see that at the top the values are the same, but these values here have been mapped to zero because they are just padding tokens and we don't want to pay attention to them. That's the point of the masking operation.
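In code, that masking step is just an element-wise multiplication:

```python
# Zero out embedding values at padding positions; real-token positions are unchanged.
masked_embeddings = embeddings * mask
```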
With those removed, what we want to do now is take all of those embeddings: we still have 128 token vectors per sentence, and we want to compress them into one. There are two operations we need to do here, because we're doing a mean pooling operation. First we need to calculate the sum, adding all of these values up along the token dimension into a single value. Then we also need to count those values, but only where we were supposed to be paying attention; where we converted them into zeros, we don't want to count them. Finally we divide the sum by the count to get our mean. To get the sum we do torch.sum along dimension one, which is this dimension here. Let's have a look at the shape: we can see that we've removed that dimension. Now we want to create our counts, and for this we use a slightly different approach: we use torch.clamp, and inside it we do mask.sum, again along dimension one. We also add a min argument, which just stops us from creating any divide-by-zero errors; all this needs to be is a very small number. I think by default it's 1e-8, but I usually just use 1e-9, although in reality it shouldn't really make a difference. Sorry, just put counts there. OK, so that's our sum and our counts, and now we get the mean pooled vectors: mean_pooled equals summed divided by counts. We'll just check the size of that again, and that is our sentence vector: we have six of them here, each containing just 768 values.
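A sketch of that mean pooling calculation:

```python
# Sum the masked token embeddings and count the real tokens along the token dimension.
summed = torch.sum(masked_embeddings, 1)       # shape (6, 768)
counts = torch.clamp(mask.sum(1), min=1e-9)    # shape (6, 768); min avoids divide-by-zero

# Divide to get the mean-pooled sentence vectors.
mean_pooled = summed / counts
print(mean_pooled.shape)  # torch.Size([6, 768])
```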
Let's have a look at what they look like: we just get these values here. Now what we can do is compare each of these and see which ones get the highest cosine similarity value. We're going to use the sklearn implementation, which lives in sklearn.metrics.pairwise. This expects NumPy arrays, and obviously we have PyTorch tensors, so we are going to get an error; I'm going to show it to you so you at least see it and know how to fix it. So we call cosine_similarity, and in here we want to pass a single vector that we're going to be comparing: I'm going to take the very first one of those, which is mean_pooled[0]. Because we're extracting it out directly, we get something like a flat list, and we want it in a 2D format, so we wrap it as a list within a list. Then we want to compare against the remaining five sentences, so we go from index one all the way to the end, which gives us those last five there. Now, if we run this we get a runtime error; scrolling down, we see "Can't call numpy() on Tensor that requires grad". This is just PyTorch telling us the tensor is still attached to our model's computation graph, and we need to detach it in order to convert it into something NumPy can read. It actually tells us exactly what we need to do: use tensor.detach().numpy() instead. So we take detach and numpy, write mean_pooled equals that, and rerun it.
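A sketch of the comparison step, including the detach-and-convert fix:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Detach from the computation graph and convert to NumPy before handing to scikit-learn.
mean_pooled = mean_pooled.detach().numpy()

# Compare the first sentence vector against the remaining five.
scores = cosine_similarity([mean_pooled[0]], mean_pooled[1:])
print(scores)
```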
And we get our similarity scores. Straight away we got 0.3317… for the first one, and this one here has the highest similarity at 0.72, by a fair bit as well. That is comparing this sentence to the sentence at index one of our last five, which is this one. So there we've calculated similarity, and it's clearly working. That's it for this video; I hope it's been useful. I think this is really cool, and I'll see you in the next one.