
Sentence Similarity With Transformers and PyTorch (Python)


Chapters

0:00 Intro
0:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity

Transcript

Today we're going to have a look at how we can use transformers like BERT to create embeddings for sentences and how we can then take those sentence vectors and use them to calculate the semantic similarity between different sentences. So at a high level what you can see on the screen right now is a BERT base model.

Inside BERT base we have multiple encoders, and at the bottom we can see we have our tokenized text, 512 tokens here, and they get passed into our first encoder to create these hidden state vectors, which are of size 768 in BERT. Now these get processed through multiple encoders, and between every one of these encoders (there are 12 in total) there is going to be a vector of size 768 for every single token that we have, so 512 tokens in this case.
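As a rough sanity check, these sizes can be read straight off the model config. This is just a sketch, assuming the standard bert-base-uncased checkpoint; the model used later in the video shares the same architecture.

```python
from transformers import AutoConfig

# Load the configuration for the standard BERT base checkpoint (assumption:
# bert-base-uncased, which matches the architecture described above).
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)        # 12 encoder layers
print(config.hidden_size)              # 768-dimensional hidden state vectors
print(config.max_position_embeddings)  # 512 token positions
```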

Now what we're going to do is take the final tensor out here so this last hidden state tensor and we're going to use mean pooling to compress it into a 768 by 1 vector and that is our sentence vector. Then once we've built our sentence vector we're going to use cosine similarity to compare different sentences and see if we can get something that works.
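In symbols, what the rest of the video builds is roughly this: mean pooling averages the last hidden state H over the real token positions (using the attention mask M), and cosine similarity then compares two of the resulting sentence vectors u and v.

$$
s_j = \frac{\sum_{i} M_i \, H_{ij}}{\sum_{i} M_i},
\qquad
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$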

So switching across to Python, these are the sentences we're going to be comparing, and there are two of interest. There's this one here, which is "Three years later, the coffin was still full of Jello," and that has the same meaning as this one here. I just rewrote it, but with completely different words, so I don't think there are really any words here that match: instead of "years" we have "dozens of months," "jelly" instead of "Jello," and "person box" instead of "coffin." Right, no normal human would even say that second one. Well, no normal human would probably say either of those, but we definitely wouldn't use "person box" for "coffin" and "many dozens of months" for "years."

So it's reasonably complicated, but we'll see that this should work for similarity: we'll find that these two share the highest similarity score after we've encoded them with BERT and calculated our cosine similarity. And down here is the model we'll be using, so we're going to be using sentence-transformers' bert-base-nli-mean-tokens model.
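For reference, a sketch of that setup might look like the following. Only the first sentence is quoted verbatim in the transcript; the paraphrase wording is reconstructed from the clues above ("person box", "jelly", "many dozens of months"), and the remaining sentences are hypothetical fillers so that we end up with six sentences, matching the tensor shapes shown later.

```python
model_name = 'sentence-transformers/bert-base-nli-mean-tokens'

sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl.",                            # hypothetical filler
    "The person box was packed with jelly many dozens of months later.",    # reconstructed paraphrase
    "He found a leprechaun in his walnut shell.",                            # hypothetical filler
    "Standing on one's head at job interviews forms a lasting impression.",  # hypothetical filler
    "It took him a month to finish the meal.",                               # hypothetical filler
]
```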

Now there are two approaches that we can take here: the easy approach, using something called sentence-transformers, which I'm going to be covering in another video, and this approach, which is a little more involved, where we're going to be using transformers and PyTorch directly. So the first thing we need to do is actually create our last hidden state tensor.
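Just for reference, the "easy approach" mentioned here would look roughly like this minimal sketch with the sentence-transformers library; it is not the route taken in the rest of this video.

```python
from sentence_transformers import SentenceTransformer

# One-liner encoding with the sentence-transformers library (the easy approach).
st_model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = st_model.encode(sentences)  # numpy array of shape (6, 768)
```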

So of course we need to import the libraries that we're going to be using: from transformers we're going to be using the AutoTokenizer and the AutoModel, and then we need to import torch as well. And then after we've imported these, we need to first initialize our tokenizer and model, which we do with AutoTokenizer, and for both of these we're going to use from_pretrained with the model name that we've already defined.

So these are coming from the Hugging Face library, obviously, and we can see the model here, so it's this one, and then our model is AutoModel.from_pretrained, again with that same model name. And now what we want to do is tokenize all of our sentences. Now to do this we're going to use a tokens dictionary, and in here we're going to have input_ids, which will contain a list (you'll see why in a moment), and attention_mask, which will also contain a list.
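A sketch of the imports and initialization described here, using the model_name defined above:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Initialize the tokenizer and model from the pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Dictionary that will collect the tokenized form of every sentence.
tokens = {'input_ids': [], 'attention_mask': []}
```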

Now when we're going through each sentence we have to do this one by one, so for each sentence in sentences we're going to be using the tokenizer's encode_plus method. So we call tokenizer.encode_plus, and in here we need to pass our sentence, and we need to pass the maximum length of our sequence; with BERT we would usually set this to 512, but because we're using this bert-base-nli-mean-tokens model, this should actually be set to 128.

So we set max_length to 128, and anything longer than this we want to truncate, so we set truncation equal to True. Anything shorter than this, which they all will be in our case, we pad up to that maximum length, so we set padding equal to 'max_length'. And then here we want to say return_tensors and set this equal to 'pt', because we're using PyTorch.

Now this will return a dictionary containing input IDs and an attention mask for a single sentence, so we'll assign it to a new_tokens variable, and then what we're going to do is access our tokens dictionary, input_ids first, and append the input IDs for that single sentence from the new_tokens variable, and then we do the same for our attention mask. Okay, so that gives us those. There's one other thing as well: these are wrapped as two-dimensional tensors, almost like lists within a list but in tensor format, so we also want to just extract the first element there to get the inner tensor.
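Putting that together, a sketch of the tokenization loop just described (using the sentences list, tokens dictionary, and tokenizer defined above):

```python
for sentence in sentences:
    # encode_plus returns 'input_ids' and 'attention_mask' for this single
    # sentence, each as a tensor of shape (1, 128).
    new_tokens = tokenizer.encode_plus(
        sentence,
        max_length=128,
        truncation=True,
        padding='max_length',
        return_tensors='pt',
    )
    # Pull out the inner (128,) tensor before appending to our lists.
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])
```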

Now that's good, but obviously we're using PyTorch here, so we want PyTorch tensors, not lists. Within these lists we do have PyTorch tensors; in fact, let me just show you. If we have a look in here, we'll see that we have our PyTorch tensors, but they're contained within a normal Python list. We can even check that: if we do type, we see we get a list, and inside there we have the torch tensor, which is what we want, for all of them. So to convert this list of PyTorch tensors into a single PyTorch tensor, what we do is take torch and use the stack method. What the stack method does is take a list, and within that list it expects PyTorch tensors, and it will stack all of those on top of each other, essentially adding another dimension, hence the name. We want to do that for both input_ids and attention_mask, and then let's have a look at what we have. So let's go to input_ids, and now we just have a single tensor; check the type and we just have a tensor. Now that's great; check its size and we have six sentences that have all been encoded into 128 tokens, ready to go into our model.

So to process these through our model, we'll assign the result to this outputs variable: we take our model and pass our tokens as keyword arguments into the model. We process that and it gives us this output object, and inside this output object we have the last hidden state tensor. We can also see, if we print out the keys, that we have the last_hidden_state and we also have this pooler_output.

Now we want to take our last hidden state tensor and perform the mean pooling operation to convert it into a sentence vector. To get that last hidden state, we assign it to this embeddings variable, extracting it using last_hidden_state like that, and let's just check what we have here. If we have a look at the shape, you see we now have the six sentences, the 128 tokens, and then the 768 dimension size, which is just the hidden state dimensionality within BERT.
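A sketch of those steps: stacking the per-sentence token tensors, running the model, and pulling out the last hidden state.

```python
# Stack the per-sentence tensors into single (6, 128) tensors.
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

# Run everything through BERT; the tokens dict is unpacked as keyword arguments.
outputs = model(**tokens)
print(outputs.keys())  # odict_keys(['last_hidden_state', 'pooler_output'])

# The last hidden state is the tensor we mean-pool into sentence vectors.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # torch.Size([6, 128, 768])
```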
So what we have at the moment is this last hidden state tensor, and what we're going to do now is convert it into a sentence vector using a mean pooling operation. The first thing we need to do is multiply every value within this last hidden state tensor by zero wherever we shouldn't have a real token. If we look up here, we padded all of these, and obviously there are more padding tokens in this sentence than there are in this one. So we need to take each of those attention mask tensors that we built earlier, which just contain ones and zeros (ones where there are real tokens, zeros where there are padding tokens), and multiply that out to remove any activations where there should just be padding tokens, i.e. zeros.

Now, the only problem is that if we have a look at our attention mask, tokens['attention_mask'], and check its size, we get a 6 by 128. So what we need to do is add this other dimension, which is the 768, and then we can just multiply those two tensors together, and this will remove the embedding values where there shouldn't be embedding values. To do that (we'll assign it to mask, but we'll do that a little later), we take the attention mask and use the unsqueeze method, and if we look at the shape, so we can see what is actually happening here, you see that we've added this other dimension. What that then allows us to do is expand that dimension out to 768, which will match the shape we need to multiply those two together. So we call expand, and here what we want is to expand it out to the embeddings shape that we already looked at up here; that will compare the two shapes and see that we need to expand this one dimension out to 768, and if we execute that we can see that it has worked. The final thing we need to do there is convert that into a float tensor, and then we assign it to mask; the float at the end is just converting it from integer to float.

So now what we can do is apply this mask to our embeddings. We'll call this one masked_embeddings, and it's very simple: we just do embeddings multiplied by mask. Now if we compare embeddings (have a look at what we have here, it's quite a lot) with masked_embeddings, you see that the values at the top are the same, but these values here have been mapped to zero because they are just padding tokens and we don't want to pay attention to those. That's the point of the masking operation: to remove those.

Now what we want to do is take all of those embeddings, because if we have a look at the shape we still have these 128 tokens, and we want to collapse them down into a single vector. There are two operations we need to do here, since we're doing a mean pooling operation: we need to calculate the sum within each of these, pushing them into a single value, and we also need to count all of those values, but only where we were supposed to be paying attention (where we converted them into zeros we don't want to count those values), and then we divide that sum by the count to get our mean.

To get the summed values we do torch.sum, and then it's just masked_embeddings, in dimension one, which is this dimension here; let's have a look at the shape we have. Okay, so now we can see that we've removed this dimension. Now what we want to do is create our counts, and to do this we use a slightly different approach: we just do torch.clamp, and inside here we do mask.sum, again in dimension one, and we also add a min argument, which just stops us from creating any divide-by-zero error. All this needs to be is a very small number; I think by default it's 1e-8, but I usually just use 1e-9, although in reality it shouldn't really make a difference. So that's our sum and our counts.

And now we get the mean pooled vectors: we do mean_pooled equals summed divided by counts, and we'll just check the size of that again. Okay, so that is our sentence vector; we have six of them here, and each one contains just 768 values. Let's have a look at what they look like, and we just get these values here.
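A sketch of the masking and mean pooling steps just described:

```python
# Expand the attention mask from (6, 128) to (6, 128, 768) so it lines up
# with the embeddings tensor, and convert it to float for the multiplication.
attention_mask = tokens['attention_mask']
mask = attention_mask.unsqueeze(-1).expand(embeddings.shape).float()

# Zero out activations at padding positions.
masked_embeddings = embeddings * mask

# Sum over the 128 token positions, and count the real (non-padding) tokens;
# the min clamp just guards against a divide-by-zero.
summed = torch.sum(masked_embeddings, 1)     # shape (6, 768)
counts = torch.clamp(mask.sum(1), min=1e-9)  # shape (6, 768)

# Mean pooling: average of the real token embeddings per sentence.
mean_pooled = summed / counts
print(mean_pooled.shape)  # torch.Size([6, 768])
```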
Now what we can do is compare each of these and see which ones get the highest cosine similarity value. We're going to be using the sklearn implementation, which is in metrics.pairwise, so we import cosine_similarity. This expects NumPy arrays, and obviously we have PyTorch tensors, so we're going to get an error; I'm going to show you so you at least see it and know how to fix it. So we call cosine_similarity, and in here we want to pass a single vector that we're going to be comparing against. I'm going to compare the first sentence, so if we just take these and put them down here, I'm going to take the very first one of those, which is mean_pooled[0], and because we are extracting this out directly we get something like a list format, when we want it in a matrix format, so we wrap it as a list within a list. Then we want to compare against the remaining five sentences, so we go from one all the way to the end, and that's those last five there.

Now if we run this, we're going to get a runtime error; we go down and see it can't call numpy() on a tensor that requires grad. This is just PyTorch: this tensor is still attached to our PyTorch model's computation graph, and we need to detach it in order to convert it into something outside of PyTorch. It actually tells us exactly what we need to do, use tensor.detach().numpy() instead, so we add detach and numpy, write mean_pooled equals that, rerun it, and we get our similarity scores.

So straight away we get 0.33, then this one here is the highest similarity at 0.72, by a fair bit as well, and then 0.17, 0.44, and 0.55. That highest score is comparing this sentence with the sentence at index one of our last five, which is this one. So there we've calculated similarity, and it is clearly working. That's it for this video; I hope it's been useful, and I think this is really cool.
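A sketch of that final comparison, including the detach-and-convert fix:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Detach from the computation graph and convert to numpy, otherwise sklearn
# raises "Can't call numpy() on Tensor that requires grad".
mean_pooled = mean_pooled.detach().numpy()

# Compare the first sentence vector against the remaining five.
scores = cosine_similarity([mean_pooled[0]], mean_pooled[1:])
print(scores)
```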

And I'll see you in the next one.