
Sentence Similarity With Transformers and PyTorch (Python)


Chapters

0:00 Intro
0:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we're going to have a look at how we can use transformers like BERT to create embeddings
00:00:06.640 | for sentences and how we can then take those sentence vectors and use them to calculate the
00:00:13.600 | semantic similarity between different sentences. So at a high level what you can see on the screen
00:00:21.040 | right now is a BERT base model. Inside BERT base we have multiple encoders and at the bottom we
00:00:29.040 | can see we have our tokenized text, we have 512 tokens here and they get passed into our first
00:00:37.680 | encoder to create these hidden state vectors which are of the size 768 in BERT. Now these get
00:00:52.800 | processed through multiple encoders and between every one of these encoders, there's 12 in total,
00:00:59.200 | there is going to be a vector of size 768 for every single token that we have, so 512 tokens in
00:01:09.840 | this case. Now what we're going to do is take the final tensor out here so this last hidden state
00:01:16.800 | tensor and we're going to use mean pooling to compress it into a 768 by 1 vector and that
00:01:30.800 | is our sentence vector. Then once we've built our sentence vector we're going to use cosine similarity
00:01:39.680 | to compare different sentences and see if we can get something that works.
00:01:47.120 | So switching across to Python, these are the sentences we're going to be comparing and there's
00:01:55.120 | two: there's this one here, which is "three years later the coffin was still full of jello",
00:02:00.480 | and that has the same meaning as this one here. I just rewrote it but with completely different words, so
00:02:09.280 | I don't think there are really any words here that match: instead of "years" we have "dozens of months",
00:02:15.680 | "jelly" for "jello", "person box" for "coffin", right? No normal human would even say that second one; well, no normal
00:02:24.400 | human would probably say either of those, but we definitely wouldn't use "person box" for "coffin"
00:02:30.240 | and "many dozens of months" for "years". So it's reasonably complicated, but we'll see that this
00:02:40.720 | should work for similarity so we'll find that these two share the highest similarity score
00:02:45.840 | after we've encoded them with BERT and calculated our cosine similarity.
00:02:50.640 | And down here is the model we'll be using so we're going to be using sentence transformers
00:02:58.240 | and then the bert-base-nli-mean-tokens model. Now there are two approaches that we can take here:
00:03:04.160 | the easy approach, using something called sentence-transformers, which I'm going to be covering in
00:03:08.800 | another video, and this approach, which is a little more involved, where we're going to be using
00:03:14.320 | transformers and PyTorch. So the first thing we need to do is actually create our last hidden
00:03:23.440 | state tensor. So of course we need to import the libraries that we're going to be using
00:03:29.840 | so from transformers we're going to be using the AutoTokenizer and the AutoModel,
00:03:38.320 | and then we need to import torch as well.
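A minimal sketch of the setup described here and in the next few lines; the exact Hugging Face Hub id is assumed to be sentence-transformers/bert-base-nli-mean-tokens, matching the checkpoint named earlier:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# assumed hub id for the bert-base-nli-mean-tokens model mentioned in the video
model_name = 'sentence-transformers/bert-base-nli-mean-tokens'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```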
00:03:43.200 | And then after we've imported these, we need to first initialize our tokenizer and model,
00:03:52.560 | which we just do with AutoTokenizer, and then for both of these we're going to use from_pretrained,
00:04:00.000 | and we're going to use the model name that we've already defined. So these are coming
00:04:06.080 | from the Hugging Face library, obviously, and we can see the model here, so it's this one,
00:04:14.560 | and then our model is AutoModel.from_pretrained again,
00:04:20.720 | with the same model name. And now what we want to do is tokenize all of our sentences. Now to do this we're going
00:04:32.080 | to use a tokens dictionary and in here we're going to have input IDs and this will contain a list
00:04:41.360 | and you'll see why in a moment and attention mask which will also contain a list.
00:04:48.320 | Now we're going to go through each sentence, and we have to do this one by one: for sentence in sentences,
00:05:01.280 | we are going to be using the tokenizer's encode_plus method. So tokenizer.encode_plus,
00:05:10.960 | and then in here we need to pass our sentence we need to pass the maximum length of our sequence
00:05:18.800 | so with BERT usually we would set this to 512, but because we're using this
00:05:24.480 | bert-base-nli-mean-tokens model, this should actually be set to 128. So we set max_length to 128,
00:05:34.240 | and anything longer than this we want to truncate so we set truncation equal to true and anything
00:05:42.720 | shorter than this which they all will be in our case we set padding equal to the max length to
00:05:49.200 | pad it up to that max length and then here we want to say return tensors and we set this equal to pt
00:05:57.840 | because we're using PyTorch. Now this will return a dictionary containing input IDs and attention
00:06:06.560 | mask for a single sentence so we'll take the new tokens assign it to that variable
00:06:17.200 | and then what we're going to do is access our tokens dictionary, input_ids first, and append
00:06:25.520 | the input IDs for the single sentence from the new tokens variable
00:06:31.760 | so input IDs and then we do the same for our attention mask
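Continuing from the sketch above, the tokenization loop might look roughly like this; the six example sentences aren't fully quoted in the transcript, so the list below is a placeholder, and the `[0]` indexing anticipates the extra-dimension point discussed just below:

```python
# placeholder list: only the first sentence is quoted in the transcript,
# the remaining five examples from the video would go here
sentences = [
    "Three years later, the coffin was still full of Jello.",
]

tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    new_tokens = tokenizer.encode_plus(
        sentence,
        max_length=128,          # the limit used for bert-base-nli-mean-tokens
        truncation=True,
        padding='max_length',
        return_tensors='pt'
    )
    # each returned tensor has shape (1, 128); [0] drops that outer
    # dimension, as discussed just below
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])
```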
00:06:44.320 | okay so
00:06:46.400 | that gives us those. There's another thing as well: these are wrapped with an extra dimension, so we also
00:06:57.040 | want to just extract the first element there, because they're almost like lists
00:07:04.320 | within a list, but in tensor format, and we want to extract the inner list. Now that's good, but obviously
00:07:13.520 | we're using PyTorch here we want PyTorch tensors not lists so within these lists we do have
00:07:20.640 | PyTorch tensors so in fact let me just show you
00:07:24.400 | so if we have a look in here
00:07:29.760 | we'll see that we have our PyTorch tensors but they're contained within a normal Python list
00:07:40.800 | so we can even check that if we do type we see we get lists and inside there we have the torch
00:07:48.560 | tensor which is what we want for all of them so to convert this list of PyTorch tensors into a
00:07:56.000 | single PyTorch tensor what we do is we take this torch and we use the stack method
00:08:08.800 | and what the stack method does is take a list, and within that list we expect PyTorch
00:08:16.720 | tensors, and it will stack all of those on top of each other, essentially adding another dimension
00:08:21.440 | and stacking them all on top of each other, hence the name.
00:08:26.240 | So we take that, and we want to do it for both input IDs and attention mask,
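The stacking step, roughly:

```python
# stack the per-sentence tensors into single (num_sentences, 128) tensors
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

tokens['input_ids'].shape  # torch.Size([6, 128]) with the six video sentences
```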
00:08:35.440 | and then let's have a look at what we have, so let's go attention mask,
00:08:38.560 | or input IDs, and now we just have a single tensor. Okay, so if we check the type,
00:08:46.320 | we now just have a tensor, which is great.
00:09:00.560 | Let's check its size: we have six sentences that have all been encoded into the 128 tokens,
00:09:09.760 | ready to go into our model. So to process these through our model, we'll assign the
00:09:18.640 | outputs to this outputs variable, and we take our model and we pass our tokens as keyword arguments
00:09:29.360 | into the model input there. So we process that, and that will give us this output object, and
00:09:42.720 | inside this output object we have the last hidden state tensor here,
00:09:47.680 | and we can also see that if we print out the keys, you see that we have the last_hidden_state and
00:09:54.640 | we also have this pooler_output. Now we want to take our last hidden state tensor
00:10:02.480 | and then perform the mean pooling operation to convert it into a sentence vector. So to
00:10:13.440 | get that last hidden state, we will assign it to this embeddings variable,
00:10:21.120 | and we extract it using last_hidden_state, like that. And let's just check what we
00:10:33.200 | have here, so we'll just have a look at the shape: you see now we have the six sentences, we have the 128
00:10:40.880 | tokens, and then we have the 768 dimension size, which is just the hidden state dimension within BERT.
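A sketch of the forward pass and the last_hidden_state extraction just described:

```python
# run the tokenized batch through BERT
outputs = model(**tokens)
outputs.keys()  # includes 'last_hidden_state' and 'pooler_output'

embeddings = outputs.last_hidden_state
embeddings.shape  # torch.Size([6, 128, 768]) for the six video sentences
```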
00:10:50.640 | So what we have at the moment is this last hidden state tensor, and what we're going to do
00:10:59.920 | now is convert it into a sentence vector using a mean pooling operation. So the first thing we need to do is
00:11:11.520 | multiply every value within this last hidden state tensor by zero where we shouldn't have
00:11:21.520 | a real token so if we look up here we padded all of these and obviously there's more padding tokens
00:11:29.280 | in this sentence than there are in that sentence. So we need to take each of those attention mask
00:11:37.120 | tensors that we took here which just contain ones and zeros ones where there's real tokens
00:11:42.160 | zeros where there are padding tokens and multiply that out to remove any activations where there
00:11:49.360 | should just be padding tokens, e.g. zeros. Now the only problem is that if we have a look at our
00:11:57.520 | attention mask, so tokens attention_mask, if we have a look at the size we get a six by 128,
00:12:11.520 | so what we need to do is add this other dimension which is the 768 and then we can just multiply
00:12:19.680 | those two tensors together and this will remove the embedding values where there shouldn't be
00:12:25.840 | embedding values. And to do that we'll assign it to mask, but we'll do it later, actually.
00:12:33.600 | So we take the attention mask, and what we want to do is use the unsqueeze method,
00:12:39.600 | and if we look at the shape so we can see what is actually happening here
00:12:46.640 | see that we've added this other dimension and then what that allows us to do is expand that
00:12:53.600 | dimension out to 768 which will then match to the correct shape that we need to multiply
00:13:01.760 | those two together so we do expand and here what we want is we'll take embeddings
00:13:10.240 | and we want to expand it out to the embeddings shape that we have already
00:13:19.120 | used up here so that will compare these two and see that we need to expand this
00:13:26.720 | one dimension out to 768 and if we execute that we can see that it has worked so
00:13:35.840 | the final thing that we need to do there is convert that into a float tensor then we assign
00:13:44.640 | that to the mask here. So this float at the end is just converting it from integer to float.
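As a sketch, the mask construction described over the last few lines:

```python
# (6, 128) -> (6, 128, 1) -> (6, 128, 768), then cast from int to float
mask = tokens['attention_mask'].unsqueeze(-1).expand(embeddings.size()).float()
mask.shape  # torch.Size([6, 128, 768])
```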
00:13:51.040 | so now what we can do is apply this mask to our embeddings so we'll call this one
00:13:59.120 | mask embeddings and it is very simple we just do embeddings multiplied by mask
00:14:09.760 | and now if we just compare embeddings have a look what we have here so it's quite a lot
00:14:16.080 | and now we have a look at mask embeddings
00:14:20.160 | and you see here that we have the same values here so looking at the top these are the same
00:14:31.600 | but then these values here have been mapped to zero because they are just padding tokens we
00:14:40.160 | don't want to pay attention to those so that's the point of the masking operation there
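Applying the mask is then a single element-wise multiplication:

```python
# zero out the embedding positions that correspond to padding tokens
masked_embeddings = embeddings * mask
```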
00:14:50.720 | So that removes those, and now what we want to do is take all of those embeddings, because
00:15:01.120 | if we have a look at the shape that we have,
00:15:02.960 | we still have these 128 token vectors, and we want to compress them down into a single vector.
00:15:12.000 | and there's two operations that we need to do here so we're doing a mean pooling operation
00:15:19.280 | so we need to calculate the sum within each of these so if we summed all these up together
00:15:26.720 | that's what we are going to be doing and pushing them into a single value
00:15:31.040 | and then we also need to count all of those values but only where we were supposed to be
00:15:39.760 | paying attention so where we converted them into zeros we don't want to count those values
00:15:44.160 | and then we divide that sum by the count to get our mean so to get the summed we do torch dot sum
00:15:53.600 | and then it's just masked embeddings
00:15:55.280 | and this is in the dimension one which is this dimension here
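The summing step, roughly:

```python
# sum over the 128-token dimension: (6, 128, 768) -> (6, 768)
summed = torch.sum(masked_embeddings, 1)
summed.shape  # torch.Size([6, 768])
```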
00:16:10.000 | let's have a look at the shape that we have here okay so now we can see that we've removed
00:16:14.960 | this dimension and now what we want to do is create our counts and to do this we use a slightly
00:16:22.320 | different approach we just do torch clamp and then inside here we do mask dot sum
00:16:33.760 | again in the dimension one, and then we also add a min argument here, which
00:16:44.080 | just stops us from creating any divide-by-zero error. So we do 1e, and all this needs to be
00:16:55.600 | is a very small number; I think by default it's 1e-8, but I usually just use
00:17:01.760 | 1e-9, although in reality it shouldn't really make a difference.
00:17:07.200 | and sorry just put counts there okay so that's our sum and our counts and now we get the mean
00:17:21.280 | pooled so we do mean pooled equals summed divided by the counts
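And the counts and mean-pooled vectors, as a sketch:

```python
# count the real (non-padding) tokens; clamp avoids any divide-by-zero
counts = torch.clamp(mask.sum(1), min=1e-9)

# mean pooling: average only over the real tokens
mean_pooled = summed / counts
mean_pooled.shape  # torch.Size([6, 768])
```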
00:17:29.760 | and we'll just check the size of that again okay so that is our sentence vector
00:17:42.080 | so we have six of them here each one contains just 768 values and let's have a look at what
00:17:51.120 | they look like we just get these values here now what we can do is compare each of these
00:17:58.640 | and see which ones get the highest cosine similarity value now we're going to be using the
00:18:08.560 | sklearn implementation which is metrics dot pairwise
00:18:14.160 | we import cosine similarity
00:18:19.280 | and then this would expect numpy arrays obviously we have pytorch tensors so we
00:18:28.160 | are going to get an error. I'm going to show you, so you at least
00:18:31.680 | see it and know how to fix it. So we call cosine similarity, and in here we want to pass
00:18:42.800 | a single vector that we are going to be comparing so i'm going to compare the
00:18:48.480 | first text sentence so if we just take these
00:18:57.680 | put them down here
00:18:58.640 | so i'm going to take the very first one of those which is mean pooled
00:19:06.560 | zero, and because we are extracting this out directly, that means we get it as
00:19:15.120 | a single flat vector; we want it to be in a 2-D format, so we wrap it as a list within a list,
00:19:23.040 | and then we want to extract the remaining so five yeah five sentences so go one all the way to the
00:19:33.200 | end so that's those last five there now if we run this we're going to get this runtime error we go
00:19:39.600 | down and we see "can't call numpy() on tensor that requires grad". So this is just with PyTorch:
00:19:49.360 | this tensor is currently attached to our PyTorch model's computation graph, and we need to detach it
00:19:56.480 | from that graph in order to convert it into something that numpy can read,
00:20:01.840 | and it actually tells us exactly what we need to do: use tensor.detach().numpy() instead. So we take
00:20:08.880 | detach and numpy, and all we need to do is write mean_pooled equals that, and rerun it,
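Putting the cosine-similarity step together, roughly (the first row of mean_pooled is assumed to correspond to the first example sentence):

```python
from sklearn.metrics.pairwise import cosine_similarity

# detach from the computation graph so the tensor can be converted to NumPy
mean_pooled = mean_pooled.detach().numpy()

# compare the first sentence vector against the remaining five;
# wrapping it in a list gives the 2-D shape sklearn expects
scores = cosine_similarity([mean_pooled[0]], mean_pooled[1:])
```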
00:20:23.760 | and we get our similarity scores. So straight away we got 0.33, 0.17, 0.44, 0.55, and this one is the one with
00:20:37.280 | the highest similarity, 0.72, by a fair bit as well. So that is comparing this sentence
00:20:46.640 | and sentence at index one of our last five which is this one so there we've calculated similarity
00:20:58.800 | and it is clearly working so that's it for this video i hope it's been useful i think this is
00:21:05.760 | really cool. And I'll see you in the next one.