
Sentence Similarity With Sentence-Transformers in Python


Chapters

0:00 intro
2:00 initialize a sentence transformer model
2:40 create our sentence vectors or sentence embeddings
4:14 calculate our cosine similarity

Whisper Transcript

00:00:00.000 | Hi, welcome to this video on using the sentence transformers library to compare similarity
00:00:07.600 | between different sentences. So this is going to be a pretty short video, I'm not going to go
00:00:12.000 | really into depth, I'm just going to show you how to actually use the library. Now if you do want to
00:00:18.400 | go into a little more depth, I have another video that I'll be releasing just before this one and
00:00:26.160 | that will go into what is actually happening here: how we are calculating similarity, or rather
00:00:33.680 | how the BERT model that we'll be using is actually creating those embeddings and then how
00:00:41.360 | we're actually calculating similarity there. So if you're interested in that, go check it out.
00:00:46.720 | Otherwise, if you just want to get a quick similarity score between two sentences,
00:00:54.400 | this is probably the way to go. So we have these six sentences up here and this one,
00:01:01.600 | three years later the coffin was still full of jello and this one, the person box was packed
00:01:08.480 | with jelly many dozens of months later. They're saying the same thing but the second one is
00:01:16.720 | saying it in a way that most of us wouldn't normally say it. Instead of saying coffin,
00:01:21.440 | we're saying person box. Instead of jello, we're saying jelly. I think that's kind of normal
00:01:26.400 | actually and instead of years, we're saying dozens of months. So it's not really sharing the same
00:01:32.960 | words but we're going to see that we can actually find that these two sentences are the most similar
00:01:40.400 | out of all of these. So we're taking those and we're going to be importing the sentence
00:01:48.400 | transformers library. And we want to import SentenceTransformer. And then from that,
00:01:58.880 | we want to initialize a sentence transformer model. So we write SentenceTransformer.
00:02:06.720 | And then in here, we're going to be using this model that I've already defined a model name
00:02:13.360 | for, which is the bert-base-nli-mean-tokens model. So initialize that. I need to rerun that.
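As a quick sketch of the setup so far, assuming the sentence-transformers package is installed; the model name is the one given in the video:

```python
# Minimal sketch of the setup described above; assumes
# `pip install sentence-transformers` has been run.
from sentence_transformers import SentenceTransformer

# 'bert-base-nli-mean-tokens' is the model named in the video
model = SentenceTransformer('bert-base-nli-mean-tokens')
```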
00:02:22.800 | So we have our model and I'll just show you really quickly. This model is coming from the
00:02:29.600 | HuggingFace transformers library behind sentence transformers. So this is the actual model we are
00:02:36.400 | using. Now, first thing we do here is create our sentence vectors or sentence embeddings.
00:02:43.920 | So we'll write sentence_vecs = model.encode(). And all we need to do here is pass
00:02:54.480 | our sentences. So we can pass a single sentence or a list of sentences. It's completely fine.
00:03:03.440 | And then let's just have a quick look at what we have here. So you see that we have this big array.
00:03:09.200 | And if we look at the shape, we see that we have a six by 768 array. So the six refers to our six
00:03:21.200 | sentences here. And the 768 refers to the hidden state size within the BERT model that we're using.
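A sketch of the encoding step. Only two of the six example sentences are quoted in the video, so the rest of the list below uses clearly labeled placeholders:

```python
sentences = [
    "Three years later, the coffin was still full of jello.",
    "The person box was packed with jelly many dozens of months later.",
    # the video's other four sentences aren't quoted; placeholders stand in
    "A third example sentence (placeholder).",
    "A fourth example sentence (placeholder).",
    "A fifth example sentence (placeholder).",
    "A sixth example sentence (placeholder).",
]

# encode() accepts a single string or a list of strings and returns
# one dense vector per sentence as a numpy array
sentence_vecs = model.encode(sentences)
print(sentence_vecs.shape)  # (6, 768) -- six sentences, 768-d BERT hidden size
```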
00:03:31.680 | So each one of these sentences is now being represented by a dense vector containing 768
00:03:38.640 | values. And that means that we are ready to take those and compare similarity between them. So
00:03:45.040 | to do that, we're going to be using the sklearn implementation of cosine similarity,
00:03:52.240 | which we can import like this. So from sklearn.metrics.pairwise
00:04:00.480 | we import cosine_similarity.
00:04:13.280 | And to calculate our cosine similarity, all we do is take that function. And inside here, we pass
00:04:22.480 | our first sentence. So this three years later, the coffin is still full of jello. I want to pass
00:04:28.880 | that sentence vector, which is just at index zero of our sentence_vecs array.
00:04:43.040 | Because we are extracting that single array value, if we just have a look at this,
00:04:50.080 | you see that we have almost like a list of lists here. If we just extract this, we only get a list.
00:04:58.800 | So what we want to do is actually keep that inside a list. Otherwise, we'll get a dimension error.
00:05:04.960 | And then we pass sentence_vecs from index one onwards. So this will be the remaining
00:05:11.200 | sentences. Okay, so let's take these, or let's just bring them down here.
00:05:19.360 | Calculate this. And we can see that our highest similarity by quite a bit is this 0.72. Now,
00:05:30.880 | that means that between this sentence, and this sentence, we have a similarity score of 0.72.
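Putting the similarity step together, a sketch along the lines described above; note the first vector is kept inside a list so it stays two-dimensional:

```python
from sklearn.metrics.pairwise import cosine_similarity

# sentence_vecs[0] on its own is 1-D; wrapping it in a list keeps it
# 2-D, which cosine_similarity expects (otherwise: a dimension error)
scores = cosine_similarity(
    [sentence_vecs[0]],  # the coffin/jello sentence
    sentence_vecs[1:],   # the remaining five sentences
)
print(scores)  # the person-box sentence should score highest (~0.72 in the video)
```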
00:05:38.960 | So clearly, it's working; it's scoring high similarity. And you can play around with this
00:05:46.320 | and test multiple different words and sentences and just see how it works. But that's
00:05:53.760 | the easy way of putting all this together. So I think it's really cool that we can do that so easily.
00:06:00.640 | But I don't think there's really anything else to say about it. So
00:06:04.400 | thank you for watching, and I'll see you in the next one.