Sentence Similarity With Sentence-Transformers in Python
Chapters
0:00 intro
2:00 initialize a sentence transformer model
2:40 create our sentence vectors or sentence embeddings
4:14 calculate our cosine similarity
Hi, welcome to this video on using the sentence-transformers library to compare similarity between different sentences. This is going to be a pretty short video; I'm not going to go really into depth, I'm just going to show you how to actually use the library. If you do want to go into a little more depth, I have another video that I'll be releasing just before this one, and that goes into what is actually happening here: how the BERT model we'll be using creates those embeddings, and how we then calculate similarity from them. So if you're interested in that, go check it out. Otherwise, if you just want to get a quick similarity score between two sentences,
this is probably the way to go.

So we have these six sentences up here, including these two: "Three years later, the coffin was still full of Jello" and "The person box was packed with jelly many dozens of months later". They're saying the same thing, but the second one says it in a way that most of us wouldn't normally say it. Instead of "coffin" we're saying "person box"; instead of "Jello" we're saying "jelly" (that one is kind of normal, actually); and instead of "years" we're saying "dozens of months". So they're not really sharing the same words, but we're going to see that we can find that these two sentences are the most similar out of all six.
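For reference, here's a minimal sketch of what that sentence list could look like in code. Only the two sentences quoted above come from the video; the other four are hypothetical placeholders standing in for the rest of the list.

    sentences = [
        "Three years later, the coffin was still full of Jello.",
        "The person box was packed with jelly many dozens of months later.",
        # The four below are placeholders; the video uses its own other four sentences
        "He drove his car to the shop to get it repaired.",
        "The mechanic fixed the vehicle at the garage.",
        "She enjoys reading novels on rainy afternoons.",
        "Bright sunshine makes the beach a popular spot.",
    ]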
To get started, we import the sentence-transformers library, and from that we want to initialize a sentence transformer model. So we write SentenceTransformer, and in here we're going to be using a model that I've already defined a model name for, which is the bert-base-nli-mean-tokens model. So we initialize that (I need to rerun that).
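A minimal sketch of that step, assuming the sentence-transformers package is installed:

    from sentence_transformers import SentenceTransformer

    # BERT base model fine-tuned on NLI data, with mean pooling over token embeddings
    model = SentenceTransformer('bert-base-nli-mean-tokens')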
So we have our model, and I'll just show you really quickly: this model comes from the Hugging Face transformers library that sits behind sentence-transformers. So this is the actual model we are
using. Now, the first thing we do here is create our sentence vectors, or sentence embeddings. So we'll call sentence_vecs = model.encode(), and all we need to do here is pass our sentences. We can pass a single sentence or a list of sentences; it's completely fine.
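In code, using the sentences list from above:

    # Encode all six sentences in one call; a single string would also work
    sentence_vecs = model.encode(sentences)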
Then let's have a quick look at what we have here. You see that we have this big array, and if we look at the shape, we see that it's a 6 by 768 array. The 6 refers to our six sentences, and the 768 refers to the hidden state size within the BERT model that we're using. So each one of these sentences is now represented by a dense vector containing 768 values, and that means we are ready to take those and compare similarity between them.
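A quick shape check confirms this:

    print(sentence_vecs.shape)  # (6, 768): six sentences, 768 values each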
To do that, we're going to be using the scikit-learn implementation of cosine similarity, which lives in sklearn.metrics.pairwise and which we can import like this:
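    from sklearn.metrics.pairwise import cosine_similarity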
And to calculate our cosine similarity, all we do is take that function, and inside here we pass our first sentence, "Three years later, the coffin is still full of Jello". We want to pass that sentence's vector, which is just index 0 of our sentence_vecs array. Note that indexing extracts a single array value: if we have a look at sentence_vecs, you see that we have almost like a list of lists, and if we extract one element, we only get a list. So what we want to do is keep that vector inside a list; otherwise we'll get a dimension error. And then we pass sentence_vecs from index 1 onwards, which is the remaining sentences. Okay, so let's bring this down here.
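Putting that together, under the same assumptions as above:

    # Compare sentence 0 against sentences 1-5; wrapping the first vector
    # in a list keeps it two-dimensional, avoiding the dimension error
    scores = cosine_similarity([sentence_vecs[0]], sentence_vecs[1:])
    print(scores)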
Calculating this, we can see that our highest similarity, by quite a bit, is this 0.72. That means that between the first sentence and the "person box" sentence, we have a similarity score of 0.72.
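If you want to pull out the best match programmatically, here's a small sketch (the +1 accounts for sentence 0 being held out as the query):

    best = scores.argmax() + 1  # index into the original sentences list
    print(f"Most similar to sentence 0: sentence {best}, score {scores.max():.2f}")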
So clearly it's working; it's scoring high similarity. You can play around with this and test multiple different words and sentences and just see how it works. But that's the easy way of putting all this together. I think it's really cool that we can do that so easily, and I don't think there's really anything else to say about it. So thank you for watching, and I'll see you in the next one.