
Sentence Similarity With Sentence-Transformers in Python


Chapters

0:00
2:00 initialize a sentence transformer model
2:40 create our sentence vectors or sentence embeddings
4:14 calculate our cosine similarity

Transcript

Hi, welcome to this video on using the sentence-transformers library to compare similarity between different sentences. This is going to be a pretty short video; I'm not going to go really in depth, I'm just going to show you how to actually use the library. Now, if you do want to go into a little more depth, I have another video that I'll be releasing just before this one, and that will go into what is actually happening here: how the BERT model that we'll be using is actually creating those embeddings, and then how we're actually calculating similarity from them.

So if you're interested in that, go check it out. Otherwise, if you just want to get a quick similarity score between two sentences, this is probably the way to go. So we have these six sentences up here, including this one, "Three years later, the coffin was still full of Jello", and this one, "The person box was packed with jelly many dozens of months later".

They're saying the same thing, but the second one is saying it in a way that most of us wouldn't normally say it. Instead of saying coffin, we're saying person box. Instead of Jello, we're saying jelly, which I think is kind of normal actually. And instead of years, we're saying dozens of months.

So they're not really sharing the same words, but we're going to see that we can actually find that these two sentences are the most similar out of all of these. So we're taking those, and we're going to be importing the sentence-transformers library, and from it we want to import the SentenceTransformer class.
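In code, that starting point looks roughly like this (only the two sentences quoted above are from the video; the full list on screen has six):

    # only these two sentences are quoted in the video; the real list has six
    sentences = [
        "Three years later, the coffin was still full of Jello.",
        "The person box was packed with jelly many dozens of months later.",
    ]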

And then from that, we want to initialize a sentence transformer model. So we write SentenceTransformer, and then in here, we're going to be using this model that I've already defined a model name for, which is the bert-base-nli-mean-tokens model. So we initialize that. I need to rerun that.
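A minimal sketch of that initialization; the first run downloads the model weights from HuggingFace:

    from sentence_transformers import SentenceTransformer

    # the BERT-based model named above
    model = SentenceTransformer('bert-base-nli-mean-tokens')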

So we have our model, and I'll just show you really quickly: this model is coming from the HuggingFace transformers library behind sentence-transformers, so this is the actual model we are using. Now, the first thing we do here is create our sentence vectors, or sentence embeddings. So we'll write sentence_vecs = model.encode.

And all we need to do here is pass our sentences. So we can pass a single sentence or a list of sentences; it's completely fine. And then let's just have a quick look at what we have here. So you see that we have this big array, and if we look at the shape, we see that we have a 6 by 768 array.

So the 6 refers to our six sentences here, and the 768 refers to the hidden state size within the BERT model that we're using. So each one of these sentences is now being represented by a dense vector containing 768 values, and that means that we are ready to take those and compare similarity between them.
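A sketch of that encoding step, using just the two quoted sentences (with all six, the shape would be (6, 768) as described):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentences = [
        "Three years later, the coffin was still full of Jello.",
        "The person box was packed with jelly many dozens of months later.",
    ]

    # encode accepts a single sentence or a list of sentences, and returns
    # one dense 768-value vector per sentence as a NumPy array
    sentence_vecs = model.encode(sentences)
    print(sentence_vecs.shape)  # (2, 768) here; (6, 768) with six sentences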

So to do that, we're going to be using the sklearn implementation of cosine similarity, which we can import like this: from sklearn.metrics.pairwise, we import cosine_similarity. And to calculate our cosine similarity, all we do is take that function, and inside here we pass our first sentence.

So this is "Three years later, the coffin was still full of Jello." We want to pass that sentence's vector, which is just at index zero of our sentence_vecs array. But by indexing it like that, we are extracting that single array value. So if we just have a look at this, you see that we have almost like a list of lists here.

If we just extract this, we only get a list. So what we want to do is actually keep that inside a list; otherwise, we'll get a dimension error. And then we pass sentence_vecs from one onwards, so this will be the remaining sentences. Okay, so let's take these, or let's just bring them down here.
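Putting that together, a minimal sketch of the similarity calculation under the same assumptions as above:

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentences = [
        "Three years later, the coffin was still full of Jello.",
        "The person box was packed with jelly many dozens of months later.",
    ]
    sentence_vecs = model.encode(sentences)

    # sentence_vecs[0] alone is 1-D; wrapping it in a list keeps it 2-D,
    # which is the shape cosine_similarity expects
    scores = cosine_similarity([sentence_vecs[0]], sentence_vecs[1:])
    print(scores)  # one similarity score per remaining sentence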

Calculate this, and we can see that our highest similarity, by quite a bit, is this 0.72. Now, that means that between this sentence and this sentence, we have a similarity score of 0.72. So clearly it's working; it's scoring high similarity. And you can play around with this, test multiple different words and sentences, and just see how it works.

But that's the easy way of putting all this together. So I think it's really cool that we can do that so easily, but I don't think there's really anything else to say about it. So thank you for watching, and I'll see you in the next one.