
3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)


Chapters

0:00 Intro
1:37 TF-IDF
11:44 BM25
20:30 SBERT

Transcript

We're going to cover three different vector-based similarity search methods. We're going to explain how they work and try and get an intuition for why they work. And we're also going to actually go ahead and implement each of these in Python. Now for TF-IDF and BM25, they're both pretty similar.

And BM25 is actually an improved version of TF-IDF. We can classify both of these as sparse vector methods. So these both use sparse vectors, which essentially means vectors with a lot of zeros and just the occasional value in there. And then our bottom one down here, which is SentenceBERT, is an example of a dense vector method.

Now dense vectors are quite interesting as they allow us to consider the semantics or the meaning behind what we are saying, rather than just the syntax and the words that we're using. So that's pretty interesting. But all of these can work better than the others in different situations. I've had dense representations of language in some cases work worse than sparse representations, and of course, the other way around as well.

But we'll have a look and we'll cover all of those. So let's get started with TF-IDF. Now, TF-IDF consists of two different components, as you can see from the name TF-IDF. The first of those components is the TF, which is the term frequency. And term frequency does what it says on the tin.

It looks at a sentence or a paragraph and given a certain query, which is the Q here, it will tell you, compared to the length of that document, how frequent your query is. So what we have here is this D, which is our document. So we're saying, okay, what is the frequency of the query Q in our document D?

Then on the bottom down here, we're saying, what is the frequency of all terms in our document D? So if we were to calculate this, we would get one, because our query, in this case 'bananas', appears only once, divided by 18, which is the total number of terms in that document, and that gives us 0.056.
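Here's a minimal sketch of that term-frequency calculation in Python; the document below is an illustrative 18-token sentence, not necessarily the exact one used in the video.

```python
# Term frequency: how often the query term appears in the document,
# divided by the total number of terms in that document.
def term_frequency(query: str, document: str) -> float:
    tokens = document.lower().split()
    return tokens.count(query.lower()) / len(tokens)

# Illustrative 18-token document containing "bananas" once.
doc = ("there is an art to getting your way and throwing "
       "bananas on to the street is not it")

print(term_frequency("bananas", doc))  # 1 / 18 ≈ 0.056
```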

So that's our term frequency, which is half of TF-IDF. We still have IDF. So what is IDF? How do we calculate that? IDF is the inverse document frequency, and it's calculated for a term across all of our documents. And when we say document here, we mean it can be a sentence or a paragraph.

Essentially what you see here, we have A, B, C, and each one of those is a document. In this case, each one of them is a sentence; it could be a paragraph, or it could be a mix of different things. With these sentences, what we're saying is that the inverse document frequency is the log of the number of documents, so in this case three, over the number of documents that contain the query 'is', the word 'is', which in this case is all of them.

So we have all three documents there. And so the ratio we get from that is the value one. Now, don't forget that there's a log here, so the result isn't actually one: it's log(1), which is equal to zero.

So 'is' is a very common word, and because it's so common, we literally do not care about it. And this IDF is multiplied by our term frequency, so we'd have TF down here. So if the term we were searching for was 'is', then all of these sentences would get a value of zero, which might seem counterintuitive. But if your query is something like 'is the best city in the forest', from the top sentence there, you wouldn't want to be prioritizing the word 'is', or 'the', or 'in', because those words are everywhere.

So we don't really care about those words; we care much more about the rarer, more unique words. And that's what you get down here. So we have a query for 'forest', and the number of documents is exactly the same, still three. And how many documents contain the word 'forest'? Only one of them, A at the top.

So we would have log(3). And what is log(3)? To base 10, it's about 0.48. So then we multiply that by TF and we get a larger number than we would for 'is'. So that is where the IDF, or inverse document frequency, comes in. It essentially gives us a multiplier for less common words, because chances are, they're more relevant to our query.
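A minimal sketch of that IDF calculation; the three sentences here are illustrative stand-ins for documents A, B and C from the video.

```python
import math

# Inverse document frequency: log of (number of documents / number of
# documents containing the query term). Sentences are illustrative.
docs = [
    "purple is the best city in the forest",   # a (assumed wording)
    "it is not often you find soggy bananas",  # b (assumed wording)
    "there is an art to getting your way",     # c (assumed wording)
]

def inverse_document_frequency(query: str, documents: list[str]) -> float:
    n_q = sum(1 for d in documents if query in d.split())
    return math.log10(len(documents) / n_q)

print(inverse_document_frequency("is", docs))      # log10(3/3) = 0.0
print(inverse_document_frequency("forest", docs))  # log10(3/1) ≈ 0.48
```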

Now, what we can see here is just a walkthrough of what we just looked at. So we're calculating the TF for each document and the word 'is' up here. We calculate the IDF and then we multiply those two together. And because 'is' is everywhere, the IDF is equal to zero, and therefore the outcome is always zero, because it's just not a relevant word that we care about.

Now, on the other hand, for the word 'forest', when we calculate the TF-IDF for each of these, the TF for B and C is zero because the term just doesn't appear in those documents, and the IDF is higher because 'forest' is a rarer word.

So it's 0.48. Now the outcome of that is that only this value here, the TF-IDF for document A, is not zero. And that's essentially how TF-IDF works. Now let's jump across to Python and have a quick look at how we'd write our code.

So we have those three sentences again up here. So let's run that. And then just here, we can see that we import NumPy, as we're using NumPy here, and we merge all the documents into a list of lists. The reason we do this is just so we can more easily calculate the term N, the number of documents, and also the term n_q, the number of documents containing our query, which you can see just here in the IDF section.

Now, once we do get to TF-IDF here, we calculate the IDF, which I just explained anyway, and then also the term frequency, which is just the number of matches for our specific word in a given sentence, relative to its length. So let's run that and let's have a look at the TF-IDF for a word.

Let's go with 'is', like we did before, in document A. What do we get? We get zero, as we calculated before. And how about 'forest'? Here we get 0.06, obviously with more precision here, but we got the same 0.06 on the other screen before. Now that's TF-IDF, reasonably straightforward.
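The notebook itself isn't reproduced in this transcript, but a minimal TF-IDF function along the lines being described might look something like this; the sentence wordings and variable names are assumptions.

```python
import math

# Illustrative documents standing in for a, b and c from the video.
a = "purple is the best city in the forest".split()
b = "it is not often you find soggy bananas on the street".split()
c = "there is an art to getting your way".split()
docs = [a, b, c]

def tfidf(word: str, sentence: list[str]) -> float:
    # term frequency: matches for the word, divided by the sentence length
    tf = sentence.count(word) / len(sentence)
    # inverse document frequency: log10 of (N / number of docs containing word)
    n_q = sum(1 for doc in docs if word in doc)
    idf = math.log10(len(docs) / n_q) if n_q else 0.0
    return tf * idf

print(round(tfidf("is", a), 2))      # 0.0
print(round(tfidf("forest", a), 2))  # ≈ 0.06
```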

However, I did say that these are vectors, right? These are sparse vectors. Obviously this doesn't look much like a vector. So how do we turn this into a vector? Well, what we do is we take our vocab, which is all of the words that we have across all of our documents A, B and C.

So we create our vocab, and all we want to do is take a set of our three documents, A plus B plus C, and this is just going to create a set containing all of the words across all of our documents. So let's have a quick look at what we get from that.

So that's our vocab, and it's just every single word that we have in there. Now to create our vector, what we want to do is take that vocab and mirror it in a vector. So for every word here, we're going to calculate the TF-IDF for each document and store that in a list, which creates our TF-IDF vector.

So let me show you how that works. Let's first initialize our vector, vector A, and what we're going to do is say: for word in vocab, vectorA.append, and in there we're going to call tfidf, passing in our word, which we define as word, and our sentence.

In this case, we're doing it for sentence A, so we write sentence A like that. And we can do this for each of our vectors; let's just do A and B for now, so B here, here, and here. Okay, let's have a look at what we get. So we have vector A, and we see that we get essentially a vector.

Now, we don't have that many documents or that many words here, so we're mostly just getting the same number repeated, which means those words appear in only one document, and that document is A. Now, let's have a look at B, and we see we get a few different values here.
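Continuing the TF-IDF sketch from above, the vector-building step being described might look something like this (variable names are assumptions):

```python
# Build sparse TF-IDF vectors over the shared vocabulary.
# Reuses a, b, c and tfidf() from the sketch above.
vocab = set(a + b + c)

vector_a = [tfidf(word, a) for word in vocab]
vector_b = [tfidf(word, b) for word in vocab]

# Mostly zeros, with the odd non-zero entry for words that appear
# in that document -- a sparse vector.
print(vector_a)
print(vector_b)
```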

And that's our TF-IDF vector; that's how we build it. So that's TF-IDF. Let's move on to BM25. Now, BM25 is an optimized version of TF-IDF. One of the problems with TF-IDF is that as the frequency of the query term found in a document increases, the score increases linearly.

So imagine you have an article, and in that 1,000-word article, it has the word "dog" maybe 10 times. Now, there's a good chance that that article is talking about dogs, right? Now, if you double the number of words that are "dog" to 20, the TF IDF score for that document will double as well.

But if you think about it logically: is a document or article that has the word "dog" 10 times half as relevant as the same article that has the word "dog" 20 times? Probably not. Maybe it is slightly less relevant, but not by that much. And that's where BM25 comes in.

It almost normalizes that value, and we'll visualize that soon as well. But let's explain the formula, because it looks pretty horrific. Here at the top, we have something very similar to TF-IDF: we essentially have our term frequency here, and then we have our IDF here. Now, in our term frequency, we can see that we have these two values here.

And this is the frequency of the query in the document, the term frequency, and it's pulled straight from TF-IDF. But then we have a few new terms around here. So we have K, which you can also see here as well. Now, K and B here are adjustable parameters. K would typically default to around 1.25, and B to around 0.75.

And we can modify these based on our use case to optimize the algorithm. Now, another new one that we have over here is this D average. Now, this D here, that is the current document length. So I'll call it D length. And then over here, we have the average document length.

So that's over all of our documents: it's the total length of all of the documents divided by the number of documents, N. Now, that is the TF component and how it's been modified. Then let's have a look at the IDF component. In the IDF component, we see these terms here and here, which are exactly the same as the two terms that we have here.

So it's the number of documents versus the number of documents that contain that query. And then over here, we just have a few constant values. So although the BM25 algorithm looks pretty complex, it's not that much different from the TF-IDF algorithm; the full formula is written out below, and then we'll go ahead and implement it in Python.
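For reference, here's the standard BM25 scoring function being described, written out. This is a reconstruction from the description above rather than a formula quoted from the video, with k and b as the adjustable parameters, |D| the length of the current document, avgdl the average document length, N the number of documents and n_q the number of documents containing the query term.

```latex
\text{BM25}(D, q) =
  \underbrace{\ln\!\left(\frac{N - n_q + 0.5}{n_q + 0.5} + 1\right)}_{\text{IDF component}}
  \cdot
  \underbrace{\frac{f(q, D)\,(k + 1)}
       {f(q, D) + k\left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}}_{\text{TF component}}
```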

Now, over in Python, we have these sentences or documents here. We're going to add all these together in docs, just makes it a little bit easier for us. And we still have this TF IDF function. Now, I want to look at this and compare it to what we would do for BM25.

So the first thing is, let me just make this a little more readable. This sentence.count here, we turn it into a frequency variable. Okay, so change that to frequency, like so. Now, we do the same here, as we still want our frequency. And then this is our term frequency component for TF-IDF and BM25.

Now, let's have a look at these. So we have K and B. Those are our adjustable parameters. The only sort of new thing we have here really is the average DL, which is the average length of all documents. So how do we calculate that? We just calculate it outside the function, because this is for all documents.

It doesn't matter which sentence or which word we're looking at. It's going to be the same for each one. So we just sum the length of each sentence for every sentence in our docs. In fact, this here, we can change that to docs, like so. So we change that.

Okay. Now, let's just run it quickly and see what we get. So, BM25 for the word 'purple': in sentence B there is no 'purple', so we would expect zero, which is what we get. So we know this is at least working, right? Let's do the same for sentence A, and we should get a score.

So 1.76, which is what we'd expect, so that looks good. Now, the other bit that is slightly different, and this is really quite simple, is this bit here: the IDF component. In the IDF up here, what do we do? We are just taking the length of the documents list and dividing by this sum here.

And that's exactly what we do here. So this sum here, we could rename to n_q if we want to match the formula below, like so. And that's the only real difference. The other part is that up here we're using log to the base 10, and down here we're using the natural logarithm.

So there's also that small difference as well. But otherwise, we're just adding the 0.5s and the 1s and rearranging the n and the n_q, and of course, this here would be n. Putting those pieces together, a sketch of the full BM25 function is shown below.
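This is a minimal, self-contained sketch of a BM25 scoring function along the lines just described; the sentence wordings, variable names and parameter values are assumptions, not the exact notebook code.

```python
import math

# Illustrative documents standing in for the sentences in the notebook.
a = "purple is the best city in the forest".split()
b = "it is not often you find soggy bananas on the street".split()
c = "there is an art to getting your way".split()
docs = [a, b, c]

# Adjustable parameters and the average document length,
# computed once over all documents.
k, b_param = 1.25, 0.75
avgdl = sum(len(sentence) for sentence in docs) / len(docs)

def bm25(word: str, sentence: list[str]) -> float:
    freq = sentence.count(word)
    tf = (freq * (k + 1)) / (freq + k * (1 - b_param + b_param * len(sentence) / avgdl))
    n_q = sum(1 for doc in docs if word in doc)
    idf = math.log((len(docs) - n_q + 0.5) / (n_q + 0.5) + 1)
    return tf * idf

print(bm25("purple", b))  # 0.0 -- 'purple' does not appear in b
print(bm25("purple", a))  # a positive score, since 'purple' appears in a
```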

Now, that's BM25 and TF-IDF, but what does the difference actually look like? When we compare these two algorithms, the TF-IDF score increases linearly with the number of matching terms, so it goes up in a straight line. OK, so that's TF-IDF. Now, BM25 is slightly different. Instead of going up like that, it does something which looks more like a curve: it increases quickly at first, and then it kind of levels off. So whether you have four matching words for BM25 or you have eight, eight will come out as slightly more relevant.

The score will show that it's slightly more relevant, but not by that much. For TF-IDF, on the other hand, the score is doubled. Depending on your use case, maybe that is more along the lines of what you want, but I do believe that in most cases BM25 is probably more realistic.
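To see that saturation numerically, here's a tiny sketch that computes just the BM25 term-frequency component at a fixed document length, using the parameter values mentioned above:

```python
# BM25's term-frequency component saturates as the raw count grows,
# whereas a raw-count score would simply double when the count doubles.
k, b = 1.25, 0.75

def bm25_tf(freq: int, doc_len: int, avgdl: float) -> float:
    return (freq * (k + 1)) / (freq + k * (1 - b + b * doc_len / avgdl))

for freq in (1, 5, 10, 20):
    print(freq, round(bm25_tf(freq, doc_len=1000, avgdl=1000), 3))
# 1 -> 1.0, 5 -> 1.8, 10 -> 2.0, 20 -> ~2.118: doubling the count from
# 10 to 20 adds only about 6% to the score.
```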

So that's those two; let's move on to our final one, dense vectors with SentenceBERT. Now, our final method is SBERT. SBERT uses something called dense representations of language. And what that means is that rather than having a vector like we saw with TF-IDF, and which is the same for BM25, where it has a lot of zeros and then the odd value here and there, which is called a sparse vector, we have something which has many more values in there.

So there's essentially more information crammed into a smaller space. And the effect of this is that you can represent language in a more meaningful manner. So the word, for example, 'hi' would be in a very similar space to the word 'hello'. And so would words like 'days of the week', so 'Monday', 'Tuesday' and so on.

And the way SentenceBERT works is that we have a transformer model. We have a BERT, and our words or our query come in through here and get processed by many BERT encoder layers, and then we get this dense vector. Now, as well as our query Q, the same goes for our documents.

They are also processed through the same encoder network, and that produces dense vectors. Now, once we have both of those dense vectors, so this one can be Q and then we also have D, we use cosine similarity between the two of them to calculate how similar they are: how close to each other they are, the angle between them.

And that works pretty much like this. So we have over here, we have sort of a lone vector. And then over here, we have two vectors which are much more similar. Or at least they share the same direction. And these two have a much smaller angle between them than either of those two do with this other vector, which is all the way over there.

And that's essentially how cosine similarity works. It just finds the angle between two vectors, and where the angle is smaller, they are more similar; where the angle is greater, they are less similar. Now, we're not going to implement SBERT itself from scratch, because that would take a long time and be pretty hard to do, to be honest.
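Cosine similarity on its own is simple, though; here's a minimal NumPy sketch with made-up vectors, just to show the calculation:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # dot product of the two vectors, divided by the product of their magnitudes
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

q = np.array([0.2, 0.9, 0.4])    # hypothetical query vector
d1 = np.array([0.25, 0.8, 0.5])  # points in almost the same direction
d2 = np.array([0.9, -0.1, 0.2])  # points somewhere else entirely

print(cosine_similarity(q, d1))  # close to 1 -- small angle, very similar
print(cosine_similarity(q, d2))  # much lower -- large angle, less similar
```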

So what we're going to do is use the sentence-transformers library, which is a very, very good library that uses Hugging Face's transformers under the hood and has super easy implementations of sentence transformer models, which is essentially what we just saw. I was using the example of the specific one called SentenceBERT, which is also what we're going to be using here.

So we run this; we're using the bert-base-nli-mean-tokens model. And we have all of our sentences here. The first thing we need to do is initialize our model here, and then after that we encode all of our sentences with model.encode.

So here we've encoded, so what this has produced is the sentence embedding. So once the text has been processed by our BERT model, it outputs these sentence embeddings. And they are vectors that represent the full sentence or the full document that we have input. Now, once we get those out into our sentence embeddings variable there, we use the cosine similarity function.

Now, here we're just going to use scikit-learn's implementation. And because we have many sentence embeddings here, let's just have a look: we have sentence_embeddings, so let's look at its shape. And you see that we have seven sets of embeddings here.

This 768 here is the size of a single sentence embedding vector. Now, why do we have seven? Because we have seven sentences up here. So, what we're going to do here is loop through each one and calculate the cosine similarity between that sentence and all of the other sentences; a sketch of those steps is shown below.
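A minimal sketch of those steps with the sentence-transformers and scikit-learn libraries; the seven sentences are illustrative stand-ins for the ones used in the video (a few are made-up fillers).

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sentences standing in for a-g from the video.
sentences = [
    "purple is the best city in the forest",                                 # a (assumed wording)
    "there is an art to getting your way and throwing bananas on to the street is not it",  # b
    "it is not often you find soggy bananas on the street",                  # c
    "the sun is shining today",                                              # d (filler)
    "my dentist appointment was cancelled this morning",                     # e (filler)
    "pancakes with maple syrup are a great breakfast",                       # f (filler)
    "to get your way you must not bombard the road with yellow fruit",       # g (paraphrase of b)
]

model = SentenceTransformer("bert-base-nli-mean-tokens")
sentence_embeddings = model.encode(sentences)
print(sentence_embeddings.shape)  # (7, 768)

# Cosine similarity between every sentence and every other sentence.
scores = cosine_similarity(sentence_embeddings, sentence_embeddings)
print(scores.round(2))
```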

So, run that. And we will get this array, which shows us all of our scores between all of our sentences, all possible combinations. Now, that's pretty hard to look at, so let's visualize it using matplotlib. Okay, and we get this nice visual here. We can see here we have A on the left, and it has a very high similarity to the vector A.

So, I haven't filtered those out or anything, we are comparing the same vector to the same vector. And, as you would expect, they get a very good score, because they are exactly the same. They all score one. So, there's no problem there, we just ignore those ones in the middle.

Now, let's go on to the next one. So, we have these here, so C and B. Okay, those two seem to have a very high similarity, 0.72, super high. So, let's have a look. So, B and C. Now, these are the two that both talk about throwing bananas. Actually, no, B talks about throwing bananas onto the street, C talks about finding bananas on the street.

So, yeah, they're pretty similar, right? But they have a lot of the same words, so fair enough. TF-IDF and BM25 both identify those as being similar as well. So, it's good, but it doesn't blow you away. Now, how about B here and G? So, what I've done for G here is I've just taken the sentence B and swapped the words around, so that we don't really have the matching words.

So, throwing bananas, instead of throwing, I'm using the word "bombard". Instead of the word "street", I'm using "road". Instead of "bananas", I'm using "yellow fruit", right? So, it has a similar sort of meaning, but they're not using the same words. So, TF-IDF and BM25 would really struggle here. Now, let's go down and let's have a look.

So, B and G. So, we have B here, G at the end, and yeah, they have a really good score. 0.66 is the second highest behind B and C. So, that just goes to show that sentence BERT, or sentence transformers in general, don't require the same words to be used.

They rely more on the semantic meaning of those words, which is, I think, incredibly cool. So, that's it for this video. We've covered quite a lot: the sparse vector methods, TF-IDF and BM25, and also the dense vector method, SentenceBERT, or sentence transformers in general. So, I hope you've enjoyed the video.

Thank you for watching, and I'll see you in the next one.