Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)
Chapters
0:00 Intro
1:37 TF-IDF
11:44 BM25
20:30 SBERT
We're going to cover three different vector-based similarity search methods. We're going to explain how they work, try to get an intuition for why they work, and then actually go ahead and implement each of them in Python.

Now, TF-IDF and BM25 are both pretty similar; BM25 is actually an improved version of TF-IDF. We can classify both of these as sparse vector methods, which essentially means they use vectors with a lot of zeros and the occasional value in there. Then our bottom one down here, SentenceBERT, is an example of a dense vector method. Dense vectors are quite interesting, as they allow us to consider the semantics, or the meaning, behind what we are saying, rather than just the syntax and the words themselves.

But all of these can work better than the others in different situations. I've had dense representations of language in some cases work worse than sparse representations, and of course the other way around as well. But we'll have a look and cover all of those.
Now, TF-IDF consists of two different components, as you can see in the name TF-IDF. The first of those components is TF, the term frequency, and term frequency does what it says on the tin. Given a certain query, which is the q here, it looks at a sentence or a paragraph and tells you, compared to the length of that document, how frequent your query is. So what we have here is this D, which is our document, and we're saying: what is the frequency of the query q in our document D? Then on the bottom down here, we're dividing by the frequency of all terms in the document, which is just its length. So if we were to calculate this, we would get one, because our query in this case, "bananas", appears once, divided by 18, which is the total number of terms in that document. So that's our term frequency, which is half of TF-IDF.
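Written out from that description, the term frequency is:

$$
\mathrm{TF}(q, D) = \frac{f(q, D)}{|D|}
$$

where f(q, D) is the number of times the query q appears in document D and |D| is the total number of terms in D. For the example above, that's 1/18 ≈ 0.056.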
So IDF is the inverse document frequency, and it's calculated across our whole set of documents. When we say document here, we mean it can be a sentence or a paragraph. Essentially, what you see here is a, b, and c, and each one of those is a document; in this case, each one of them is a sentence. What we're saying here is that the inverse document frequency is the log of the number of documents, so in this case three, over the number of documents that contain the query "is", which in this case is all of them. So what we get from that fraction is the value one. Now, don't forget that there's a log here, so the result is not actually one: if I leave myself a little bit of space, it's log(1), which is zero. "Is" is a very common word, and because it's so common, we end up with an IDF of zero. And this IDF is multiplied by our term frequency.
So if the term we were searching for was "is", then all of these sequences would get a value of zero. That might seem counterintuitive, but if your query is something like "is this the best city in the forest", like the top sentence there, you wouldn't want to prioritize the words "is", "the", or "in", because those words are everywhere. We would care much more about the unique words.
So now we have a query for "forest". The number of documents is exactly the same, still three. And how many documents contain the word "forest"? Just one. So the IDF comes out greater than zero, and when we multiply that by the TF, we get a larger number than we would for "is". That is where the IDF, or inverse document frequency, comes in: it essentially gives us a multiplier for less common words, because chances are they are more relevant to what we're searching for.
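Written out, the inverse document frequency being described is:

$$
\mathrm{IDF}(q) = \log_{10}\left(\frac{N}{N_q}\right)
$$

where N is the number of documents and N_q is the number of documents containing q. For "is", that's log₁₀(3/3) = 0; for "forest", it's log₁₀(3/1) ≈ 0.48.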
Now, what we can see here is just a work-through of what we just looked at. We're calculating the TF of the word "is" for each document up here, we calculate the IDF, and then we multiply those two together. And because "is" is everywhere, the IDF is equal to zero, and therefore the outcome is always zero; it's just not a relevant word that we care about. On the other hand, for the word "forest", when we calculate the TF-IDF for each of these, the TF for b and c is zero because the term just doesn't appear in those documents, and the IDF is higher because only one document contains it. The outcome of that is that only this value here, the TF-IDF for document a, is not zero.
Let's have a quick look at how we'd write our code. So we have those three sentences again up here, and just here you can see we import NumPy, which we're using here. Now, the reason we merge the sentences into a list of lists is just so we can more easily calculate the term N, the number of documents, and also the term n_q, the number of documents that contain our query. Once we do get to TF-IDF here, we calculate the IDF, which I just explained anyway, and then also the term frequency, which is just the number of matches for our specific word in a given sentence, relative to the sentence length. Let's run it like we did before, and what do we get? We get 0.06; obviously it's more precise here, but it lines up with what we calculated by hand before. Now that's TF-IDF, reasonably straightforward.
However, I did say that these are vectors, right? Obviously, a single score doesn't look much like a vector. To build vectors, we first need a vocab: all of the words that we have across all of our documents, so a, b, and c. So we create our vocab, and all we do is take a set of our three documents, a plus b plus c, and this just creates a set containing all of the words across all of our documents. Let's have a quick look at what we get from that. So, our vocab: this is just every single word that we have in there. Now, to create our vector, what we want to do is take that vocab, and for every word in it, calculate the TF-IDF for each document and store that in a list, which creates our TF-IDF vector. So, vector a: what we're going to do is say for word in vocab, vector_a.append, and then here we call tfidf with our word; in this case we're doing it for sentence a, so we write sentence a like that. So we have vector a, and we see that we get essentially a vector. Now, we don't have that many documents or that many words here, so we're mostly seeing the same value repeated, which just means each of those words appears once in this document, with zeros everywhere else. Now, let's have a look at b, and we see we get a few different values here.
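The notebook itself isn't captured in the transcript, but a minimal sketch of the code being described might look like this (the example sentences are stand-ins, not necessarily the exact ones used in the video):

```python
import numpy as np

# stand-in example documents, pre-split into lists of terms
a = "purple is the best city in the forest".split()
b = ("there is an art to getting your way and throwing "
     "bananas on to the street is not it").split()
c = "it is not often you find soggy bananas on the street".split()
docs = [a, b, c]

def tfidf(word, sentence):
    # term frequency: matches of `word` relative to the sentence length
    tf = sentence.count(word) / len(sentence)
    # inverse document frequency: log of N over the number of
    # documents that contain `word` (n_q)
    n_q = sum(1 for doc in docs if word in doc)
    idf = np.log10(len(docs) / n_q) if n_q else 0.0
    return tf * idf

# build the vocab: a set of every word across all documents
vocab = set(a + b + c)

# the TF-IDF vector for document a: one score per vocab word
vector_a = [tfidf(word, a) for word in vocab]
```

The set makes the ordering of the vocab arbitrary, which is fine here, because every document's vector is built against the same ordering.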
So one of the problems with TF-IDF is that as the frequency of query terms found in a document increases, the score increases linearly. Imagine you have a 1,000-word article, and in that article the word "dog" appears 10 times. Now, there's a good chance that that article is talking about dogs, right? But if you double the number of occurrences of "dog" to 20, the TF-IDF score for that document will double as well. And if you think about it logically, is an article that has the word "dog" 10 times really half as relevant as the same article with the word 20 times? Maybe it is slightly less relevant, but not by that much.
This is where BM25 comes in, so let's explain it, because the formula looks pretty horrific. Here at the top, we have something very similar to TF-IDF: we have, essentially, our term frequency here, and then we have our IDF here. In our term frequency, we can see that we have these two values here, and this is the frequency of the query term in the document. But then we have a few new terms around here. We have k, which you can also see here as well. Now, k and b here are adjustable parameters: k would typically be around 1.25 by default, and b around 0.75, and we can modify these based on our use case to optimize the algorithm. Another new one that we have over here is this D average. This D here is the current document length, and over here we have the average document length: all of the documents' lengths summed, divided by the number of documents. So that is the TF component and how it's been modified. Then let's have a look at the IDF component. In the IDF component, we see these terms here and here, which are exactly the same as the two terms we had before: the number of documents and the number of documents that contain the query. And then over here, we just have a few constant values. So although the BM25 algorithm looks pretty complex, it's not that much different from the TF-IDF algorithm.
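For reference, the standard formulation of the formula being described is:

$$
\mathrm{BM25}(D, q) = \frac{f(q, D)\,(k + 1)}{f(q, D) + k\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)} \cdot \ln\left(\frac{N - N_q + 0.5}{N_q + 0.5} + 1\right)
$$

where f(q, D) is the frequency of q in D, |D| is the document length, avgdl is the average document length, N is the number of documents, and N_q is the number of documents containing q.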
So let's go ahead and implement that in Python. Over in Python, we have these sentences, or documents, here, and we're going to add them all together in docs, which just makes things a little easier for us. Now, I want to look at the TF-IDF function and compare it to what we do for BM25. First, let me make this a little more readable. So, sentence.count here: that gives us the frequency. We do the same for BM25; we still want our frequency, and this is the term frequency component of both TF-IDF and BM25. For BM25 we also have k and b, our adjustable parameters. The only really new thing we have here is avgdl, the average length of all documents. How do we calculate that? We just calculate it outside the function, because it's the same no matter which sentence or which word we're looking at: we sum the length of each sentence for every sentence in our docs. In fact, this here, we can change that to docs, like so. Now, let's just run it quickly and see what we get. So, BM25 for "purple": in sentence b there is no "purple", so we would expect zero, which is what we get. Let's do the same for sentence a, and we should get a score: 1.76, which is what we'd expect. So that looks good.
Now, the other bit that is slightly different, and this is really quite simple, is this bit here: the IDF component. For the TF-IDF version up here, what do we do? We just take the number of documents and divide it by this sum here, which counts the documents containing the query; we could rewrite that, if we want to match the BM25 version below, as n_q, like so. The other part is that up here we're using log to the base 10, and down here we're using the natural logarithm, so there's that small difference as well. But otherwise, we're just adding the 0.5s and the 1s and rearranging the N and the n_q.
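Again, the notebook isn't shown here, but a minimal sketch of the BM25 function being described, reusing the stand-in docs list from the TF-IDF sketch above, could be:

```python
from math import log

N = len(docs)
# average document length, computed once outside the function
avgdl = sum(len(sentence) for sentence in docs) / len(docs)

def bm25(word, sentence, k=1.25, b=0.75):
    # term frequency component, dampened by k and normalized
    # by the document's length relative to the average
    freq = sentence.count(word)
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * len(sentence) / avgdl))
    # IDF component, using the natural log as in the video
    n_q = sum(1 for doc in docs if word in doc)
    idf = log(((N - n_q + 0.5) / (n_q + 0.5)) + 1)
    return tf * idf

print(bm25("purple", docs[1]))  # sentence b: no "purple", so 0.0
print(bm25("purple", docs[0]))  # sentence a: a positive score
```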
Now, that's BM25 and TF-IDF, but what does the difference actually look like? When we compare the two algorithms: the score for TF-IDF increases linearly with the number of matching terms, so it goes up like this. BM25 is slightly different: instead of going up like that, it does something that looks more like this. It increases quickly at first, and then it kind of levels off. So for BM25, if a document contains a term four times versus eight times, the score will show the eight-occurrence document as slightly more relevant, but not by much. For TF-IDF, by contrast, the relevance is doubled. Depending on your use case, maybe that is more along the lines of what you want, but I do believe that in most cases BM25 is probably more realistic.
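As a quick illustration of that leveling off, here is the BM25 term frequency component on its own as the term count grows (the document length and average here are just demonstration values):

```python
# TF component of BM25 for a growing term count, with the document
# length equal to the average so the length normalization cancels out
k, b = 1.25, 0.75
doc_len = avg_len = 100
for freq in [1, 2, 4, 8, 16]:
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * doc_len / avg_len))
    print(freq, round(tf, 2))
# prints 1.0, 1.38, 1.71, 1.95, 2.09: doubling the count no longer
# doubles the score, and the value approaches k + 1 = 2.25
```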
So that's those two; let's move on to our final one, dense vectors with SentenceBERT. SentenceBERT uses something called dense representations of language. What that means is that rather than having a vector like we saw with TF-IDF, and which is the same for BM25, with a lot of zeros and the odd value here and there, which is called a sparse vector, we have something with many more values in it. There's essentially more information crammed into a smaller space, and the effect of this is that you can represent language in a more meaningful manner. So the word "hi", for example, would sit in a very similar space to the word "hello", and so would related words like the days of the week: "Monday", "Tuesday", and so on.
The way SentenceBERT works is that we have a transformer model, BERT, and our words, our query, come in through here, are processed by many BERT encoding layers, and then we get this dense vector. Now, as well as our query q, so are our documents: they are also processed through the same encoder network, and that produces dense vectors too. Once we have both of those dense vectors, so this can be q and then we also have D, we use cosine similarity between them to calculate how similar they are: how close to each other they are, the angle between them. And that works pretty much like this: over here we have a sort of lone vector, and then over here we have two vectors which are much more similar, and these two have a much smaller angle between them than either of them does with that other vector. That's essentially how cosine similarity works. It just finds the angle between two vectors: where the angle is smaller, they are more similar, and where the angle is greater, they are less similar.
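Written out, the cosine similarity between two vectors u and v is:

$$
\cos\theta = \frac{u \cdot v}{\lVert u \rVert\,\lVert v \rVert}
$$

which is 1 when the vectors point in exactly the same direction and falls as the angle between them grows.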
Now, we're not going to implement that from scratch, because it would take a long time and be pretty hard to do, to be honest. So what we're going to do is use the sentence-transformers library, which is a very, very good library that uses Hugging Face's transformers under the hood and has super easy implementations of sentence transformers, which is essentially what we just saw; I was using the example of this specific one called SentenceBERT, which is also what we're going to be using here. So we run this; we're using this bert-base-nli-mean-tokens model. The first thing we need to do is initialize our model here, and then after that we encode all of our sentences with model.encode. What this produces is the sentence embeddings: once the text has been processed by our BERT model, it outputs these sentence embeddings, and they are vectors that represent the full sentence or the full document that we have input.
Now, once we get those out into our sentence_embeddings variable there, we can compare them with cosine similarity; here we're just going to use scikit-learn's implementation. And because we have many sentence embeddings here, we want to compare each one against all of the others. So if we take sentence_embeddings and have a look at its shape, you see that we have seven sets of embeddings, and this 768 is the size of a single sentence embedding vector. Why do we have seven? Because we have seven sentences up here. So what we're going to do is loop through each one and calculate the cosine similarity between that sentence and all of the other sentences.
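A minimal sketch of those steps, assuming the sentence-transformers and scikit-learn packages are installed (the sentences are stand-ins for the video's seven examples, a through g):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# stand-in sentences; the video uses seven of these
sentences = [
    "purple is the best city in the forest",
    "there is an art to getting your way and throwing "
    "bananas on to the street is not it",
    "it is not often you find soggy bananas on the street",
    "to get your way you must not bombard the road with yellow fruit",
]

# initialize the model and encode every sentence into a dense vector
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(sentences)  # shape (len(sentences), 768)

# cosine similarity of each sentence embedding against all the others
scores = np.zeros((len(sentences), len(sentences)))
for i, embedding in enumerate(sentence_embeddings):
    scores[i] = cosine_similarity([embedding], sentence_embeddings)[0]
print(scores)
```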
And we get this array, which shows us all of the scores between all of our sentences. Now, that's pretty hard to read, so let's visualize it using matplotlib. We can see here we have A on the left, and it has a very high similarity with the vector A; I haven't filtered those out or anything, so we are comparing each vector against itself as well. As you would expect, those get a very high score, because they are exactly the same, so there's no problem there; we just ignore the ones down the diagonal in the middle. Okay, these two seem to have a very high similarity, 0.72, super high. These are the two that both talk about throwing bananas. Actually, no: B talks about throwing bananas onto the street, and C talks about finding bananas on the street. But they have a lot of the same words, so fair enough; TF-IDF and BM25 would both identify those as being similar as well.
Now, what I've done for G here is take sentence B and swap the words around, so that we don't really have matching words. Instead of "throwing" bananas, I'm using the word "bombard"; instead of "street", I'm using "road"; and instead of "bananas", I'm using "yellow fruit". So it has a similar sort of meaning, but they're not using the same words, and TF-IDF and BM25 would really struggle here. So we have B here and G at the end, and yeah, they get a really good score. That just goes to show that SentenceBERT, or sentence transformers in general, don't require the same words to be used; they rely more on the semantic meaning of those words, which I think is incredibly cool. So that's our three methods: the sparse vector methods TF-IDF and BM25, and the dense vector method SentenceBERT, or sentence transformers in general. Thank you for watching, and I'll see you in the next one.