back to index

3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)


Chapters

0:0 Intro
1:37 TF-IDF
11:44 BM25
20:30 SBERT

Whisper Transcript | Transcript Only Page

00:00:00.000 | We're going to cover three different vector-based similarity search methods.
00:00:06.180 | We're going to explain how they work and try and get an intuition for why they work.
00:00:12.820 | And we're also going to actually go ahead and implement each of these in Python.
00:00:21.220 | Now for TF-IDF and BM25, they're both pretty similar.
00:00:29.640 | And BM25 is actually an improved version of TF-IDF.
00:00:35.180 | And we can classify both of these as being sparse vector methods.
00:00:43.820 | So these both use sparse vectors, which essentially means that vectors with a lot
00:00:49.660 | of zeros in with the occasional value in there.
00:00:53.460 | And then our bottom one down here, which is sentencebert, that's an example
00:00:59.260 | of a dense vector.
00:01:01.280 | Now dense vectors are quite interesting as they allow us to consider the semantics or
00:01:09.400 | the meaning behind what we are saying, rather than just the syntax and the words
00:01:15.920 | that we're using.
00:01:16.560 | So that's pretty interesting.
00:01:18.240 | But all of these can work better than the others in different situations.
00:01:23.560 | I've had dense representations of language in some cases work worse than sparse
00:01:30.740 | representations, and of course, the other way around as well.
00:01:33.580 | But we'll have a look and we'll cover all of those.
00:01:36.420 | So let's get started with TF-IDF.
00:01:40.860 | Now, TF-IDF consists of two different components, as you can see the name TF-IDF.
00:01:50.900 | Now, the first of those components is the TF, which is the term frequency.
00:01:56.400 | And term frequency does what it says on the tin.
00:02:01.760 | It looks at a sentence or a paragraph and given a certain query, which is the Q here,
00:02:11.480 | it will tell you, compared to the length of that document, how frequent your query is.
00:02:19.160 | So what we have here is this D, which is our document.
00:02:22.980 | So we're saying, okay, what is the frequency of the query Q in our document D?
00:02:31.260 | Then on the bottom down here, we're saying, what is the frequency of all
00:02:37.180 | terms in our document D?
00:02:42.380 | So if we were to calculate this, we would get one, because we only have our query in
00:02:51.620 | this case is bananas, divided by 18, which is the total number of terms in that
00:02:59.120 | document, and that will give us 0.056.
00:03:06.160 | So that's our term frequency, which is half of TF-IDF.
00:03:12.080 | We still have IDF.
00:03:13.000 | So what is IDF?
00:03:14.960 | How do we calculate that?
00:03:16.800 | So IDF is the inverse document frequency, and it's calculated for each document.
00:03:25.760 | So when we say document here, we mean it can be a sentence or a paragraph.
00:03:31.360 | Essentially what you see here, we have A, B, C, each one of those are documents.
00:03:35.320 | In this case, each one of them is a sentence.
00:03:39.080 | It could be a paragraph.
00:03:40.040 | It could be a mix of different things.
00:03:41.760 | In this case, we have these sentences, and what we're saying here is the inverse
00:03:48.400 | document frequency is the log of the number of documents, so in this case, we
00:03:58.080 | would have three, over the number of documents that contain the query is, or
00:04:06.680 | the word is, which in this case is all of them.
00:04:10.200 | So we have all three documents there.
00:04:14.120 | And so what we get from that is the value one.
00:04:19.600 | Now, don't forget that this is a log here, so it's not actually one.
00:04:28.120 | It's, if I left myself a little bit of space, it's log, and then one log
00:04:35.280 | one, which is equal to zero.
00:04:37.440 | So is, is a very common word, and because it's so common, we
00:04:41.800 | literally do not care about it.
00:04:43.520 | And this IDF is multiplied by our term frequency.
00:04:48.080 | So we'd have TF down here.
00:04:49.800 | So if our term that we were searching for was is, then all of these sequences
00:04:56.640 | would get a value of zero, which might seem counterintuitive, but if your, if
00:05:03.320 | your query is something like, is, is the best city in the forest, right, you know,
00:05:11.040 | from the, from the top sentence there, you wouldn't want to be prioritizing
00:05:16.600 | the word is, or the, or in, because those words are everywhere.
00:05:21.680 | So we don't really care about those words.
00:05:23.920 | We wouldn't care about the more unique words.
00:05:26.000 | And that's what you'd get down here.
00:05:28.440 | So we have a query for forest, the number of documents, exactly the same, still three.
00:05:34.720 | And how many documents contain the word forest?
00:05:38.720 | So only one of them, we have A at the top.
00:05:41.160 | So we would have log three.
00:05:46.560 | And what is log three?
00:05:48.360 | Log three is 0.48.
00:05:52.560 | So then we times that by TF and we get a larger number than we would for is.
00:05:59.640 | So that is where the IDF or inverse document frequency comes in.
00:06:05.440 | It essentially gives us a multiplier for less common words, because chances
00:06:13.080 | are, they're more relevant to our query.
00:06:14.760 | Now, what we can see here is just a work through of what we just looked at.
00:06:21.280 | So we're calculating the TF for each document and the word is up here.
00:06:25.680 | We calculate the IDF and then we multiply those two together.
00:06:31.000 | And because is is everywhere, the IDF is equal to zero and therefore
00:06:35.600 | the outcome is always zero because just not a relevant word that we care about.
00:06:39.880 | Now, on the other hand, the word forest, when we calculate, when we
00:06:50.120 | calculate the TF IDF for each of these, the TF for B and C is zero because the
00:06:59.360 | term just doesn't appear in those documents and then the IDF is higher
00:07:03.440 | because forest is a rare, rarer word.
00:07:06.280 | So it's 0.48.
00:07:08.240 | Now the outcome of that is that only this value here.
00:07:13.720 | So this is the TF IDF for the document A is not zero.
00:07:20.600 | And that's essentially how TF IDF works.
00:07:24.040 | Now let's jump across to Python.
00:07:27.760 | Let's have a quick look at how we'd write our code.
00:07:31.200 | So we have, we have those three sentences again up here.
00:07:35.080 | So let's run that.
00:07:36.680 | And then just here, we can see, okay, we import NumPy, we're using NumPy here.
00:07:43.200 | We're merging all the documents here.
00:07:45.520 | Now, the reason we do this, so they merge into a list of lists is just so we can
00:07:52.320 | more easily calculate the terms N and also the term NQ equals our query, which
00:08:02.680 | you can see just here in the IDF section.
00:08:07.360 | Now, once we do get to TF IDF here, so we calculate the IDF, which
00:08:14.720 | I just explained anyway, and then the, also the term frequency, which is just
00:08:19.520 | the number of matches for our specific word in a given sentence.
00:08:25.640 | So let's run that and let's have a look.
00:08:29.960 | TF IDF for the word.
00:08:33.400 | Let's go over is like we did before in a, what do we get?
00:08:39.240 | We get zero as we, as we calculated before.
00:08:42.280 | And how about forests?
00:08:44.720 | And here we get the 0.06, obviously it's more precise here, but before
00:08:51.400 | we got 0.06 in our other screen.
00:08:55.560 | Now that's TF IDF, reasonably straightforward.
00:08:59.720 | However, I did say that these are vectors, right?
00:09:04.400 | These are sparse vectors.
00:09:05.640 | Obviously this doesn't look much like a vector.
00:09:08.160 | So how do we turn this into a vector?
00:09:12.040 | Well, what we do is we take our vocab.
00:09:17.480 | So the, all of the words that we have in all of our documents, so A, B and C.
00:09:24.160 | So if we, so we create our vocab and all we want to do is we take a set of
00:09:32.360 | our three documents, A plus B plus C, and this is just going to create a set
00:09:40.840 | containing all of the words across all of our documents.
00:09:43.240 | So let's have a quick look at what we, what we get from that.
00:09:47.400 | So our vocab, and this is just all, every single word that we have in there.
00:09:52.800 | Now to create our vector, what we want to do is take that
00:09:59.080 | vocab and mirror it into a vector.
00:10:01.880 | So for every word here, we're going to calculate the TF IDF for each document
00:10:08.240 | and store that in a list, which creates our TF IDF vector.
00:10:13.160 | So let me show you how that works.
00:10:15.400 | So let's first initialize our vector.
00:10:21.600 | So vector A, and what we're going to do is say for word in vocab, vectorA.append.
00:10:34.200 | And then here we're going to call TF IDF, and then we have our word.
00:10:39.240 | So we define that as word and our sentence.
00:10:42.760 | In this case, we're doing it sentence A, so write sentence A like that.
00:10:47.160 | And we can do this for each of our vectors.
00:10:50.880 | Let's just do A and B for now.
00:11:01.800 | So B here, and here, and here.
00:11:06.360 | Okay, and let's have a look at what we get.
00:11:13.160 | So we have vector A, and we see that we get essentially a vector.
00:11:17.400 | Now, we don't have that many documents or that many words here.
00:11:20.920 | So we're just getting the same number here, which is meaning that it appears
00:11:26.160 | in one document, and that document is A.
00:11:29.880 | Now, let's have a look at B, and we see we get a few different values here.
00:11:34.560 | And that's our TF IDF vector.
00:11:37.360 | That's how we build it.
00:11:39.200 | So that's TF IDF.
00:11:41.240 | Let's move on to BM 25.
00:11:45.080 | Now, BM 25 is a optimized version of TF IDF.
00:11:50.400 | So one of the problems of TF IDF is as the frequency of queries found
00:11:58.520 | in a document increase, the score increase linearly.
00:12:02.480 | So imagine you have an article, and in that 1,000-word article,
00:12:08.280 | it has the word "dog" maybe 10 times.
00:12:14.360 | Now, there's a good chance that that article is talking about dogs, right?
00:12:20.680 | Now, if you double the number of words that are "dog" to 20,
00:12:26.240 | the TF IDF score for that document will double as well.
00:12:31.760 | And that, if you think about it logically, is a document or an article
00:12:39.960 | that has the word "dog" 10 times, half as relevant as this same article
00:12:45.760 | that has the word "dog" 20 times.
00:12:48.800 | Probably not.
00:12:49.360 | Maybe it is slightly less relevant, but not quite that much.
00:12:53.560 | And that's why BM 25 comes in.
00:12:55.600 | It almost normalizes that value.
00:12:58.840 | And we'll visualize that soon as well.
00:13:01.240 | But let's explain this because it looks pretty horrific.
00:13:04.360 | Now, here at the top, we have very similar to TF IDF.
00:13:13.040 | We have our essentially term frequency here, and then we have our IDF here.
00:13:19.760 | Now, in our term frequency, we can see that we have these two values here.
00:13:31.360 | And this is the frequency of the query, frequency of terms.
00:13:35.560 | And this is pulled straight from TF IDF.
00:13:39.960 | But then we have a few new terms around here.
00:13:45.160 | So we have K, which you can also see here as well.
00:13:48.560 | Now, K and B here are adjustable special parameters.
00:13:55.760 | K would typically be in the range of 1.25 by default, and B around 0.75.
00:14:07.760 | And we can modify these based on our use case to optimize the algorithm.
00:14:15.760 | Now, another new one that we have over here is this D average.
00:14:19.360 | Now, this D here, that is the current document length.
00:14:23.960 | So I'll call it D length.
00:14:26.960 | And then over here, we have the average document length.
00:14:33.360 | So that's all of our documents.
00:14:36.160 | So it's all of the documents, their length divided by the number of documents.
00:14:46.960 | So that would be N.
00:14:50.760 | Now, that is the TF component and how it's been modified.
00:14:56.960 | And then let's have a look at the IDF component.
00:15:01.960 | Now, the IDF component, we see these terms here and here,
00:15:07.760 | which are exactly the same as the two terms that we have here.
00:15:11.360 | So it's the number of documents versus the number of documents that contain that query.
00:15:21.160 | And then over here, we just have a few constant values.
00:15:27.160 | So although the BM25 algorithm looks pretty complex,
00:15:32.960 | it's not that much different from the TF IDF algorithm.
00:15:38.960 | So let's go ahead and implement that in Python.
00:15:45.160 | Now, over in Python, we have these sentences or documents here.
00:15:52.560 | We're going to add all these together in docs, just makes it a little bit easier for us.
00:15:57.760 | And we still have this TF IDF function.
00:16:00.960 | Now, I want to look at this and compare it to what we would do for BM25.
00:16:08.360 | So the first thing is, let me just make this a little more readable.
00:16:15.960 | So sentence account here, we turn it into frequency.
00:16:21.960 | Okay, so change that to frequency, like so.
00:16:26.960 | Now, we do the same here. We still want our frequency.
00:16:30.160 | And then this is our term frequency component of TF IDF and BM25.
00:16:37.560 | Now, let's have a look at these.
00:16:39.560 | So we have K and B. Those are our adjustable parameters.
00:16:46.560 | The only sort of new thing we have here really is the average DL,
00:16:52.160 | which is the average length of all documents.
00:16:56.960 | So how do we calculate that? We just calculate it outside the function,
00:17:00.960 | because this is for all documents.
00:17:02.360 | It doesn't matter which sentence or which word we're looking at.
00:17:04.960 | It's going to be the same for each one.
00:17:07.360 | So we just sum the length of each sentence for every sentence in our docs.
00:17:14.560 | In fact, this here, we can change that to docs, like so.
00:17:24.360 | So we change that. Okay.
00:17:27.960 | Now, let's just run it quickly, see what we get.
00:17:31.560 | So BM25, purple. And in sentence B, there is no purple.
00:17:39.760 | So we would expect zero, which is what we get.
00:17:42.360 | So we know this is at least working, right?
00:17:47.360 | So let's do the same for sentence A, and we should get a score.
00:17:52.760 | So 1.76, which is what we'd expect. So that looks good.
00:17:59.760 | Now, the other bit that is slightly different, this is really quite simple, is this bit here.
00:18:07.560 | So the IDF component. Now, IDF up here. What do we do?
00:18:12.560 | We are just taking the length of the documents, dividing by the sum here.
00:18:19.560 | And that's exactly what we do here.
00:18:20.960 | So this here, we could rewrite if we want to match to the below as n_q, like so.
00:18:31.260 | And that's the only real difference.
00:18:35.160 | The other part is, so here we're using log to the base 10, and down here we're using the natural logarithm.
00:18:42.560 | So there's a small, there's also that difference as well.
00:18:48.760 | But otherwise, we're just adding the 0.5 and the 1s and rearranging the n and the n_q.
00:18:55.460 | And of course, this here would be n.
00:18:59.660 | Now, that's BM25 and TF-IDF, but what does that actually look like?
00:19:09.260 | So when we compare these two algorithms, well, TF-IDF, it's the score for TF-IDF increases linearly
00:19:20.660 | with the frequency for number of words or number of matching terms. So it goes up like this.
00:19:29.660 | OK, so that's TF-IDF. Now, BM25 is slightly different.
00:19:35.760 | BM25, instead of going up like that, it does something which looks more like this.
00:19:41.760 | So it increases a lot quickly, and then it kind of levels off.
00:19:46.260 | So if you have four words for BM25 or you have eight, eight will become slightly more relevant.
00:19:57.960 | The score will show that it's slightly more relevant, but not that much.
00:20:03.160 | Now, for TF-IDF, the difference is the relevance is doubled.
00:20:08.560 | And depending on your use case, maybe that is more along the lines of what you want.
00:20:15.660 | But I do believe in most cases, BM25 is probably more realistic.
00:20:23.560 | So plus those two, let's move on to our final one, dense vectors with Stanton-Spert.
00:20:31.960 | Now, our final algorithm is Spert.
00:20:38.160 | So Spert uses something called dense representations of language.
00:20:47.560 | And what that means is rather than having a vector like we saw with TF-IDF,
00:20:52.860 | and which is the same for BM25, where it has a lot of zeros and then the odd number here and there,
00:21:03.660 | which is called a sparse vector, we have something which has many more values in there.
00:21:10.160 | So there's essentially more information crammed into a smaller space.
00:21:19.160 | And the effect of this is that you can represent language in a more meaningful manner.
00:21:26.160 | So the word, for example, 'hi' would be in a very similar space to the word 'hello'.
00:21:34.660 | And so would words like 'days of the week', so 'Monday', 'Tuesday' and so on.
00:21:44.360 | And the way SentenceBERT works is that we have a transform model.
00:21:49.060 | We have a BERT, and our words or our query is processed, it comes in through here.
00:21:59.060 | It's processed by many BERT encoding layers, and then we get this dense vector.
00:22:06.860 | Now, as well as our query, so Q, so are our documents.
00:22:12.860 | They are also processed through the same encoder network, and that produces dense vectors.
00:22:22.360 | Now, once we have both of those dense vectors, so we have, this can be Q, and then we also have D.
00:22:35.560 | We use cosine similarity between both of those to calculate how similar they are.
00:22:46.060 | So how close to each other they are, the angle between them.
00:22:55.060 | And that works pretty much like this. So we have over here, we have sort of a lone vector.
00:23:06.060 | And then over here, we have two vectors which are much more similar.
00:23:13.060 | Or at least they share the same direction.
00:23:16.060 | And these two have a much smaller angle between them than either of those two do with this other vector,
00:23:24.560 | which is all the way over there.
00:23:27.060 | And that's essentially how cosine similarity works.
00:23:29.060 | It just finds the angle between two vectors, and where the angle is smaller, they are more similar.
00:23:38.060 | Where the angle is greater, they are less similar.
00:23:46.060 | Now, we're not going to implement that from scratch, because that would take a long time, and be pretty hard to do, to be honest.
00:23:58.560 | So what we're going to do is use the Sentence Transformers library,
00:24:04.060 | which is a very, very good library that uses Hugging Faces Transformers under the hood,
00:24:10.560 | and has super easy implementations of Sentence Transformers, which is essentially what we just saw,
00:24:17.560 | but I was using the example of this specific one called SentenceBERT,
00:24:24.560 | which is also what we're going to be using here.
00:24:26.560 | So we run this, we're using this BERTBaseNLIMeanTokens model.
00:24:35.560 | And we have all of our sentences here.
00:24:37.560 | So first thing we need to do is we initialize our model here,
00:24:42.560 | and then after that we encode all of our sentences with model.encode.
00:24:47.560 | So we do that, and we come down here.
00:24:52.560 | So here we've encoded, so what this has produced is the sentence embedding.
00:24:58.560 | So once the text has been processed by our BERT model, it outputs these sentence embeddings.
00:25:05.560 | And they are vectors that represent the full sentence or the full document that we have input.
00:25:14.560 | Now, once we get those out into our sentence embeddings variable there,
00:25:20.560 | we use the cosine similarity function.
00:25:24.560 | Now, here we're just going to use it, we're going to use scikit-learn's implementation.
00:25:29.560 | And what we're going to do, because we have many sentence embeddings here,
00:25:37.560 | let's just have a look.
00:25:39.560 | So we have sentence embeddings, let's have a look at shape.
00:25:44.560 | And you see that we have seven sets of embeddings here.
00:25:48.560 | This here, this 768 is the size of a single sentence embedding vector.
00:25:54.560 | Now, we have seven, why do we have seven? Because we have seven sentences up here.
00:26:03.560 | So, what we're going to do here is loop through each one,
00:26:07.560 | and calculate the cosine similarity between that sentence and all of the other sentences.
00:26:13.560 | So, run that.
00:26:18.560 | And we will get this array, which shows us all of our scores between all of our sentences,
00:26:25.560 | all possible combinations.
00:26:30.560 | Now, that's pretty hard to look at, so let's visualize it using matplotlib.
00:26:36.560 | Okay, and we get this nice visual here.
00:26:39.560 | We can see here we have A on the left, and it has a very high similarity to the vector A.
00:26:48.560 | So, I haven't filtered those out or anything, we are comparing the same vector to the same vector.
00:26:54.560 | And, as you would expect, they get a very good score, because they are exactly the same.
00:26:59.560 | They all score one.
00:27:02.560 | So, there's no problem there, we just ignore those ones in the middle.
00:27:06.560 | Now, let's go on to the next one.
00:27:09.560 | So, we have these here, so C and B.
00:27:13.560 | Okay, those two seem to have a very high similarity, 0.72, super high.
00:27:19.560 | So, let's have a look.
00:27:23.560 | So, B and C.
00:27:25.560 | Now, these are the two that both talk about throwing bananas.
00:27:29.560 | Actually, no, B talks about throwing bananas onto the street, C talks about finding bananas on the street.
00:27:36.560 | So, yeah, they're pretty similar, right?
00:27:38.560 | But they have a lot of the same words, so fair enough.
00:27:42.560 | TF-IDF and BM25 both identify those as being similar as well.
00:27:46.560 | So, it's good, but it doesn't blow you away.
00:27:50.560 | Now, how about B here and G?
00:27:54.560 | So, what I've done for G here is I've just taken the sentence B and swapped the words around,
00:28:04.560 | so that we don't really have the matching words.
00:28:07.560 | So, throwing bananas, instead of throwing, I'm using the word "bombard".
00:28:11.560 | Instead of the word "street", I'm using "road".
00:28:14.560 | Instead of "bananas", I'm using "yellow fruit", right?
00:28:17.560 | So, it has a similar sort of meaning, but they're not using the same words.
00:28:23.560 | So, TF-IDF and BM25 would really struggle here.
00:28:27.560 | Now, let's go down and let's have a look.
00:28:32.560 | So, B and G.
00:28:34.560 | So, we have B here, G at the end, and yeah, they have a really good score.
00:28:39.560 | 0.66 is the second highest behind B and C.
00:28:44.560 | So, that just goes to show that sentence BERT, or sentence transformers in general, don't require the same words to be used.
00:28:53.560 | They rely more on the semantic meaning of those words, which is, I think, incredibly cool.
00:29:00.560 | So, that's it for this video.
00:29:05.560 | We've covered quite a lot.
00:29:08.560 | The sparse vectors, TF-IDF and BM25, and also dense vector sentence BERT, or sentence transformers in general.
00:29:17.560 | So, I hope you've enjoyed the video.
00:29:20.560 | Thank you for watching, and I'll see you in the next one.