back to index

3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)


Chapters

0:0 Intro
0:23 Jaccard Similarity
2:39 w-shingling
7:17 Levenshtein Distance

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi, welcome to this video. We're going to have a look at three different traditional
00:00:05.580 | similarity search methods that we can use for comparing two separate chunks of text.
00:00:15.760 | Those three methods, as you can see on the screen, we have Jacquard Similarity, WShingling
00:00:21.680 | and Levenshtein Distance. So let's jump straight into Jacquard. So Jacquard is actually super
00:00:33.180 | simple. All we have is the intersection between two sets. So we have A and B up here. These
00:00:38.880 | can be something like this. So we'll just use numbers for now, 1, 2, 3. And let's say
00:00:48.840 | this is our set A. And we want to have a look at the intersection between that and another
00:00:55.080 | set which is set B. And maybe this is just 2 and 4. So our intersection here are simply
00:01:04.520 | the shared items that both of these sets have. So we just have 1 here. So the intersection
00:01:10.480 | would be 1. And then we also have the union up here. Now the union is still a set. So
00:01:16.780 | if we union all of these, we would get a total of 4 characters. Because we have 1, 2 and
00:01:25.600 | then we have 2 over here again. So that just creates 1 item. Then we have 3 and 4. So all
00:01:31.840 | together that's 4. We divide 1 by 4 and we get a Jacquard Similarity of 0.25. Now what
00:01:43.040 | does that look like in Python? Let's have a look. So this is our implementation in Python.
00:01:53.560 | Now it's pretty simple. All we have is we simply take our two sentences, X and Y. We
00:02:02.880 | convert them into sets. And then we calculate intersection and union between them. And Python
00:02:09.400 | has built in methods for both of those operations for sets which is pretty useful. So if we
00:02:17.240 | were to take our previous example. So we had Jacquard and we need to pass strings here.
00:02:24.680 | It's fine. All we're going to do is 1, 2 and 3. And then here I'm just going to put 2 and
00:02:32.680 | 4. Same as we had before. We get 0.25 which is what we got before. So that is Jacquard
00:02:39.480 | Similarity. Let's move on to WShingling. Now WShingling is very similar to Jacquard but
00:02:50.560 | rather than feeding in single words, we are feeding in n-grams. So an n-gram is essentially
00:03:01.640 | one or more words together. So for example, we place n with 1. That gives us a unigram.
00:03:15.880 | And a unigram would just be what we did before where we're just taking each word like that
00:03:22.000 | as it is. So that would give us a list of single words. Now if we move one up from that,
00:03:29.400 | we can do 2-grams which is a bigram. And a bigram is obviously taking two at a time.
00:03:39.240 | So we would take these two words together and then we would also take these two words
00:03:45.080 | together. And we would keep going through taking two by two all the way until we get
00:03:50.760 | to the end. And what we would get from that is a set where we have his thought. And then
00:04:04.960 | next we would have thought process. And so on and so on. So imagine we do the same as
00:04:22.360 | we did with Jacquard Similarity but we do it with these n-grams. So bigrams, trigrams
00:04:30.480 | or even unigrams, that is w-shingling. And if we were to write that out in Python, we
00:04:39.240 | would get something like this. So we use this comprehension here. And what we're going to
00:04:44.920 | do is iterate through each item in sentence A. So this one up here. We're splitting that
00:04:53.280 | by space characters. And what I want to do is access each item in that list. So what
00:05:04.520 | I'm going to do is put in range. And we're going to use the indexing to get each item
00:05:13.080 | within the list. So for A, I, for I in that range. So that's going to go through each
00:05:24.880 | item. So we run that and we get each character there. So I just need to make sure that we
00:05:35.120 | actually split the list before we do that. So we'll just do here, A.split. And that means
00:05:43.600 | I can also remove that. So now we have all the words. But what we want to do is get sets
00:05:52.000 | of two words at a time. So what we can do is we'll add a list here and we'll do A to
00:06:00.620 | I plus one. So we get both of those together. And because we're doing that, we also need
00:06:07.080 | to take a minus one at the end here so that we don't get an index error running through
00:06:13.320 | the whole list with this plus one. And then we get two lists of words. Now, the way that
00:06:22.800 | we want to implement it here, I mean, you don't have to do this, but I think it looks
00:06:28.800 | a bit cleaner, is we can use this join. And then we put both of those words together.
00:06:34.500 | And then at the end of that, we'll be taking a set. Okay. And then once we have that set,
00:06:43.680 | we just run that through this exact same function here. So we would pass our new W Shingling
00:06:51.300 | set to X and Y and calculate the Jaccard similarity between them. And that would be W Shingling.
00:07:00.020 | So that's why I said a very similar because the actual calculation itself is the same.
00:07:04.880 | It's just the approach that we take or the pre-processing on our strings before we actually
00:07:12.820 | feed it into the calculation. Now that's it for Jaccard and W Shingling. Let's move on
00:07:19.460 | to Levenshtein distance. Now, this is the formula for Levenshtein distance. And I know
00:07:26.940 | it looks quite confusing, or at least for me, looking at this the first time, it looks
00:07:33.920 | pretty complicated. If it doesn't, then that's great. But if it does look confusing, don't
00:07:39.220 | worry, we're going to break it down and work through it. And it will seem really simple
00:07:42.580 | once we're done. So this is the formula for Levenshtein distance. Now, an easy way to
00:07:52.580 | understand what is going on is using this matrix here. Now, with this matrix, I've sort
00:08:01.500 | of already filled in a few up here, but imagine they're not filled in because I'm going to
00:08:06.660 | explain why in a minute. So with this matrix, what we're going to do is we're going to
00:08:12.500 | work through each value or each cell within this matrix and calculate this formula for
00:08:21.660 | it. So we start up here. And at this point, we have we have our two values. So we're going
00:08:28.820 | to put i here. So this is these values or these rows are going to be i indices and here
00:08:35.980 | we're going to have j. Now at this point, j is equal to zero. And so is i, okay, because
00:08:43.380 | we are at cell zero, zero. So we start there, we come down to here. And we say, okay, if
00:08:56.300 | the minimum of i or j is zero, then you need to pass over to this point here. And then
00:09:05.500 | this tells you, you need to take the maximum value between either i or j. Now, i and j,
00:09:11.740 | they're both zero. So the maximum value between both of those is zero. So that means we remove
00:09:21.380 | this x here. And that means that this value, so we pass that back here, the Levenshtein
00:09:29.860 | distance for this first item here is zero. So that's what we do. Now, at this point,
00:09:37.140 | we move on to the next value. So we go this way. And then once we get to the end here,
00:09:44.300 | we move on to the next row and keep going that way. And all of these bits I've filled
00:09:55.340 | in here, they either have i equal to zero, or they're going to be j on this column here
00:10:05.980 | is equal to zero. So that means you pass it through here. And you take the maximum value
00:10:12.620 | of either i or j. So for example, if we let's do it here, we're just ignoring these two
00:10:20.500 | words and I'll explain how we or why those two words are relevant in a minute. At this
00:10:28.020 | point, j is equal to seven. So j is equal to seven. Now, that's because of we're on
00:10:36.580 | the seventh column here. So we've gone 0, 1, 2, and so on and so on till we get seven.
00:10:44.260 | So here j is equal to seven. So we have 0 and 7. We pass that through down here, we
00:10:52.900 | get to if the minimum of either of those is equal to zero, which it is, because i is equal
00:10:59.620 | to zero, we come over here. And we say, what's the maximum value of either of those? It's
00:11:07.860 | seven. So we remove all of this. And that means that this value here is equal to seven.
00:11:20.020 | Now that's easy enough. But what happens when we start getting into these values here. So
00:11:27.340 | this is where things change a little bit. Now, with Levenstein's distance, what we're
00:11:35.260 | essentially doing is taking these two words and saying, what is the minimum or optimum
00:11:40.420 | number of operations for us to convert one word into another word? And when we look at
00:11:46.900 | this equation down here, we actually see three alternative operations. And that is deletion.
00:11:55.100 | So here, delete. So this, maybe we have, for example, here, i, and over here, we have e.
00:12:09.660 | If we wanted to convert Levenstein into this Livingston at the top, we would probably want
00:12:15.980 | to delete this i, because then we would get, at the end, t-e-n, which matches up to what
00:12:25.380 | we have up here. So we would want to delete that character. And so we would process it
00:12:33.820 | through that delete operation. Below that, we have the insert operation. So this is inserting
00:12:42.460 | a new character. So maybe we're going the other way. Maybe we were here at n, and we
00:12:50.140 | were comparing against i. And we saw, OK, the best way for us to do this is to insert
00:12:56.340 | a new character, which would be i. And then here is the substitution operation. And this
00:13:04.940 | is where you're swapping one character for another character. So in the example here,
00:13:14.780 | going from e to i up here, we might want to substitute that e for an i. So they're the
00:13:29.780 | three operations that we have there. And whether you want to remember that or not, it's not
00:13:38.180 | really that important, because we can implement this. And we don't really need to think too
00:13:42.820 | much about what each of those operations is actually doing. So let's just remove those
00:13:49.540 | for now. And let's work through an example. So we have this first example here at j equals
00:13:58.020 | 1, and i equals 1. So we have 1 and 1. Now we come down here. We have this. Is the minimum
00:14:08.980 | of i or j equal to 0? No, they're both 1. So that means we have to go to the else here.
00:14:15.140 | And then we pass through this. So what we are saying here is, what is the minimum value
00:14:26.460 | of all of these? So what is the Levenshtein value that we already calculated at i minus
00:14:34.080 | 1 j? So i minus 1 j is at position 0, 1, because it's 1 minus 1 for the i, and just 1 for the
00:14:46.180 | j. So what do we have there? So 0, 1. So i, 0, and j, 1. Here, we have this 1 value. Because
00:14:59.020 | at this point, we've already calculated, because we go in that order. We go start from the
00:15:04.820 | top, and go all the way right, onto the next row, and so on. So we always have these numbers
00:15:09.740 | calculated by the time we get to comparing the minimum value.
00:15:15.700 | So we have 0 and 1 there. And the number is 1. So let's remove that. And we say, OK, this
00:15:28.360 | is equal to 1. What's the next part? We have i and j minus 1. So at position 1, and j minus
00:15:42.460 | 1, which is 0. So what do we have there? So we have here, and here, we again have another
00:15:49.580 | 1. OK, so we say that's equal to 1. Then we come to this final one here, i minus 1, j
00:15:58.140 | minus 1. OK, so that means 0, 0. So position 0, 0. What do we have there? OK, we have this
00:16:07.340 | number 0. There we go.
00:16:12.420 | So we take the minimum value of those, which is obviously this 0 value here. And we have--
00:16:19.780 | so this is going to be equal to 0. Then we have this, this here. And what this part of
00:16:27.060 | the formula is telling us is we need to add 1 if ai is not equal to bi. So up here, something
00:16:38.380 | I didn't explain, but a is this word. And b is the other word. So at ai, we have l.
00:16:52.260 | And at jb, we have l. Are both of those the same? Yes, they are. So that means we don't
00:16:59.120 | need to add the 1. OK, so then that means that our value here is equal to 0. And this
00:17:11.220 | looks like a 0. I didn't mean for it to. It's actually supposed to be a circle. So this
00:17:15.580 | is a 0. Now, let's remove all that for now. And let's do one more.
00:17:29.660 | So let's try and make this one quick. So here, we have i is equal to 1, j is equal to 2.
00:17:37.140 | We come down. We know that neither of those are 0. We come to here, come to here. OK,
00:17:43.540 | what is at position 0, 2? 0, 2, we have a 2. So this is equal to 2. At position 1, 1,
00:17:59.300 | what do we have? So 1, 1. So this is a 0. And then we have position i minus 1, j minus
00:18:13.420 | 1. So that is i equals 0 and j equals 1, which is this 1 here. So we have 1. What's the minimum
00:18:23.180 | value of all those? It's a 0. And are ai and bj the same here? Well, ai is our l again.
00:18:37.580 | And bj is an i this time. This is i up here. So they're not the same. So this time, we
00:18:46.460 | do 1 plus 0. So then, up here, we put a 1. And then we would just keep doing that for
00:18:56.080 | every single value going in this direction. We'd go to the end, and then we'd move on
00:19:01.740 | to the next row, going in this direction again, until we get all the way to the end. By the
00:19:06.900 | time we get to the end, it's going to look something like this. And what we do when we
00:19:15.860 | get this is we look at the value that we see in the bottom right corner, which is here.
00:19:24.620 | Or if you're in Python, we would get that value by indexing at minus 1, minus 1. And
00:19:35.420 | that is our Levenshtein value. And you can see this is almost like the optimal path through
00:19:46.380 | our matrix. So we're just following this line all the way over here. So here, we had to
00:19:51.980 | move a little bit. And then we find our optimum number of operations is 3. So there's 3 deletions,
00:20:02.180 | insertions, or substitutions needed to go from our initial word over to our other word.
00:20:10.860 | Now how does that work in Python? Well, in Python, it looks like this. So at the start
00:20:20.500 | here, we add an additional character. So if you remember from the matrix, you see we had
00:20:31.260 | this first initial part here. These are just empty characters. And we replicate that in
00:20:39.260 | our code by adding these spaces here. And then after that, what we do is initialize
00:20:48.260 | a empty array. So that is our first stage that we saw before. So we have that empty
00:20:55.420 | array. It's just going to be 0s at this point. And then we begin iterating through each value
00:21:00.860 | and finding the best path. So we have i and j here, as we saw before, exactly the same.
00:21:11.500 | So we have i, which corresponds to our word or sentence a, and j, which corresponds to
00:21:17.100 | our word or sentence b. And then we say, if the minimum of i or j is equal to 0, then
00:21:26.000 | we calculate that position as the maximum of i or j. So exactly the same as what we
00:21:32.500 | covered before with that matrix and the formula. So that is the top line here. So this bit
00:21:45.780 | followed by this. So the exact same. We're saying, if the minimum, then choose the maximum.
00:21:53.620 | And then, in our logic, we move on to the else. So if the minimum of i or j is not equal
00:21:59.500 | to 0, just like we did in that formula, we do this. So we say, we calculate those three
00:22:07.100 | rows that we had. So those three possible operations with deletion, insertion, or substitution.
00:22:14.660 | And then we take the minimum value of all of those, so the best path. And then, if our
00:22:24.380 | two current characters don't match, we add 1. So again, we saw that in the formula before.
00:22:35.340 | Right here. So we're saying, if the characters a and b don't match, add 1. And if so, we
00:22:44.420 | do the same. So we run through. We do that for every position in our matrix. And then,
00:22:51.680 | what we do down here is return. I return matrix. And then, I also return the actual Levenstein
00:22:56.900 | distance from the final point in that matrix. So the bottom right value. So we have that
00:23:06.060 | here. So let's see how it does on our example. So we have these two words. I think this should
00:23:17.940 | be a h. Let's have a look. Yep. So we need that to be a h. So we're going to run both
00:23:28.060 | of those through our formula here. Let's have a look at what we get. OK. So run all the
00:23:38.840 | way through. So I can see here. So this is supposed to be an i Levenstein. OK. And then,
00:23:45.780 | we get this value. So we get, in our matrix, see that the final value is at 3. And then,
00:23:51.220 | we also pull that out here as well. So those are our three traditional similarity search
00:24:04.760 | techniques and how we can use them in language. Now, all of them are, I think, they're all
00:24:14.020 | very popular and with good reason. And it's definitely worth being able to use them. Now,
00:24:19.620 | the one place where these sort of struggle is where we are maybe looking at more semantical
00:24:27.120 | meaning. So for example, if I were to say the words "hi" and "hello," although they have
00:24:33.600 | the same meaning, none of these techniques would be able to identify those two words
00:24:39.480 | as being similar or matching. And in that case, we would kind of need to move into the
00:24:46.400 | more vector-based similarity search methods. And in particular, where we start using dense
00:24:54.320 | vectors for representing those words. But for simple comparisons, where we're comparing
00:25:02.780 | the syntax between two sentences or words, these traditional approaches are incredibly
00:25:09.400 | useful and incredibly powerful. But that's it for this video. So I hope you've enjoyed
00:25:17.160 | it. And I will see you in the next one.