back to index3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)
Chapters
0:0 Intro
0:23 Jaccard Similarity
2:39 w-shingling
7:17 Levenshtein Distance
00:00:00.000 |
Hi, welcome to this video. We're going to have a look at three different traditional 00:00:05.580 |
similarity search methods that we can use for comparing two separate chunks of text. 00:00:15.760 |
Those three methods, as you can see on the screen, we have Jacquard Similarity, WShingling 00:00:21.680 |
and Levenshtein Distance. So let's jump straight into Jacquard. So Jacquard is actually super 00:00:33.180 |
simple. All we have is the intersection between two sets. So we have A and B up here. These 00:00:38.880 |
can be something like this. So we'll just use numbers for now, 1, 2, 3. And let's say 00:00:48.840 |
this is our set A. And we want to have a look at the intersection between that and another 00:00:55.080 |
set which is set B. And maybe this is just 2 and 4. So our intersection here are simply 00:01:04.520 |
the shared items that both of these sets have. So we just have 1 here. So the intersection 00:01:10.480 |
would be 1. And then we also have the union up here. Now the union is still a set. So 00:01:16.780 |
if we union all of these, we would get a total of 4 characters. Because we have 1, 2 and 00:01:25.600 |
then we have 2 over here again. So that just creates 1 item. Then we have 3 and 4. So all 00:01:31.840 |
together that's 4. We divide 1 by 4 and we get a Jacquard Similarity of 0.25. Now what 00:01:43.040 |
does that look like in Python? Let's have a look. So this is our implementation in Python. 00:01:53.560 |
Now it's pretty simple. All we have is we simply take our two sentences, X and Y. We 00:02:02.880 |
convert them into sets. And then we calculate intersection and union between them. And Python 00:02:09.400 |
has built in methods for both of those operations for sets which is pretty useful. So if we 00:02:17.240 |
were to take our previous example. So we had Jacquard and we need to pass strings here. 00:02:24.680 |
It's fine. All we're going to do is 1, 2 and 3. And then here I'm just going to put 2 and 00:02:32.680 |
4. Same as we had before. We get 0.25 which is what we got before. So that is Jacquard 00:02:39.480 |
Similarity. Let's move on to WShingling. Now WShingling is very similar to Jacquard but 00:02:50.560 |
rather than feeding in single words, we are feeding in n-grams. So an n-gram is essentially 00:03:01.640 |
one or more words together. So for example, we place n with 1. That gives us a unigram. 00:03:15.880 |
And a unigram would just be what we did before where we're just taking each word like that 00:03:22.000 |
as it is. So that would give us a list of single words. Now if we move one up from that, 00:03:29.400 |
we can do 2-grams which is a bigram. And a bigram is obviously taking two at a time. 00:03:39.240 |
So we would take these two words together and then we would also take these two words 00:03:45.080 |
together. And we would keep going through taking two by two all the way until we get 00:03:50.760 |
to the end. And what we would get from that is a set where we have his thought. And then 00:04:04.960 |
next we would have thought process. And so on and so on. So imagine we do the same as 00:04:22.360 |
we did with Jacquard Similarity but we do it with these n-grams. So bigrams, trigrams 00:04:30.480 |
or even unigrams, that is w-shingling. And if we were to write that out in Python, we 00:04:39.240 |
would get something like this. So we use this comprehension here. And what we're going to 00:04:44.920 |
do is iterate through each item in sentence A. So this one up here. We're splitting that 00:04:53.280 |
by space characters. And what I want to do is access each item in that list. So what 00:05:04.520 |
I'm going to do is put in range. And we're going to use the indexing to get each item 00:05:13.080 |
within the list. So for A, I, for I in that range. So that's going to go through each 00:05:24.880 |
item. So we run that and we get each character there. So I just need to make sure that we 00:05:35.120 |
actually split the list before we do that. So we'll just do here, A.split. And that means 00:05:43.600 |
I can also remove that. So now we have all the words. But what we want to do is get sets 00:05:52.000 |
of two words at a time. So what we can do is we'll add a list here and we'll do A to 00:06:00.620 |
I plus one. So we get both of those together. And because we're doing that, we also need 00:06:07.080 |
to take a minus one at the end here so that we don't get an index error running through 00:06:13.320 |
the whole list with this plus one. And then we get two lists of words. Now, the way that 00:06:22.800 |
we want to implement it here, I mean, you don't have to do this, but I think it looks 00:06:28.800 |
a bit cleaner, is we can use this join. And then we put both of those words together. 00:06:34.500 |
And then at the end of that, we'll be taking a set. Okay. And then once we have that set, 00:06:43.680 |
we just run that through this exact same function here. So we would pass our new W Shingling 00:06:51.300 |
set to X and Y and calculate the Jaccard similarity between them. And that would be W Shingling. 00:07:00.020 |
So that's why I said a very similar because the actual calculation itself is the same. 00:07:04.880 |
It's just the approach that we take or the pre-processing on our strings before we actually 00:07:12.820 |
feed it into the calculation. Now that's it for Jaccard and W Shingling. Let's move on 00:07:19.460 |
to Levenshtein distance. Now, this is the formula for Levenshtein distance. And I know 00:07:26.940 |
it looks quite confusing, or at least for me, looking at this the first time, it looks 00:07:33.920 |
pretty complicated. If it doesn't, then that's great. But if it does look confusing, don't 00:07:39.220 |
worry, we're going to break it down and work through it. And it will seem really simple 00:07:42.580 |
once we're done. So this is the formula for Levenshtein distance. Now, an easy way to 00:07:52.580 |
understand what is going on is using this matrix here. Now, with this matrix, I've sort 00:08:01.500 |
of already filled in a few up here, but imagine they're not filled in because I'm going to 00:08:06.660 |
explain why in a minute. So with this matrix, what we're going to do is we're going to 00:08:12.500 |
work through each value or each cell within this matrix and calculate this formula for 00:08:21.660 |
it. So we start up here. And at this point, we have we have our two values. So we're going 00:08:28.820 |
to put i here. So this is these values or these rows are going to be i indices and here 00:08:35.980 |
we're going to have j. Now at this point, j is equal to zero. And so is i, okay, because 00:08:43.380 |
we are at cell zero, zero. So we start there, we come down to here. And we say, okay, if 00:08:56.300 |
the minimum of i or j is zero, then you need to pass over to this point here. And then 00:09:05.500 |
this tells you, you need to take the maximum value between either i or j. Now, i and j, 00:09:11.740 |
they're both zero. So the maximum value between both of those is zero. So that means we remove 00:09:21.380 |
this x here. And that means that this value, so we pass that back here, the Levenshtein 00:09:29.860 |
distance for this first item here is zero. So that's what we do. Now, at this point, 00:09:37.140 |
we move on to the next value. So we go this way. And then once we get to the end here, 00:09:44.300 |
we move on to the next row and keep going that way. And all of these bits I've filled 00:09:55.340 |
in here, they either have i equal to zero, or they're going to be j on this column here 00:10:05.980 |
is equal to zero. So that means you pass it through here. And you take the maximum value 00:10:12.620 |
of either i or j. So for example, if we let's do it here, we're just ignoring these two 00:10:20.500 |
words and I'll explain how we or why those two words are relevant in a minute. At this 00:10:28.020 |
point, j is equal to seven. So j is equal to seven. Now, that's because of we're on 00:10:36.580 |
the seventh column here. So we've gone 0, 1, 2, and so on and so on till we get seven. 00:10:44.260 |
So here j is equal to seven. So we have 0 and 7. We pass that through down here, we 00:10:52.900 |
get to if the minimum of either of those is equal to zero, which it is, because i is equal 00:10:59.620 |
to zero, we come over here. And we say, what's the maximum value of either of those? It's 00:11:07.860 |
seven. So we remove all of this. And that means that this value here is equal to seven. 00:11:20.020 |
Now that's easy enough. But what happens when we start getting into these values here. So 00:11:27.340 |
this is where things change a little bit. Now, with Levenstein's distance, what we're 00:11:35.260 |
essentially doing is taking these two words and saying, what is the minimum or optimum 00:11:40.420 |
number of operations for us to convert one word into another word? And when we look at 00:11:46.900 |
this equation down here, we actually see three alternative operations. And that is deletion. 00:11:55.100 |
So here, delete. So this, maybe we have, for example, here, i, and over here, we have e. 00:12:09.660 |
If we wanted to convert Levenstein into this Livingston at the top, we would probably want 00:12:15.980 |
to delete this i, because then we would get, at the end, t-e-n, which matches up to what 00:12:25.380 |
we have up here. So we would want to delete that character. And so we would process it 00:12:33.820 |
through that delete operation. Below that, we have the insert operation. So this is inserting 00:12:42.460 |
a new character. So maybe we're going the other way. Maybe we were here at n, and we 00:12:50.140 |
were comparing against i. And we saw, OK, the best way for us to do this is to insert 00:12:56.340 |
a new character, which would be i. And then here is the substitution operation. And this 00:13:04.940 |
is where you're swapping one character for another character. So in the example here, 00:13:14.780 |
going from e to i up here, we might want to substitute that e for an i. So they're the 00:13:29.780 |
three operations that we have there. And whether you want to remember that or not, it's not 00:13:38.180 |
really that important, because we can implement this. And we don't really need to think too 00:13:42.820 |
much about what each of those operations is actually doing. So let's just remove those 00:13:49.540 |
for now. And let's work through an example. So we have this first example here at j equals 00:13:58.020 |
1, and i equals 1. So we have 1 and 1. Now we come down here. We have this. Is the minimum 00:14:08.980 |
of i or j equal to 0? No, they're both 1. So that means we have to go to the else here. 00:14:15.140 |
And then we pass through this. So what we are saying here is, what is the minimum value 00:14:26.460 |
of all of these? So what is the Levenshtein value that we already calculated at i minus 00:14:34.080 |
1 j? So i minus 1 j is at position 0, 1, because it's 1 minus 1 for the i, and just 1 for the 00:14:46.180 |
j. So what do we have there? So 0, 1. So i, 0, and j, 1. Here, we have this 1 value. Because 00:14:59.020 |
at this point, we've already calculated, because we go in that order. We go start from the 00:15:04.820 |
top, and go all the way right, onto the next row, and so on. So we always have these numbers 00:15:09.740 |
calculated by the time we get to comparing the minimum value. 00:15:15.700 |
So we have 0 and 1 there. And the number is 1. So let's remove that. And we say, OK, this 00:15:28.360 |
is equal to 1. What's the next part? We have i and j minus 1. So at position 1, and j minus 00:15:42.460 |
1, which is 0. So what do we have there? So we have here, and here, we again have another 00:15:49.580 |
1. OK, so we say that's equal to 1. Then we come to this final one here, i minus 1, j 00:15:58.140 |
minus 1. OK, so that means 0, 0. So position 0, 0. What do we have there? OK, we have this 00:16:12.420 |
So we take the minimum value of those, which is obviously this 0 value here. And we have-- 00:16:19.780 |
so this is going to be equal to 0. Then we have this, this here. And what this part of 00:16:27.060 |
the formula is telling us is we need to add 1 if ai is not equal to bi. So up here, something 00:16:38.380 |
I didn't explain, but a is this word. And b is the other word. So at ai, we have l. 00:16:52.260 |
And at jb, we have l. Are both of those the same? Yes, they are. So that means we don't 00:16:59.120 |
need to add the 1. OK, so then that means that our value here is equal to 0. And this 00:17:11.220 |
looks like a 0. I didn't mean for it to. It's actually supposed to be a circle. So this 00:17:15.580 |
is a 0. Now, let's remove all that for now. And let's do one more. 00:17:29.660 |
So let's try and make this one quick. So here, we have i is equal to 1, j is equal to 2. 00:17:37.140 |
We come down. We know that neither of those are 0. We come to here, come to here. OK, 00:17:43.540 |
what is at position 0, 2? 0, 2, we have a 2. So this is equal to 2. At position 1, 1, 00:17:59.300 |
what do we have? So 1, 1. So this is a 0. And then we have position i minus 1, j minus 00:18:13.420 |
1. So that is i equals 0 and j equals 1, which is this 1 here. So we have 1. What's the minimum 00:18:23.180 |
value of all those? It's a 0. And are ai and bj the same here? Well, ai is our l again. 00:18:37.580 |
And bj is an i this time. This is i up here. So they're not the same. So this time, we 00:18:46.460 |
do 1 plus 0. So then, up here, we put a 1. And then we would just keep doing that for 00:18:56.080 |
every single value going in this direction. We'd go to the end, and then we'd move on 00:19:01.740 |
to the next row, going in this direction again, until we get all the way to the end. By the 00:19:06.900 |
time we get to the end, it's going to look something like this. And what we do when we 00:19:15.860 |
get this is we look at the value that we see in the bottom right corner, which is here. 00:19:24.620 |
Or if you're in Python, we would get that value by indexing at minus 1, minus 1. And 00:19:35.420 |
that is our Levenshtein value. And you can see this is almost like the optimal path through 00:19:46.380 |
our matrix. So we're just following this line all the way over here. So here, we had to 00:19:51.980 |
move a little bit. And then we find our optimum number of operations is 3. So there's 3 deletions, 00:20:02.180 |
insertions, or substitutions needed to go from our initial word over to our other word. 00:20:10.860 |
Now how does that work in Python? Well, in Python, it looks like this. So at the start 00:20:20.500 |
here, we add an additional character. So if you remember from the matrix, you see we had 00:20:31.260 |
this first initial part here. These are just empty characters. And we replicate that in 00:20:39.260 |
our code by adding these spaces here. And then after that, what we do is initialize 00:20:48.260 |
a empty array. So that is our first stage that we saw before. So we have that empty 00:20:55.420 |
array. It's just going to be 0s at this point. And then we begin iterating through each value 00:21:00.860 |
and finding the best path. So we have i and j here, as we saw before, exactly the same. 00:21:11.500 |
So we have i, which corresponds to our word or sentence a, and j, which corresponds to 00:21:17.100 |
our word or sentence b. And then we say, if the minimum of i or j is equal to 0, then 00:21:26.000 |
we calculate that position as the maximum of i or j. So exactly the same as what we 00:21:32.500 |
covered before with that matrix and the formula. So that is the top line here. So this bit 00:21:45.780 |
followed by this. So the exact same. We're saying, if the minimum, then choose the maximum. 00:21:53.620 |
And then, in our logic, we move on to the else. So if the minimum of i or j is not equal 00:21:59.500 |
to 0, just like we did in that formula, we do this. So we say, we calculate those three 00:22:07.100 |
rows that we had. So those three possible operations with deletion, insertion, or substitution. 00:22:14.660 |
And then we take the minimum value of all of those, so the best path. And then, if our 00:22:24.380 |
two current characters don't match, we add 1. So again, we saw that in the formula before. 00:22:35.340 |
Right here. So we're saying, if the characters a and b don't match, add 1. And if so, we 00:22:44.420 |
do the same. So we run through. We do that for every position in our matrix. And then, 00:22:51.680 |
what we do down here is return. I return matrix. And then, I also return the actual Levenstein 00:22:56.900 |
distance from the final point in that matrix. So the bottom right value. So we have that 00:23:06.060 |
here. So let's see how it does on our example. So we have these two words. I think this should 00:23:17.940 |
be a h. Let's have a look. Yep. So we need that to be a h. So we're going to run both 00:23:28.060 |
of those through our formula here. Let's have a look at what we get. OK. So run all the 00:23:38.840 |
way through. So I can see here. So this is supposed to be an i Levenstein. OK. And then, 00:23:45.780 |
we get this value. So we get, in our matrix, see that the final value is at 3. And then, 00:23:51.220 |
we also pull that out here as well. So those are our three traditional similarity search 00:24:04.760 |
techniques and how we can use them in language. Now, all of them are, I think, they're all 00:24:14.020 |
very popular and with good reason. And it's definitely worth being able to use them. Now, 00:24:19.620 |
the one place where these sort of struggle is where we are maybe looking at more semantical 00:24:27.120 |
meaning. So for example, if I were to say the words "hi" and "hello," although they have 00:24:33.600 |
the same meaning, none of these techniques would be able to identify those two words 00:24:39.480 |
as being similar or matching. And in that case, we would kind of need to move into the 00:24:46.400 |
more vector-based similarity search methods. And in particular, where we start using dense 00:24:54.320 |
vectors for representing those words. But for simple comparisons, where we're comparing 00:25:02.780 |
the syntax between two sentences or words, these traditional approaches are incredibly 00:25:09.400 |
useful and incredibly powerful. But that's it for this video. So I hope you've enjoyed