
3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)


Chapters

0:00 Intro
0:23 Jaccard Similarity
2:39 w-shingling
7:17 Levenshtein Distance

Transcript

Hi, welcome to this video. We're going to have a look at three different traditional similarity search methods that we can use for comparing two separate chunks of text. Those three methods, as you can see on the screen, are Jaccard similarity, w-shingling, and Levenshtein distance. So let's jump straight into Jaccard.

So Jaccard is actually super simple. All we do is divide the size of the intersection between two sets by the size of their union. So we have A and B up here. These can be something like this. So we'll just use numbers for now: 1, 2, 3. And let's say this is our set A. And we want to have a look at the intersection between that and another set, which is set B.

And maybe this is just 2 and 4. So our intersection here is simply the shared items that both of these sets have. The only shared item is the 2, so the intersection contains just one item. And then we also have the union up here. Now, the union is still a set.

So if we union all of these, we would get a total of 4 items. Because we have 1, 2, and then we have 2 over here again, which collapses into a single item. Then we have 3 and 4. So all together that's 4. We divide 1 by 4 and we get a Jaccard similarity of 0.25.

Now what does that look like in Python? Let's have a look. So this is our implementation in Python. Now, it's pretty simple. All we do is take our two sentences, X and Y, convert them into sets, and then calculate the intersection and union between them. And Python has built-in methods for both of those set operations, which is pretty useful.

So if we were to take our previous example: we call Jaccard, and we need to pass strings here, which is fine. All we're going to do is pass 1, 2, and 3, and then here I'm just going to put 2 and 4, the same as we had before. We get 0.25, which is what we got before.
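As a sketch of what that implementation might look like (the function name `jaccard` and the exact signature are assumptions based on the call described):

```python
def jaccard(x: str, y: str) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    # Convert each sentence into a set of its whitespace-separated tokens.
    a = set(x.split())
    b = set(y.split())
    # Python sets have built-in intersection and union methods.
    shared = a.intersection(b)
    total = a.union(b)
    return len(shared) / len(total)

print(jaccard('1 2 3', '2 4'))  # 0.25
```

With the sets {1, 2, 3} and {2, 4}, the intersection is {2} (one item) and the union is {1, 2, 3, 4} (four items), giving 1/4 = 0.25 as in the walkthrough.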

So that is Jaccard similarity. Let's move on to w-shingling. Now, w-shingling is very similar to Jaccard, but rather than feeding in single words, we are feeding in n-grams. So an n-gram is essentially one or more words taken together. So, for example, if we set n to 1, that gives us a unigram.

And a unigram would just be what we did before, where we're taking each word as it is. So that would give us a list of single words. Now, if we move one up from that, we can do 2-grams, which is a bigram. And a bigram is, obviously, taking two words at a time.

So we would take these two words together, and then we would also take these two words together. And we would keep going through, taking two by two, all the way until we get to the end. And what we would get from that is a set where we have "his thought".

And then next we would have "thought process". And so on and so on. So imagine we do the same as we did with Jaccard similarity, but we do it with these n-grams. So bigrams, trigrams, or even unigrams: that is w-shingling. And if we were to write that out in Python, we would get something like this.

So we use this comprehension here. And what we're going to do is iterate through each item in sentence A. So this one up here. We're splitting that by space characters. And what I want to do is access each item in that list. So what I'm going to do is put in range.

And we're going to use indexing to get each item within the list. So A[i], for i in that range. So that's going to go through each item. So we run that and we get each word there. I just need to make sure that we actually split the string into a list before we do that.

So we'll just do A.split here. And that means I can also remove that. So now we have all the words. But what we want to do is get sets of two words at a time. So what we can do is add a list here containing A[i] and A[i + 1].

So we get both of those together. And because we're doing that, we also need to subtract one from the range at the end here, so that we don't get an index error running through the whole list with this plus one. And then we get two-word lists. Now, the way that we want to implement it here (you don't have to do this, but I think it looks a bit cleaner) is we can use this join.

And then we put both of those words together into a single string. And at the end of that, we'll be taking a set. Okay. And then once we have that set, we just run it through this exact same function here. So we would pass our new w-shingling sets to X and Y and calculate the Jaccard similarity between them.

And that would be w-shingling. So that's why I said it's very similar, because the actual calculation itself is the same. It's just the approach that we take, or the pre-processing on our strings, before we actually feed them into the calculation. Now, that's it for Jaccard and w-shingling.
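The whole pipeline described above can be sketched like this (the example sentences and the names `shingle` and `w` are assumptions for illustration):

```python
def shingle(text: str, w: int = 2) -> set:
    """Build the set of w-gram shingles from a sentence (w=1 gives unigrams, w=2 bigrams)."""
    words = text.split()
    # Join each window of w adjacent words back into a single string,
    # stopping early enough to avoid an index error at the end of the list.
    return set(' '.join(words[i:i + w]) for i in range(len(words) - w + 1))

def jaccard(x: set, y: set) -> float:
    """Jaccard similarity between two pre-built shingle sets."""
    return len(x.intersection(y)) / len(x.union(y))

a = 'his thought process was on so many levels'
b = 'his thought process is not so many levels'
print(shingle(a, 2))  # bigrams such as 'his thought', 'thought process', ...
print(jaccard(shingle(a, 2), shingle(b, 2)))  # 0.4
```

The calculation is identical to plain Jaccard similarity; only the pre-processing differs, with each sentence turned into a set of shingles rather than a set of single words.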

Let's move on to Levenshtein distance. Now, this is the formula for Levenshtein distance. And I know it looks quite confusing, or at least for me, looking at this the first time, it looks pretty complicated. If it doesn't, then that's great. But if it does look confusing, don't worry, we're going to break it down and work through it.

And it will seem really simple once we're done. So this is the formula for Levenshtein distance. Now, an easy way to understand what is going on is using this matrix here. Now, with this matrix, I've sort of already filled in a few up here, but imagine they're not filled in because I'm going to explain why in a minute.
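For reference, the formula being discussed is the standard Levenshtein recurrence, which looks something like this:

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0, \\[4pt]
  \min \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 \\
    \operatorname{lev}_{a,b}(i,j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1,j-1) + \mathbf{1}_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```

The three rows inside the min correspond to the deletion, insertion, and substitution operations covered below, and the indicator term adds one only when the characters a_i and b_j differ.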

So with this matrix, what we're going to do is work through each value, or each cell, within this matrix and calculate this formula for it. So we start up here. And at this point, we have our two values. So we're going to put i here.

So these rows are going to be the i indices, and here we're going to have j. Now, at this point, j is equal to zero, and so is i, because we are at cell zero, zero. So we start there, and we come down to here.

And we say, okay, if the minimum of i or j is zero, then you need to pass over to this point here. And then this tells you, you need to take the maximum value between either i or j. Now, i and j, they're both zero. So the maximum value between both of those is zero.

So that means we remove this x here. And that means that this value, so we pass that back here, the Levenshtein distance for this first item here is zero. So that's what we do. Now, at this point, we move on to the next value. So we go this way.

And then once we get to the end here, we move on to the next row and keep going that way. And for all of these cells I've filled in here, either i is equal to zero, or j, along this column here, is equal to zero.

So that means you pass it through here, and you take the maximum value of either i or j. So, for example, let's do it here, just ignoring these two words for now; I'll explain why those two words are relevant in a minute. At this point, j is equal to seven.

So j is equal to seven. Now, that's because we're on the seventh column here. So we've gone 0, 1, 2, and so on, until we get to seven. So here, j is equal to seven. So we have 0 and 7. We pass that through down here, and we get to: if the minimum of either of those is equal to zero, which it is, because i is equal to zero, we come over here.

And we say, what's the maximum value of either of those? It's seven. So we remove all of this. And that means that this value here is equal to seven. Now that's easy enough. But what happens when we start getting into these values here. So this is where things change a little bit.

Now, with Levenshtein distance, what we're essentially doing is taking these two words and saying: what is the minimum, or optimum, number of operations to convert one word into the other? And when we look at this equation down here, we actually see three alternative operations. And that is deletion.

So here, delete. So maybe we have, for example, here, i, and over here, we have e. If we wanted to convert Levenshtein into this Livingston at the top, we would probably want to delete this i, because then we would get, at the end, t-e-n, which matches up to what we have up here.

So we would want to delete that character. And so we would process it through that delete operation. Below that, we have the insert operation. So this is inserting a new character. So maybe we're going the other way. Maybe we were here at n, and we were comparing against i.

And we saw, OK, the best way for us to do this is to insert a new character, which would be i. And then here is the substitution operation. And this is where you're swapping one character for another character. So in the example here, going from e to i up here, we might want to substitute that e for an i.

So they're the three operations that we have there. And whether you want to remember that or not, it's not really that important, because we can implement this. And we don't really need to think too much about what each of those operations is actually doing. So let's just remove those for now.

And let's work through an example. So we have this first example here at j equals 1, and i equals 1. So we have 1 and 1. Now we come down here. We have this. Is the minimum of i or j equal to 0? No, they're both 1. So that means we have to go to the else here.

And then we pass through this. So what we are saying here is, what is the minimum value of all of these? So what is the Levenshtein value that we already calculated at i minus 1 j? So i minus 1 j is at position 0, 1, because it's 1 minus 1 for the i, and just 1 for the j.

So what do we have there? So 0, 1. So i is 0, and j is 1. Here, we have this 1 value. Because at this point, we've already calculated it, since we go in that order: we start from the top, go all the way to the right, move on to the next row, and so on.

So we always have these numbers calculated by the time we get to comparing the minimum value. So we have 0 and 1 there. And the number is 1. So let's remove that. And we say, OK, this is equal to 1. What's the next part? We have i and j minus 1.

So at position 1, and j minus 1, which is 0. So what do we have there? So we have here, and here, we again have another 1. OK, so we say that's equal to 1. Then we come to this final one here, i minus 1, j minus 1. OK, so that means 0, 0.

So position 0, 0. What do we have there? OK, we have this number 0. There we go. So we take the minimum value of those, which is obviously this 0 value here. And we have-- so this is going to be equal to 0. Then we have this, this here.

And what this part of the formula is telling us is we need to add 1 if a_i is not equal to b_j. So up here, something I didn't explain, but a is this word, and b is the other word. So at a_i, we have l, and at b_j, we have l.

Are both of those the same? Yes, they are. So that means we don't need to add the 1. OK, so then that means that our value here is equal to 0. And this looks like a 0. I didn't mean for it to. It's actually supposed to be a circle.

So this is a 0. Now, let's remove all that for now. And let's do one more. So let's try and make this one quick. So here, we have i is equal to 1, j is equal to 2. We come down. We know that neither of those are 0. We come to here, come to here.

OK, what is at position 0, 2? 0, 2, we have a 2. So this is equal to 2. At position 1, 1, what do we have? So 1, 1. So this is a 0. And then we have position i minus 1, j minus 1. So that is i equals 0 and j equals 1, which is this 1 here.

So we have 1. What's the minimum value of all those? It's a 0. And are ai and bj the same here? Well, ai is our l again. And bj is an i this time. This is i up here. So they're not the same. So this time, we do 1 plus 0.

So then, up here, we put a 1. And then we would just keep doing that for every single value going in this direction. We'd go to the end, and then we'd move on to the next row, going in this direction again, until we get all the way to the end.

By the time we get to the end, it's going to look something like this. And what we do when we get this is we look at the value that we see in the bottom right corner, which is here. Or if you're in Python, we would get that value by indexing at minus 1, minus 1.

And that is our Levenshtein value. And you can see this is almost like the optimal path through our matrix. So we're just following this line all the way over here. So here, we had to move a little bit. And then we find our optimum number of operations is 3.

So there's 3 deletions, insertions, or substitutions needed to go from our initial word over to our other word. Now how does that work in Python? Well, in Python, it looks like this. So at the start here, we add an additional character. So if you remember from the matrix, you see we had this first initial part here.

These are just empty characters. And we replicate that in our code by adding these spaces here. And then, after that, what we do is initialize an empty array. So that is our first stage that we saw before. So we have that empty array. It's just going to be 0s at this point.

And then we begin iterating through each value and finding the best path. So we have i and j here, as we saw before, exactly the same. So we have i, which corresponds to our word or sentence a, and j, which corresponds to our word or sentence b. And then we say, if the minimum of i or j is equal to 0, then we calculate that position as the maximum of i or j.

So exactly the same as what we covered before with that matrix and the formula. So that is the top line here. So this bit followed by this. So the exact same. We're saying, if the minimum, then choose the maximum. And then, in our logic, we move on to the else.

So if the minimum of i or j is not equal to 0, just like we did in that formula, we do this. So we calculate those three options that we had, the three possible operations: deletion, insertion, or substitution. Deletion and insertion each add one to the previous cell's value, and substitution adds one only if the two current characters don't match. And then we take the minimum value of all of those, so the best path.

So again, we saw that in the formula before. Right here. We're saying, if the characters from a and b don't match, add 1 to the diagonal term. So we run through and do that for every position in our matrix.

And then, what we do down here is return. I return the matrix, and then I also return the actual Levenshtein distance from the final point in that matrix, so the bottom-right value. So we have that here. So let's see how it does on our example. So we have these two words.

I think this should be an h. Let's have a look. Yep. So we need that to be an h. So we're going to run both of those through our function here. Let's have a look at what we get. OK. So run all the way through. So I can see here.

So this is supposed to be an i in Levenshtein. OK. And then, we get this value. So we see, in our matrix, that the final value is 3. And then, we also pull that out here as well. So those are our three traditional similarity search techniques and how we can use them with language.
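Putting the whole thing together, a sketch of the implementation described above might look like this. The use of NumPy, the function name `levenshtein`, and the example word pair are assumptions; the structure follows the steps in the walkthrough, and I've used the classic "kitten" to "sitting" pair, whose distance is known to be 3:

```python
import numpy as np

def levenshtein(a: str, b: str):
    """Return the full distance matrix and the Levenshtein distance between a and b."""
    # Add an additional empty character to the front of each word,
    # replicating the blank first row and column of the matrix.
    a = ' ' + a
    b = ' ' + b
    # Initialize the matrix as all zeros.
    matrix = np.zeros((len(a), len(b)), dtype=int)
    for i in range(len(a)):
        for j in range(len(b)):
            if min(i, j) == 0:
                # First row/column: the distance is just max(i, j).
                matrix[i, j] = max(i, j)
            else:
                # Minimum cost over the three operations; deletion and
                # insertion always cost one, while substitution costs one
                # only when the two current characters differ.
                matrix[i, j] = min(
                    matrix[i - 1, j] + 1,                   # deletion
                    matrix[i, j - 1] + 1,                   # insertion
                    matrix[i - 1, j - 1] + (a[i] != b[j]),  # substitution
                )
    # The distance is the bottom-right value, indexed with [-1, -1].
    return matrix, matrix[-1, -1]

matrix, dist = levenshtein('kitten', 'sitting')
print(dist)  # 3
```

Note that the +1 on the deletion and insertion terms is unconditional; only the substitution term is conditional on whether the characters match.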

Now, all of them are, I think, very popular, and with good reason, and it's definitely worth being able to use them. Now, the one place where these sorts of methods struggle is where we are looking at more semantic meaning. So, for example, if I were to take the words "hi" and "hello": although they have the same meaning, none of these techniques would be able to identify those two words as being similar or matching.

And in that case, we would kind of need to move into the more vector-based similarity search methods. And in particular, where we start using dense vectors for representing those words. But for simple comparisons, where we're comparing the syntax between two sentences or words, these traditional approaches are incredibly useful and incredibly powerful.

But that's it for this video. So I hope you've enjoyed it. And I will see you in the next one.