
Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)


Chapters

0:00 Intro
1:21 Overview
5:58 Shingling
8:45 Vocab
9:27 One-hot Encoding
11:10 MinHash
15:51 Signature Info
18:08 LSH
22:20 Tuning LSH

Transcript

Hi and welcome to the video. Today we're going to be covering another technique in similarity search called locality sensitive hashing or LSH. Now LSH is a hugely popular technique used in efficient similarity search. Now there are a huge number of companies that use similarity search. I mean you have big names like Google.

I mean Google is built from similarity search, and then you have Netflix, Amazon, Spotify. All of them are constantly recommending you different products, films, music, and they do that by comparing you to other customers. So they are performing a similarity search between you and other customers and identifying the most similar ones.

Now you have two approaches. You have exhaustive search, which compares all of the data points. I'm just going to call them vectors from now on because that's what we'll be using. So it compares all of these vectors, and obviously it's slow. Approximate search allows us to approximate those vectors, restrict our scope to a more relevant range of vectors, and so on.

So it covers a lot of different techniques. It's not just one technique here. The one we're going to be covering today is locality sensitive hashing. So at its core LSH is a hashing algorithm which attempts to maximize hash collisions. So what we see on screen right now is a dictionary, like a typical Python dictionary in the way that it hashes different items.

So we have our keys, which are items that we're hashing. We process them through a hashing function, and that hashing function attempts to minimize hashing collisions, i.e. to not put keys in the same bucket. It wants every key to go to a separate bucket. And then these are connected.

They don't contain the values, but they're connected to the values that relate back to our keys. So that's our Python dictionary. But we're not wanting to minimize collisions. We are wanting to maximize the collisions. So what we see here is a hashing function that maximizes those collisions.

So this is essentially what LSH is doing. So we are attempting to, for any similar keys, so these here and these here, they're all similar enough for us to want to put them into the same bucket. So we put two of them into here and then the other three into this bucket.

Now, there are quite a few different ways of doing this and there are a lot of different LSH methods. In fact, LSH is a very generic term that applies to a lot of different algorithms. And the one that we will be covering is what I see as the traditional version.

So it's the original version of LSH. And what we'll be covering in this video is shingling, minhashing, and that LSH function. So we'll get to understand why very soon. So here is the overview of the process that we're going to be walking through. So we have shingling. So we have at the very start, we have this text.

So flying fish flew by the space station. Now, that's just a string. And what we want to do is extract all of the unique pairs of characters. So when we say shingling, it's k-shingling. And in this case, our k value is two because we're taking two characters at once.

If we were to take k equals four, for example, then we would take 'pace' and then move on. We'd take 'ace' plus a space, and so on. So that's the shingling. And from that, we create a set. So if we have duplicate shingles, we remove those, so we just end up with one of each.

So in this, I don't know if we do have any duplicates, but say maybe down here, we had IN again, because we also have it up here. We would end up with just a single IN in the set. We wouldn't have two. And then we one-hot encode those. So that means we take a vocabulary from all of our text.

So not just this one sentence, but we'll have more than one sentence, obviously, that we're comparing. And we'll use that to build a one-hot vector from the vocab and our shingle set. Then we process that through something called a min hash function, which produces this dense vector or signature.

So this thing down here, that is what's called a signature. And then we band that into this final bit here. This is our actual LSH process. So we band that vector into multiple sub-vectors, and then we hash them. Where we find that any two sub-vectors go to the same hash bucket, the two full vectors that they come from are considered a candidate pair.

And we take those and then calculate the actual similarity between them. Okay. So the first step in our process, like we discussed, is the shingling operation. So shingling is simply where we take a window of length k characters, and we simply move that down through our text, like you can see here.

And from that, we create the shingle set. So in Python, what we would do to shingle these three sentences we have here is create a shingle function. This is going to take some text, which is a string. And we're going to define k, the number of characters we take within each window, which is obviously an integer.

Now we initialize our shingle set here; we'll make it a list initially. And then what we do is for i in range, and here we want to go up to the length of our text minus k, so minus that window length, plus one, because we want to go right up to the end.

And then here, all we do is shingle set.append. And then we write, so we have the text and we want to go from i up until i plus K. Okay, that's our shingle list, I suppose. And then we want to return a set. So this will remove any duplicates that we have.

So shingle set. Okay, so that's our shingle function. And we just want to process each one of our sentences through that. So we'll go a equals shingle, a. Also, we need to define K, which can be two. I'll just define K here. Okay, and then let's have a look at what we have.
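As a rough sketch, the shingle function and its usage might look something like this in code (the second and third sentences aren't shown in the transcript, so the ones below are just placeholders):

```python
def shingle(text: str, k: int) -> set:
    # slide a window of k characters down the text and collect each window
    shingle_list = []
    for i in range(len(text) - k + 1):
        shingle_list.append(text[i:i + k])
    # returning a set removes any duplicate shingles
    return set(shingle_list)

k = 2
a = shingle("flying fish flew by the space station", k)
# placeholder sentences - the transcript only shows the first one
b = shingle("he will not allow you to bring your sticks of dynamite", k)
c = shingle("he figured a few sticks of dynamite were easier than a fishing pole", k)
```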

And we see that we have this; it's shuffled, there's no order to our set here. And we see that we have all of the character pairs in there. So we have the 's' with a space before it for the start of the space part here, or station actually, could be either. And if we look, here we have the very start of fly, or flying, the 'ly' there as well, and 'in'.

So that's our shingle set. And with this, we have all of our shingles. So the next step is to create our vocabulary, which is just all of our shingle sets unioned together. So to create that, all we do is go a.union(b).union(c). Like that.

And we can see again, we have just a lot more text in there now, or a lot more, many more shingles. That is our vocab. So now we have our shingle set, and we have our vocab. So we can tick both of those off. Now what we need to do is create our one hot encoding over here.

And the only other thing we need is a zero vector. So there's two, well, I mean, there's more than two ways to do this. But I think there's two ways of thinking about it. Normally, the more efficient way would be to create a numpy array full of zeros, and then just add the ones in where we have matches between our vocab and shingle set.

But I'm not going to do that. I'm just going to keep things incredibly simple in the code that we're writing. So I'm going to do a one-hot here. Oh, the one thing we should do first is make the vocab a list, because we want order in our vocab, and not have it shuffled.

So what we do here is we say: one if x is in a, else zero, for x in vocab. So what we're doing here is looping through the vocab, and for every single shingle in there, we're saying: if it exists in our shingle set, make that point in our list a one; otherwise, make it a zero. So that's simply our one-hot encoding.
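A minimal sketch of the vocab and one-hot steps just described, reusing the shingle sets a, b, and c from before:

```python
# the vocab is the union of all shingle sets, kept as a list so it has a fixed order
vocab = list(a.union(b).union(c))

# one-hot encode each shingle set against the vocab
a_1hot = [1 if x in a else 0 for x in vocab]
b_1hot = [1 if x in b else 0 for x in vocab]
c_1hot = [1 if x in c else 0 for x in vocab]
```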

So if we do that for a, b, and c, and then have a look at our a one-hot, we see that we have this one-hot encoded, or sparse, array. Now, min-hashing is the next step in our process, and it allows us to convert what are currently sparse vectors into dense vectors, which we call signatures.

Now, what you see here is a run-through of how we do this for just one signature value. We want multiple values in each signature though, so we would actually run through this process multiple times. So what we're doing here is we're creating a randomly permuted array, which counts from one to the length of our vocab.

And then what we are essentially doing, so in this we're basically shuffling it, and then we're counting through until we find the first value that aligns to a one within our vector. In reality, you just take all of your values and find the minimum one that aligns to a one.

So that's if you're using NumPy, which we'll see later on. I'll just show you the code, I'm not going to actually write all of it though. So in code, that would look something like this. So we would start with a list, which is the range from one to the length of our vocab.

And if we have a look at that, we just see a count. We're going to shuffle that, so from random import shuffle, and we just do it like this. So it modifies it in place, so we don't need to do anything there. So let's view that. Okay, so now we've shuffled that, shuffled it twice now, but that's fine.

And let's just loop through a few of those. Or, we can loop through more. For i in range from one to ten, what we're going to say is: I just want to print i alongside the hash example index for that value. Okay, if we print that, we see, so one, the value one, where is it?

Here, it's at index 85; two is at 53, and so on. And essentially what we're doing here is saying: loop through these, identify this index, and check whether this index in our one-hot vector aligns to a one. You can see that here, we find the first one at eight.
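Something like the following sketch follows that walkthrough (hash_ex is an assumed name, and vocab and a_1hot come from the earlier steps):

```python
from random import shuffle

# a random permutation of the values 1..len(vocab)
hash_ex = list(range(1, len(vocab) + 1))
shuffle(hash_ex)

# print where the first few values sit within the permutation
for i in range(1, 10):
    print(f"{i} -> index {hash_ex.index(i)}")

# count up through 1, 2, 3, ... and stop at the first value whose
# position in the permutation lines up with a 1 in the one-hot vector
for i in range(1, len(vocab) + 1):
    idx = hash_ex.index(i)
    if a_1hot[idx] == 1:
        print(f"first match: {i}")
        break
```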

And that means that our signature value for this point is, or for this min hash vector and our one-hot sparse vector here, that signature value will be eight. And we repeat that for multiple min hash vectors, which is what you can see here. So if we were to work through this, so we start at one here, that does not align to a one.

So we work up to two, and we find that it does align to a one. So that is why we have this here. And then we go on to this one here, we find one does not align, two still does not align, three does not align, and four does align.

So then we assign a four in our min hash function. We go along and keep doing that to create our signature. Okay, so if we, I'm going to use these functions here, it's just what we wrote before, but put into a cleaner format. And what I'm going to do is create 20 min hash vectors, run that.

And then here we are going to run each of our one-hot sparse vectors through our create hash function, which is here. And it's going to convert them into our signatures, as we described before. And we see here what I meant: we have 20 minhash vectors, which means we have a length of 20 for each signature.
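Those cleaner-format functions aren't shown in the transcript; a minimal sketch following the same logic might look like this (function names such as create_hash_func and create_hash are assumptions):

```python
from random import shuffle

def create_hash_func(size: int) -> list:
    # one minhash "function" is just a random permutation of 1..size
    hash_ex = list(range(1, size + 1))
    shuffle(hash_ex)
    return hash_ex

def build_minhash_funcs(vocab_size: int, nbits: int) -> list:
    # build nbits permutations, one per position in the signature
    return [create_hash_func(vocab_size) for _ in range(nbits)]

minhash_funcs = build_minhash_funcs(len(vocab), 20)

def create_hash(vector: list) -> list:
    # for each permutation, take the first value whose position
    # lines up with a 1 in the one-hot vector
    signature = []
    for func in minhash_funcs:
        for i in range(1, len(vocab) + 1):
            if vector[func.index(i)] == 1:
                signature.append(i)
                break
    return signature

a_sig = create_hash(a_1hot)
b_sig = create_hash(b_1hot)
c_sig = create_hash(c_1hot)
```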

So what we see here are our dense vectors, and these are just compressed versions of our sparse vectors. And we can check that that is true by creating a Jaccard similarity function. Here we take x and y, both of which will be sets.

And we just return the length of the intersection between both of those, divided by the length of the union of both of those. So that is how you calculate Jaccard similarity. This should be a y here. Okay. And then if we do Jaccard on both of those, so we have a_sig and b_sig.

These will have to be converted into sets, I forgot. So like that. And then if we also take the Jaccard for, I think it's just a and b, right? So I'm going to copy that. Okay. So we get a low value for both of these. Now, if we look up here, I think a and b are not supposed to be very similar.

So that's fine. And then b and c should be similar. So if we swap this for c, and then c here, we should get higher values for both. And they should be roughly in the same ballpark. I mean, they're not perfect because we're using a very low number here. We're only using 20 values, and typically you'd use a lot more.
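As a quick sketch, the Jaccard comparison described here might look like this (using the a, b and a_sig, b_sig from earlier):

```python
def jaccard(x: set, y: set) -> float:
    # size of the intersection divided by size of the union
    return len(x.intersection(y)) / len(x.union(y))

# signature similarity vs the similarity of the original shingle sets
print(jaccard(set(a_sig), set(b_sig)), jaccard(a, b))
print(jaccard(set(b_sig), set(c_sig)), jaccard(b, c))
```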

But that's fine. So you can see that they're both aligned, right? So despite converting these into signature vectors, it still recognizes that they are reasonably similar. So that's good. That's what we want. Now, the final step in our whole LSH process is the LSH function itself.

So this is essentially what it does. So we have our signature over here, which we built using the steps that we just went through, which you can, you can see here. And from that signature, we take a certain number of equal length subvectors. So we define that using this here, this b.

So b is three. So that means we split our signature into three different subvectors, which we see over here. And ideally, what we want to be doing here is processing each of our subvectors through either a different hash function or the same hash function, just as long as we use that same hash function for the equivalent subvector in another signature, which you'll see in a moment; it'll make sense.

And, you know, once we have multiple signatures going together through those hash functions, you can see here that they're equivalent on both sides, hash one, hash one here. These can all just be a single hash function as well, which is what we're going to do. We're not really going to use a hash function.

And what we get here is three opportunities to identify these signatures as being a potential candidate pair, which is where we consider them for further similarity comparisons. In this case, the hash threes both collide down here. So we say, okay, that means that a and b are a candidate pair. I'm just going to write 'cand pairs'.

So this act of splitting our signatures up into multiple subvectors just gives us more opportunities to identify similarities, because if we were to hash the full vectors, the full vectors would have to be very similar for them to be put into the same hash bucket. With this, we only need part of them to be very similar.

So it increases the chances of us finding those similar signatures. So we're going to implement a very simple version of this. I'm going to keep this very simple. Here, we're just splitting our signature vector. So we pass in our signature and b, which is the number of bands. And the first thing we do is just make sure that our signature can be split into b bands equally.

So when we take the remainder after the division here, it must be equal to zero. And then we need to calculate r, the number of rows within each band, which is just the length of the signature divided by b. And then we initialize a subvector list.

And then we loop through and append subvectors. Really simple, simple implementation. And let's apply that to b and c. So we have said that we want 10 bands. So we only have 20 items or 20 numbers within our signature vectors. So obviously, we only get bands of two rows at a time.

And we should find that at least one of those matches. So what we do is loop through and say: if b_rows equals c_rows, break. And we find very quickly that there is a candidate pair there. So that means that b and c, the full vectors, would be considered a candidate pair.
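A minimal sketch of that band split and candidate check might look like this (split_vector and the band variables are assumed names):

```python
def split_vector(signature: list, b: int) -> list:
    # the signature must split into b equal-length bands
    assert len(signature) % b == 0
    r = len(signature) // b
    return [signature[i:i + r] for i in range(0, len(signature), r)]

band_b = split_vector(b_sig, 10)
band_c = split_vector(c_sig, 10)

# any single matching band is enough to make the full vectors a candidate pair
for b_rows, c_rows in zip(band_b, band_c):
    if b_rows == c_rows:
        print("Candidate pair")
        break
```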

Let's do the same for a. And we should find, okay, that for both a and b and a and c, they're not considered candidate pairs because there's just no similarity there. So that's good. That's exactly what we wanted to happen. That is our implementation of the traditional LSH approach.

Now, a few other things that we haven't covered but we should just touch on quickly. And you can find-- so there's an article link in the description which covers this. I walk through all of this. And there will also be a notebook where I'm getting these results from in the first place.

So you can also look at that. That includes the NumPy implementations of what we've just done, which is slightly more efficient, although not super efficient because I want it to still be readable. So what we have here is a visualization that shows the similarity, the cosine similarity, of our signature vectors and whether they were considered as candidate pairs or not.

So these up here, these are our candidate pairs. This is just a random sample. I think the actual full data set is really big, so running all of them through is super inefficient, because we're also running everything else through. That's why the visualization here only uses a sample. But if you run just LSH on it, it does work just fine.

So at the top there, we have our candidates. At the bottom, we have our non-candidates. We have similarities. So you can see that high similarity does correlate with them being classified as candidate pairs, which is good. It's obviously what we want. And there is this formula that I did not write down, which I should have done, which is P = 1 - (1 - s^r)^b, where s is our similarity down here, r is the number of rows in each band, and b is the number of bands.

Now, that correlates to this line here, this probability. Obviously, it's P, capital P. So that's where it's coming from. And if we run this with different similarity values, this is the pattern that we get. And obviously, that correlates, you can see, with whether something is classified as a candidate pair or not.
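As a rough sketch, that probability can be computed like this (a signature length of 100, so r = 100 / b rows per band, is assumed here just for illustration):

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    # P = 1 - (1 - s^r)^b
    return 1 - (1 - s ** r) ** b

# e.g. a signature of length 100 split into b = 20 bands of r = 5 rows each
for s in [0.2, 0.4, 0.6, 0.8]:
    print(s, round(candidate_probability(s, r=5, b=20), 3))
```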

And what we can do is we can modify b to push the number of candidate pair classifications either up or down. So here, we have different b values. At the side, we have black, which is 50. Then we go 25, 20, which is what we used before, and 5.

So let's say we found that we're not identifying enough candidate pairs. We could push that down a little bit, maybe not too much. So we could change b from 20 to 25. And if we do that, we see this. So in green, you have our old results and our old probability line.

And then in blue and pink, we have the new ones again, or blue and magenta. So what we see here is we've pushed that down. So we've changed b to 25. And now we're returning more results. So over here, we have these, for example, which we're not returning before.

And there are also more values in here as well. And there are fewer values down here. So that's the result of us modifying b. So we can visualize that. So if we increase b, we move it in this direction, which increases the number of candidate pairs, which also increases the number of false positives that we're going to return.

This line, by the way, is our threshold. It's a similarity threshold. This is basically where we want the cutoff to be between things being identified as candidate pairs and not candidate pairs. It's like our target, almost. Or if we wanted to reduce the number of candidate pairs, because maybe we're getting too many false positives, we can push it this way, which will result in fewer candidate pairs, but also in more false negatives.

So non-candidate pairs where we should have candidate pairs. So it's just a case of balancing both of those. But that's everything for this video. I hope it's been useful. And I will see you in the next one.