back to index

Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)


Chapters

0:0 Intro
1:21 Overview
5:58 Shingling
8:45 Vocab
9:27 One-hot Encoding
11:10 MinHash
15:51 Signature Info
18:8 LSH
22:20 Tuning LSH

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi and welcome to the video. Today we're going to be covering another technique in
00:00:05.280 | similarity search called locality sensitive hashing or LSH. Now LSH is a hugely popular
00:00:13.680 | technique used in efficient similarity search. Now there are a huge number of companies that use
00:00:21.600 | similarity search. I mean you have big names like Google. I mean Google is built from similarity
00:00:27.760 | search and then you have Netflix, Amazon, Spotify. All of them are constantly recommending you
00:00:33.840 | different products, films, music and they do that by comparing you to other customers. So
00:00:41.520 | they are performing a similar search between you other customers and identifying the most similar
00:00:45.920 | ones. Now you have two approaches. You have exhaustive which is comparing all of the data
00:00:54.400 | points. I'm just going to call them vectors from now on because that's what we'll be using. So
00:00:58.240 | comparing all these vectors and obviously it's slow. Approximate search allows us to
00:01:06.080 | approximate those vectors, restrict our scope to a more relevant range of vectors
00:01:13.280 | and so on. So it covers a lot of different techniques. It's not just one technique here.
00:01:17.760 | The one we're going to be covering today is locality sensitive hashing. So at its core LSH
00:01:26.000 | is a hashing algorithm which attempts to maximize hash collisions. So
00:01:33.520 | what we see on screen right now is a dictionary, like a typical Python dictionary in the way that
00:01:42.080 | it hashes different items. So we have our keys which are items that we're hashing. We process
00:01:49.440 | them through a hashing function and that hashing function attempts to minimize hashing collisions,
00:01:57.600 | e.g. to not put keys in the same bucket. It wants every key to go to a separate bucket.
00:02:05.600 | And then these are connected. They don't contain the values but they're connected
00:02:10.080 | to the values that we relate back to our keys. So that's a Python dictionary. That's our Python
00:02:17.440 | dictionary. But we're not wanting to minimize collisions. We are wanting to maximize the
00:02:24.240 | collisions. So what we see here is a hashing function that maximizes those collisions. So
00:02:31.920 | this is essentially what LSH is doing. So we are attempting to, for any similar keys, so these here
00:02:41.040 | and these here, they're all similar enough for us to want to put them into the same bucket. So we
00:02:48.240 | put two of them into here and then the other three into this bucket. Now, there are quite
00:02:54.880 | a few different ways of doing this and there are a lot of different LSH methods. In fact, LSH is
00:03:01.760 | a very generic term that applies to a lot of different algorithms. And the one that we will
00:03:07.280 | be covering is what I see as the traditional version. So it's the original version of LSH.
00:03:14.400 | And what we'll be covering in this video is shingling, minhashing, and that LSH function.
00:03:22.560 | So we'll get to understand why very soon. So here is the overview of the process that we're going to
00:03:32.640 | be walking through. So we have shingling. So we have at the very start, we have this text. So
00:03:40.720 | flying fish flew by the space station. Now, that's just a string. And what we want to do is extract
00:03:50.400 | all of the unique pairs of text. So when we say shingling, it's K shingling. And in this case,
00:03:59.840 | our K value is two because we're taking two characters at once. If we were to take
00:04:04.720 | K equals four, for example, then we would take like pace and then move on. We'd take ace and a
00:04:13.200 | space and so on. So that's the shingling. And from that, we create a set. So if we have duplicate
00:04:23.120 | shingles, we remove those. So we just end up with one. So in this, I don't know if we do have any
00:04:31.360 | duplicates, but say maybe down here, we had IN again, because we also have it up here.
00:04:38.400 | We would end up with just a single IN in the set. We wouldn't have two. And then we one-hot encode
00:04:44.800 | those. So that means we take a vocabulary from all of our text. So not just this one sentence,
00:04:52.080 | but we'll have more than one sentence, obviously, that we're comparing. And we'll use that to build
00:04:56.560 | a one-hot vector from the vocab and our shingle set. Then we process that through something called
00:05:04.080 | a min hash function, which produces this dense vector or signature. So this thing down here,
00:05:12.560 | that is what's called a signature. And then we band that into this final bit here. This is our
00:05:26.080 | actual LSH process. So we band that vector into multiple sub-vectors, and then we hash them. So
00:05:37.280 | where we find that we have any two sub-vectors go to the same hash bucket, then that means that
00:05:44.160 | the full vector that they both come from is considered, or the two full vectors that they
00:05:51.040 | both come from are considered a candidate pair. And we take those and we then calculate some other,
00:05:56.560 | we calculate the similarity between them. Okay. So the first step in our process,
00:06:02.960 | like we discussed, is the shingling operation. So shingling is simply where we take a window
00:06:10.880 | of length K characters, and we simply move that down through our text, like you can see here.
00:06:18.800 | And from that, we create the shingle set. So in Python, what we would do to shingle these three
00:06:28.720 | sentences we have here is we'll create a shingle function here. And this is going to take some
00:06:36.000 | text, which is a string. And we're going to say, we're going to define the K values of number of
00:06:43.200 | characters we take within each window, which is obviously an integer. Now we initialize our
00:06:49.600 | shingle set here, we'll make a string initially. And then what we do is for i in range, and then
00:06:59.280 | here we want to go from the, or we want to go to the length of our text, minus K. So minus that
00:07:10.240 | window length, plus one, because we want to go right up to the end of that. And then here, all
00:07:16.000 | we do is shingle set.append. And then we write, so we have the text and we want to go from i up until
00:07:26.960 | i plus K. Okay, that's our shingle list, I suppose. And then we want to return a set. So this will
00:07:38.400 | remove any duplicates that we have. So shingle set. Okay, so that's our shingle function. And
00:07:46.160 | we just want to process each one of our sentences through that. So we'll go a equals shingle,
00:07:55.520 | a. Also, we need to define K, which can be two. I'll just define K here.
00:08:06.640 | Okay, and then let's have a look at what we have. And we see that we have this, it's shuffled,
00:08:12.560 | there's no order to our set here. And we see that we have all of the pairs of words in there.
00:08:21.440 | So we have S for the start of the space part here, or station actually, could be either.
00:08:30.480 | And if we try and find, okay, so here we have the very sorts of fly, or flying of the ly there as
00:08:37.600 | well, i n. So that's, that's our shingle set. And with this, we have all of our shingles.
00:08:46.480 | So the next step is to create our vocabulary, which is just all of our shingles,
00:08:54.000 | our shingle sets, a union together. So to create that, all we do is go a union, b.union.
00:09:03.600 | Like that.
00:09:06.720 | And we can see again, we have just a lot more text in there now, or a lot more,
00:09:15.120 | many more shingles. That is our vocab. So now we have our shingle set, and we have our vocab.
00:09:23.840 | So we can tick both of those off. Now what we need to do is create our one hot encoding
00:09:30.960 | over here. And the only other thing we need is a zero vector. So there's two, well, I mean,
00:09:39.200 | there's more than two ways to do this. But I think there's two ways of thinking about it. Normally,
00:09:43.760 | the more efficient way would be to create a numpy array full of zeros, and then just
00:09:49.680 | add the ones in where we have matches between our vocab and shingle set. But I'm not going to do
00:09:55.440 | that. I'm just going to keep things incredibly simple in the code that we're writing. So I'm
00:10:03.600 | going to do a, there's one hot. Or the one thing we should do is make this a list, because we want
00:10:10.640 | order in our vocab, and not have it shuffled. So the, what we do here is we say one for x in a,
00:10:24.080 | or sorry, no. One if x is in a, else zero for x in vocab. So what we're doing here is looping
00:10:38.640 | through the vocab, and every single shingle within there, we're saying, if that exists in our
00:10:45.600 | signature, make that point in our list a one. Otherwise, make it a zero. So that's simply our
00:10:55.680 | one hot encoding. So if we do a, b, c, and then we have a look at our a one hot, we see that we have
00:11:06.000 | this one hot encoded, or this sparse array. Now, min-hashing is the next step in our process,
00:11:14.960 | and it allows us to convert our, what are currently sparse vectors, into dense vectors,
00:11:20.240 | which we call signatures. Now, what you see here is a run-through of how we do this for maybe one
00:11:27.440 | signature. We want to do that for multiple signatures though, so we would actually run
00:11:32.400 | through this process multiple times. So what we're doing here is we're creating a randomly
00:11:37.680 | permuted array, which counts from one to the length of our vocab. And then what we are essentially
00:11:46.560 | doing, I know, so in this we're basically shuffling it, and then we're counting through until we find
00:11:52.080 | the first alignment to one within our vector. In reality, you just take all of your values,
00:12:00.000 | and you find a minimum one that aligns to one. So that's if you're using NumPy, which we'll see
00:12:07.280 | later on. I'll just show you the code, I'm not going to actually write all of it though.
00:12:11.520 | So in code, that would look something like this. So we would start with a list, which is the
00:12:22.000 | range from one to the length of our vocab. And if we have a look at that, we just see
00:12:28.320 | a count. We're going to shuffle that, so from random import shuffle,
00:12:35.600 | and we just do it like this. So it modifies it in place, so we don't need to do anything there. So
00:12:49.520 | let's view that. Okay, so now we've shuffled that, shuffled it twice now, but that's fine.
00:13:02.160 | And let's just loop through five of those. So four, we can loop through more. For i in range
00:13:12.320 | from one to 10, what we're going to say is I just want to print
00:13:17.200 | i, which aligns to the hash example index for that value. Okay, if we print that, we see
00:13:33.840 | so one, the value one, where is it? Here, is at index 85, two is at 53, and so on.
00:13:46.560 | And essentially what we're doing here is saying loop through these, identify this index, and
00:13:54.640 | this index in our one-hot vector, does it align to a one? You can see that here, we find the first
00:14:05.520 | one at eight. And that means that our signature value for this point is, or for this min hash
00:14:15.920 | vector and our one-hot sparse vector here, that signature value will be eight. And we repeat that
00:14:24.080 | for multiple min hash vectors, which is what you can see here. So if we were to work through this,
00:14:32.320 | so we start at one here, that does not align to a one. So we work up to two, and we find that it
00:14:40.880 | does align to a one. So that is why we have this here. And then we go on to this one here, we
00:14:48.720 | find one does not align, two still does not align, three does not align, and four does align. So
00:14:59.360 | then we assign a four in our min hash function. We go along and keep doing that to create our
00:15:08.000 | signature. Okay, so if we, I'm going to use these functions here, it's just what we wrote before,
00:15:13.920 | but put into a cleaner format. And what I'm going to do is create 20 min hash vectors,
00:15:20.400 | run that. And then here we are going to run each of our one-hot sparse vectors through our create
00:15:32.240 | hash function, which is here. And it's going to convert them into our signatures as we described
00:15:38.080 | before. And we see here that we have also what I meant. So here we have 20 min hash vectors,
00:15:45.760 | which means we have a length of 20 for each signature. So what we see here are our dense
00:15:55.680 | vectors. And these are just compressed versions of our sparse vectors. And we can check that that
00:16:04.080 | is true by, we'll define a, we'll create a Jaccard similarity function. So we take,
00:16:12.080 | and here we take x and y, both will be sets. And we just return the length of the intersection
00:16:21.520 | between both of those. So the intersection between those divided by the union of both of those. So
00:16:29.600 | that is how you calculate Jaccard similarity. This should be a y.
00:16:39.120 | Okay. And then if we do Jaccard on both of those, so we have a sig, b sig.
00:16:52.240 | These will have to be converted into sets, I forgot. So like that.
00:16:57.680 | And then if we also take the Jaccard for, I think it's just a and b, right?
00:17:08.880 | So I'm going to copy that. Okay. So we get, this is 0.6 and this is 1.4. Now, if we look up here,
00:17:22.400 | I think it's a and b are not supposed to be very similar. So that's fine. And then b and c should
00:17:28.320 | be similar. So if we swap this for c and then c here, we should both get higher values. And they
00:17:37.520 | should be roughly in the same ballpark. I mean, they're not perfect because we're using a very
00:17:41.920 | low number here. We're only using 20 values and typically use a lot more.
00:17:48.960 | But that's fine. So you can see that they, they're both aligned, right? So despite converting these
00:17:56.720 | into the signature vectors, it recognizes that they are pretty similar. And converting these
00:18:03.360 | into signature vectors, it still recognizes that they are reasonably similar. So that's good.
00:18:09.600 | That's what we want. Now, the final step in our whole LHS process is the LHS function itself.
00:18:19.760 | So this is essentially what it does. So we have our signature over here, which we built
00:18:28.880 | using the steps that we just went through, which you can, you can see here. And from that signature,
00:18:34.880 | we take a certain number of equal length subvectors. So we define that using this here,
00:18:43.440 | this b. So b is three. So that means we split our signature into three different subvectors,
00:18:49.040 | which we see over here. And ideally, what we want to be doing here is saying, okay,
00:18:59.680 | we process our subvectors each through a, either a different hash function, or it can be the same
00:19:06.320 | hash function, just as long as we use that same hash function for the equivalent subvector in
00:19:11.920 | another signature, which you'll see in a moment, it'll make sense. And, you know, once we have
00:19:17.920 | multiple signatures going together through those hash functions, you can see here that they're
00:19:22.240 | equivalent on both sides, hash one, hash one here. These can all just be a single hash function as
00:19:28.080 | well, which is what we're going to do. We're not really going to use a hash function. And what we
00:19:34.000 | get here is three opportunities to identify these signatures as being potential candidate pairs,
00:19:41.760 | which is where we consider it for further similarity comparisons. In this case,
00:19:49.040 | hash threes both collide down here. So we say, okay, that means that a and b are candidate pairs.
00:19:59.360 | I'm just going to put canned pairs. So this active of splitting our signatures up into
00:20:08.400 | multiple subvectors just gives us more opportunities to identify similarities,
00:20:12.160 | because if we were to use the full vectors, the full vector would have to be very similar
00:20:19.600 | for them to be put into the same hash bucket. With this, we only part of it to be very similar. So
00:20:26.560 | increases the chances of us finding those similar signatures. So we're going to implement a very
00:20:32.720 | simple version of this. I'm going to keep this very simple. Here, we're just splitting our
00:20:38.320 | signature vector. So we add our signature and b, which is the number of bands.
00:20:42.800 | And the first thing we do is just make sure that our signature can be split into b bands
00:20:48.320 | equally. So where we take the remainder after the division here, it must be equal to zero.
00:20:56.640 | And then we say, we need to calculate the rows. So the number of rows within each band,
00:21:03.040 | we should just see the length of the signature divided by b. And then we initialize a subvector
00:21:09.520 | array or list. And then we loop through and append subvectors. Really simple, simple
00:21:18.160 | implementation. And let's apply that to b and c. So we have said that we want 10 bands. So
00:21:29.760 | we only have 20 items or 20 numbers within our signature vectors. So obviously, we only get
00:21:39.760 | bands of two rows at a time. And we should find that at least one of those match. So what we do
00:21:47.840 | is we loop through and we say, if b rows equals c rows, break. And we find very quickly that there
00:21:57.040 | is a candidate pair there. So that means that b and c, the full vectors, would be considered as
00:22:03.040 | a candidate pair. Let's do the same for a. And we should find, OK, so for both a and b and a and c,
00:22:12.720 | it's not considered a candidate pair because there's just no similarity there. So that's good.
00:22:18.320 | That's exactly what we wanted to happen. That is our implementation of this. So the LSH,
00:22:26.080 | traditional LSH approach. Now, a few other things that we haven't covered but we should
00:22:35.440 | just touch on quickly. And you can find-- so there's an article link in the description
00:22:41.840 | which covers this. I walk through all of this. And there will also be a notebook where I'm getting
00:22:47.920 | these results from in the first place. So you can also look at that. That includes the NumPy
00:22:52.800 | implementations of what we've just done, which is slightly more efficient, although not super
00:22:59.680 | efficient because I want it to still be readable. So what we have here is a visualization that shows
00:23:08.080 | the similarity, the cosine similarity, of our signature vectors and whether they were considered
00:23:14.960 | as candidate pairs or not. So these up here, these are our candidate pairs. This is just
00:23:25.840 | a random sample. I think the actual full data set is really big. So running this, all of them,
00:23:31.360 | is super inefficient because we're also running everything else through. So I can actually
00:23:36.640 | have the visualization here. But if you run just LSH on it, it does work just fine.
00:23:43.600 | So at the top there, we have our candidates. At the bottom, we have our non-candidates. We
00:23:49.280 | have similarities. So you can see that high similarity does correlate with them being
00:23:52.880 | classified as candidate pairs, which is good. It's obviously what we want. And there is this
00:23:59.600 | formula that I did not write down, which I should have done, which is p equals 1 minus 1 minus s,
00:24:09.120 | which is our similarity down here, to the power of r, which is the number of rows in each band,
00:24:14.960 | and all of this to the power of b, which is the number of bands. Now, that correlates to this
00:24:23.360 | line here, this probability. Obviously, it's P, capital P. So that's where it's coming from.
00:24:29.920 | And if we run this with different similarity values, this is the pattern that we get.
00:24:37.840 | And obviously, that correlates, you can see, with whether something is classified as a candidate
00:24:43.680 | pair or not. And what we can do is we can modify b to push the number of candidate pair
00:24:54.240 | classifications either up or down. So here, we have different b values. At the side, we have
00:25:01.360 | black, which is 50. Then we go 25, 20, which is what we used before, and 5. So let's say we found
00:25:10.000 | that we're not identifying enough candidate pairs. We could push that down a little bit. Maybe we
00:25:14.960 | don't do too much. So we could change b from 20 to 25. And if we do that, we see this. So in green,
00:25:22.880 | you have our old results and our old probability line. And then in blue and pink, we have the new
00:25:30.480 | ones again, or blue and magenta. So what we see here is we've pushed that down. So we've changed
00:25:40.400 | b to 25. And now we're returning more results. So over here, we have these, for example,
00:25:48.960 | which we're not returning before. And there are also more values in here as well. And there are
00:25:54.720 | less values down here. So that's the result of us modifying b. So we can visualize that. So if we
00:26:05.600 | increase b, we move it in this direction, which increases the number of candidate pairs,
00:26:17.680 | which also increases the number of false positives that we're going to return. This
00:26:22.320 | line, by the way, is our threshold. It's a similarity threshold. This is basically where
00:26:26.480 | we want the cutoff to be between things being identified as candidate pairs and not candidate
00:26:31.440 | pairs. It's like our target, almost. Or if we wanted to reduce the number of candidate pairs,
00:26:39.440 | because maybe we're getting too many false positives, we can push it this way, which will
00:26:45.360 | result in less candidate pairs, but also results in more false negatives. So non-candidate pairs
00:26:55.200 | where we should have candidate pairs. So it's just a case of balancing both of those. But that's
00:27:00.640 | everything for this video. I hope it's been useful. And I will see you in the next one.