
IndexLSH for Fast Similarity Search in Faiss



00:00:00.000 | Hi and welcome to the video. We're going to be covering how we can implement LSH in Faiss.
00:00:07.440 | Now we covered LSH and how it works using random projection in a previous video which I'm going to
00:00:15.360 | release at the same time as this video so I'll link to that in the description if you want to
00:00:20.480 | understand how LSH works. But in this video we're just going to cover how we implement it in Faiss.
00:00:25.840 | So we're going to be using the SIFT1M dataset, which you can download using a script, I'll put a link
00:00:34.960 | to the script in the description, but it's just a big dataset of dense vectors. Now from this we'll
00:00:43.440 | get our 1 million samples xb, and we also have our queries, but we're just going
00:00:52.480 | to use one query at a time. Okay, so we'll have one query, 1 million samples, and we perform our search
00:00:59.360 | using that. So what we first want to do is import Faiss. I'm going to get my dimensionality, which
00:01:07.520 | is xb.shape[1], d, yep, 128, so that's the dimensionality of each vector
00:01:17.440 | within the SIFT1M dataset. And initially let's use an nbits value of 4, which gives us, if we
00:01:24.880 | think about it, how many buckets? We can work it out as 2 to the power of nbits, so it gives us
00:01:31.840 | a total of 16 buckets.
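
As a rough sketch of that setup (the SIFT1M loading script isn't shown in the transcript, so random vectors of the same shape stand in for the real xb and xq here):

```python
import numpy as np
import faiss

# stand-in data with SIFT1M's shape; swap in the real vectors
# loaded from the dataset files
xb = np.random.random((1_000_000, 128)).astype('float32')  # database
xq = np.random.random((1, 128)).astype('float32')          # single query

d = xb.shape[1]    # 128, the dimensionality of each vector
nbits = 4          # length of the binary codes
print(2 ** nbits)  # 16 potential buckets
```
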
00:01:39.840 | These are only potential buckets; our vectors aren't necessarily going to be separated between all of them, we might end up only using like 10 of those
00:01:45.680 | buckets, for example. Okay, and then what we want to do is initialize our index, so faiss.IndexLSH,
00:01:53.200 | and in here we need to pass our dimensionality and also our nbits value, so how many binary values
00:02:02.320 | we're going to include within each bucketed item. So, first thing, it's always good to check:
00:02:12.240 | do we need to train the index? index.is_trained is True, so that means we don't need to train it;
00:02:17.920 | there are no optimizations or anything we need to do here, so that's good, we don't need to
00:02:21.520 | worry about that. And then, without needing training, all we do is index.add and we add our
00:02:31.360 | data. Okay, and now we're actually just ready to go ahead and search, so we do D, I = index.search with
00:02:41.680 | xq, and let's set k up here actually, so let's set k equal to 10. Okay, so we pass k,
00:02:53.680 | and then we get back what are apparently the most similar indexes to our query.
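
Put together, those steps look something like this (continuing from the setup sketch above):

```python
# initialize the LSH index; d is the vector dimensionality and
# nbits is the number of bits in each binary code
index = faiss.IndexLSH(d, nbits)
print(index.is_trained)  # True, so no training step is needed

index.add(xb)  # add all of the database vectors

k = 10
D, I = index.search(xq, k)  # D: Hamming distances, I: neighbor indexes
print(I)
```
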
00:03:05.200 | And there's one good thing to check here, which makes it really easy to figure out whether we have
00:03:10.640 | enough buckets or not, and that is D, the distances to those vectors. Now obviously if we have
00:03:18.080 | all of our vectors in a single bucket, and we search and our query ends up in the same bucket,
00:03:26.080 | all of the distances here are going to be zero, because they're all going to be exactly the same
00:03:32.800 | Hamming distance, which is what we're returning. So let's look, and that's what we get. I mean, we have
00:03:39.120 | very few numbers here, so that's not that surprising. Let's push k up to 10K; you'd
00:03:47.200 | think at this point, you know, surely we'd get some ones, but no. So at least the first 10,000
00:03:55.200 | nearest neighbors are all within one bucket, so clearly there's an issue here: we can't
00:04:02.400 | differentiate between all of those. But if we just increase this to a very high number, let's say
00:04:10.240 | 500,000, so this is returning half of the index at this point, we should, I imagine, see a few
00:04:17.840 | ones. So at some point within the first 500,000 items we do actually move from a single bucket
00:04:27.520 | to at least one other bucket. Now if we think about it, we can actually estimate roughly
00:04:36.320 | what nbits value we need in order to get a better distribution between our buckets,
00:04:45.840 | so as to not have everything in a single bucket. So let's try these: we have for nbits in these
00:04:52.880 | values, let's give it a go, see what we get. And we see, okay, nbits equals 2, so this gives us an
00:05:00.320 | average of 250,000 items within each bucket, so obviously that's not useful. Here we get 62K,
00:05:10.000 | again not useful, then 4K, then 15, and that's a lot better, right, but it's still not useful, because we kind of want,
00:05:19.840 | you know, if we set k equal to 10, this isn't really going to give us anything particularly
00:05:27.040 | useful. And then we also need to consider the fact that not all of our buckets are
00:05:34.320 | going to get used, and what we will find is that there will be a
00:05:39.120 | minority of buckets that end up with the majority of vectors. So we
00:05:48.880 | need to go sub-one for this to work, so I would say probably an nbits of 24; we're
00:05:58.560 | getting, I mean, that value actually looks pretty good, 0.05, so on average we should get
00:06:04.080 | 0.05 items within each bucket. Okay, so let's test that.
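
Those per-bucket averages come from a loop along these lines (a sketch; the exact list of nbits values tried in the video is an assumption):

```python
# average number of vectors per bucket if they were spread evenly
for nbits in [2, 4, 8, 16, 24, 32]:
    print(f"nbits={nbits}: {xb.shape[0] / 2**nbits} vectors per bucket")
```
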
00:06:14.640 | So we're going to copy this, and we'll just write similar code to what we wrote before, although
00:06:20.960 | one thing before we do that: let's just check the average similarity. So change k to
00:06:28.480 | like 100 or 10, let's go with 10, yeah, let's do 10. And what I want to do, given that, is
00:06:39.440 | calculate the average cosine similarity. So we'll do that: we use from
00:06:46.960 | sklearn.metrics.pairwise import cosine_similarity; this is just a super quick way for us to calculate it.
00:06:56.320 | And then, checking sklearn, we'll do cosine_similarity, and we pass I[0]
00:07:15.760 | and xq. Oh, sorry: I just contains the indexes,
00:07:29.200 | so we need to pull the actual vectors out of xb. Okay, and we have this. We also need to take the mean
00:07:37.360 | of that, so cos.mean(), and we get 0.44, which is just the average of those, the average similarity,
00:07:48.960 | so it's pretty low. Let's compare that to if we did the same for the whole dataset,
00:07:56.480 | like this, see what we get. Okay, so actually these are, yeah, I mean, these are slightly more similar,
00:08:06.960 | so we do, you know, get slightly better results even with that terrible
00:08:12.400 | sort of bucketing.
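
That similarity check, as a sketch; the second print is the whole-dataset baseline (note it computes a million similarities, so it takes a moment):

```python
from sklearn.metrics.pairwise import cosine_similarity

# search again with k=10 and measure how similar the returned
# neighbors actually are to the query
k = 10
D, I = index.search(xq, k)
cos = cosine_similarity(xb[I[0]], xq)
print(cos.mean())  # ~0.44 in the video

# baseline: mean similarity between the query and the whole dataset
print(cosine_similarity(xb, xq).mean())
```
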
00:08:20.720 | So what I want to do is repeat that process we just did there, but for different nbits values. So we need to take our index, bring it down here, and we're
00:08:31.120 | going to say: we do index.add with our data, and then we do D, I = index.search, as we
00:08:46.480 | usually do, with k as 10, and then we want to do cos = cosine_similarity; again we're going to do
00:08:56.400 | xb[I[0]], so we're pulling out those indexes that we've got, and we're doing this for different nbits values,
00:09:03.360 | hence the loop, and xq. Okay, and then what we're going to do is just print
00:09:11.200 | nbits is equal to whatever it's equal to at that point,
00:09:20.960 | and then on a new line I'm going to put the similarity, so the similarity is equal to
00:09:29.520 | cos.mean(), like that. And let's just see what we get.
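
The loop just described, as a sketch (same assumed nbits list as before):

```python
# rebuild the index and rerun the query for each nbits value,
# reporting the mean cosine similarity of the returned neighbors
for nbits in [2, 4, 8, 16, 24, 32]:
    index = faiss.IndexLSH(d, nbits)
    index.add(xb)
    D, I = index.search(xq, k)
    cos = cosine_similarity(xb[I[0]], xq)
    print(f"nbits={nbits}\nsimilarity = {cos.mean()}")
```
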
00:09:39.840 | Okay, cool. So we look here, and nbits of 2 is terrible; it gets better and better, and then at 24, so that bit where
00:09:48.720 | we had the 0.05, we get the maximum value, so we get this 0.73 here, 0.74, which is the best we
00:09:56.960 | get, and then it kind of tails off a little bit here. And I've visualized that here, so
00:10:02.320 | this is the same sort of structure; I've gone a little bit higher with nbits as well,
00:10:08.160 | and this is essentially the sort of structure we get. So the bit that we just saw
00:10:15.040 | was this section here, where we're increasing nbits and heavily reducing
00:10:22.640 | the number of vectors within each bucket on average, and the cosine
00:10:30.320 | similarity just shoots up really quickly. But then afterwards we still get this kind of slow increase,
00:10:39.360 | and that's because the vectors are now spread
00:10:47.040 | very nicely between different buckets, but as we increase the resolution even further and further,
00:10:53.520 | all it means is that we're adding more of the original positional
00:10:59.120 | information of those vectors, essentially improving the resolution of those vectors, so
00:11:03.600 | they're just getting slightly more accurate in whether the hashed buckets are closer to other
00:11:12.080 | very similar or not-so-similar vectors. Essentially, the binary vectors more closely represent
00:11:20.160 | the original dense vectors, and that's why we see that slight increase that continues even after we
00:11:28.960 | get to an nbits of 32. Now, one final thing I just wanted to quickly cover is: say we do have
00:11:38.640 | issues and we want to visualize the actual buckets themselves, you know, how can
00:11:44.800 | we extract those buckets and view the number of vectors that we have in each one? Now,
00:11:51.440 | there isn't really any way of doing this in Faiss, so we kind of have to put our own
00:12:00.400 | code together. So what we need to do is, I'm going to first create another index,
00:12:06.080 | because I want to visualize these. So, what if I visualize it with an nbits
00:12:13.600 | of 32? How many buckets would we have there? It would be
00:12:20.320 | 2 to the power of 32, so, yeah, we can't visualize that many buckets. So we're going to go with
00:12:25.920 | an nbits of 4, which gives us 16 buckets. So IndexLSH: we have d, and we're going to use
00:12:36.320 | an nbits value of 4, so nbits equals 4, just put that there.
00:12:44.160 | Okay, we're going to add our vectors, and now what we want to do is actually extract
00:12:55.200 | those binary vectors themselves. So to do that, we're going to store them in
00:13:02.480 | the array variable, and we do faiss.vector_to_array, like this, and in here we pass index.codes.
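
Those steps as a sketch:

```python
# rebuild a small index so the bucket space is easy to inspect
nbits = 4
index = faiss.IndexLSH(d, nbits)
index.add(xb)

# extract the stored binary codes as a flat numpy array; with
# nbits=4, each vector's code fits in a single uint8 value
array = faiss.vector_to_array(index.codes)
print(array.min(), array.max())  # 0 and 15 -> 2**4 = 16 possible buckets
```
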
00:13:14.560 | So 'codes', just that word there, essentially means, you know, the binary vectors that we've
00:13:22.720 | built; they are called codes, that's all it means, and we use this word a lot in similarity
00:13:28.800 | search, so it's worth remembering. Also, nbits is used a lot as well, so in future videos
00:13:35.840 | in this series you will see them quite a lot. So then, let's have a look at what we have:
00:13:43.120 | okay, so we wanted to pull out the binary codes, and we're getting these numbers, which is
00:13:51.360 | not exactly what we expect, or is it? Right, well, okay, let's have a look at the min value
00:13:58.400 | here: 0, okay; and the array max: okay, so we get 0 to 15, which means we have 16 values. And okay, let's do
00:14:09.520 | 2 to the power of nbits: okay, so that means we have 16 unique buckets that all of our vectors
00:14:16.720 | are spread between. So that's interesting; let's see how many vectors we have: we have 1 million,
00:14:25.200 | so we have 1 million spread between 16 buckets, so actually that sounds about right. But how do we
00:14:30.480 | read these? Well, these values here are just the integer versions of our binary vectors, so,
00:14:39.920 | like, this is represented by a 0 and this is represented by a 1, right? So what we can do is use
00:14:53.360 | this little script here to convert them into the actual binary vectors. So here, this
00:15:00.880 | value is our first one, which is a 5, and the next one is a 12, which you see here.
00:15:08.240 | So those are our binary vectors.
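
The conversion script itself isn't shown in the transcript; one way to expand each integer code into its nbits binary digits (least-significant bit first) is:

```python
# expand each uint8 code into its nbits individual bits
bits = (array[:, None] >> np.arange(nbits)) & 1
print(array[0], bits[0])  # e.g. 5  -> [1 0 1 0]
print(array[1], bits[1])  # e.g. 12 -> [0 0 1 1]
```
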
00:15:16.480 | And we can visualize how they're spread, how many items are within each one of those buckets, using this array, and what we'll
00:15:23.520 | get is something that looks like this.
00:15:30.000 | So this is what I mean: we have 16 buckets, which you can see on the left here, and not all of them even
00:15:38.800 | have any vectors in, so we have one, two, three, four that don't contain any vectors at all,
00:15:45.600 | and then the vast majority of vectors are contained within these and these, right? So these
00:15:53.760 | four buckets contain almost everything, and they have loads; the count
00:16:01.920 | is in the thousands there, so take that number and
00:16:06.400 | add a few more zeros onto the end. So, yeah, these four buckets have far too many
00:16:16.240 | vectors in them for this to be useful, and that's why, before, we
00:16:22.480 | increased our nbits value up to 24. Now, I think that's basically everything I wanted to
00:16:28.640 | cover, but just quickly, in terms of performance: I've visualized these using the Faiss
00:16:35.600 | LSH index, and we have nbits here, so obviously, as we increase nbits a lot,
00:16:43.360 | we increase the recall performance. So that's one thing about LSH: to get good performance
00:16:51.680 | you really have to increase the nbits value a lot, and at that point it can
00:16:57.120 | get really inefficient. So this is the time for each query as compared to a flat index.
00:17:04.560 | So obviously, I mean, you kind of want to be careful with an LSH index. Obviously it can be useful,
00:17:15.600 | but, you know, high dimensionality is difficult to balance well with speed and accuracy.
00:17:24.720 | But obviously, you know, it can be done; it just depends on how much accuracy you're willing to
00:17:31.840 | sacrifice, because in most cases with LSH you will be sacrificing a reasonable amount of accuracy. So
00:17:38.080 | the recall here is about 50 percent, so it's managing to identify around 50 percent
00:17:46.800 | of the same items that a flat index would identify, and it's only slightly faster.
00:17:54.320 | So, yeah, you just have to weigh up, you know, what is most important for your use case.
00:18:03.680 | But anyway, that's it for LSH. This is the last video we'll cover on LSH for now;
00:18:13.040 | in the next few videos and articles we're going to be covering a few different indexes
00:18:22.800 | and how we use them effectively, so it should be pretty interesting, but we're definitely getting a
00:18:30.320 | bit more advanced than LSH now. So it should be good. But that's it for this video; thank
00:18:36.160 | you very much for watching, and I'll see you in the next one.