
IndexLSH for Fast Similarity Search in Faiss


Transcript

Hi, and welcome to the video. We're going to be covering how we can implement LSH in Faiss. We covered LSH and how it works using random projection in a previous video, which I'm releasing at the same time as this one, so I'll link to it in the description if you want to understand how LSH works.

But in this video we're just going to cover how we implement it in Faiss. We're going to be using the SIFT1M dataset, which you can download with a script I'll link in the description; it's just a big dataset of dense vectors. From this we get our 1 million samples, xb, and we also have our queries, xq, of which we're going to use one at a time.
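As a rough sketch, loading the data might look something like this; the .fvecs reader and the file names (sift_base.fvecs, sift_query.fvecs) are my assumptions, not something shown in the video:

```python
import numpy as np

def read_fvecs(path):
    # .fvecs stores each vector as an int32 dimension followed by
    # that many float32 components
    a = np.fromfile(path, dtype='float32')
    d = a.view('int32')[0]
    return a.reshape(-1, d + 1)[:, 1:].copy()

xb = read_fvecs('sift_base.fvecs')       # 1M database vectors
xq = read_fvecs('sift_query.fvecs')[:1]  # a single query vector
```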

Okay, so we have one query and 1 million samples, and we perform our search using those. The first thing we want to do is import Faiss. Then I'll get my dimensionality, which is xb.shape[1], giving d = 128; that's the dimensionality of each vector within the SIFT1M dataset. Initially let's use an nbits value of 4, which, if we think about how many buckets that gives us, works out to a total of 2^4 = 16 buckets.
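In code, that step looks something like this (a sketch following the transcript):

```python
import faiss

d = xb.shape[1]    # 128, the dimensionality of each SIFT1M vector
nbits = 4
print(2 ** nbits)  # 16 potential buckets
```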

These are only potential buckets, though; our vectors aren't necessarily going to be separated across all of them, and we might end up using only, say, 10 of those buckets. Then we want to initialize our index with faiss.IndexLSH, passing in our dimensionality and our nbits value, i.e. how many binary values each hashed vector will contain.
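Something like:

```python
# IndexLSH takes the vector dimensionality and the number of
# binary values (hash bits) used to encode each vector
index = faiss.IndexLSH(d, nbits)
```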

The first thing it's always good to check is whether we need to train the index. index.is_trained is True, which means we don't need to train it; there are no optimizations we need to perform here, so that's good, we don't need to worry about it. And with no training needed, all we do is call index.add with our data.
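That is:

```python
print(index.is_trained)  # True -> no training step needed for LSH
index.add(xb)            # add all 1M SIFT1M vectors
```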

Okay, and now we're ready to go ahead and search, so we do D, I = index.search(xq, k); let's set k = 10 up here first. I then contains what are, according to the index, the indexes of the vectors most similar to our query.
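So:

```python
k = 10
D, I = index.search(xq, k)
print(I)  # indexes of the k most similar vectors in xb
print(D)  # the corresponding Hamming distances
```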

There's one good thing to check here, which makes it really easy to figure out whether we have enough buckets or not: D, the distances between those vectors. If we have all of our vectors in a single bucket and our query ends up in that same bucket, all of the distances here will be zero, because they'll all be at exactly the same Hamming distance, which is what IndexLSH returns. Let's look, and that's exactly what we get.

We're only looking at a few numbers here, so that's not that surprising. Let's push k up to 10,000; you'd think at this point we'd surely get some ones, but no. So at least the first 10,000 nearest neighbors are all within one bucket, and there's clearly an issue here: we can't differentiate between any of them. But if we increase k to a very high number, say 500,000, which means returning half of the index, we finally see a few ones. So somewhere within the first 500,000 items we do move from a single bucket into at least one other bucket.
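A sketch of that check (the k values follow the transcript; the count_nonzero framing is mine):

```python
# with nbits=4, even a very large k returns mostly zero distances
D, _ = index.search(xq, 500_000)
print(np.count_nonzero(D))  # only now do a few non-zero distances appear
```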

Now, if we think about it, we can actually estimate roughly what nbits value we need in order to get a better distribution between our buckets, i.e. to avoid having everything land in a single bucket. So let's loop over a few nbits values and see what we get. With nbits = 2 we get an average of 250,000 items within each bucket, so obviously that's not useful.
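A sketch of that estimate; the exact list of nbits values is my guess at what the video loops over:

```python
for nbits in [2, 4, 8, 16, 24, 32]:
    # average number of vectors per bucket if they were spread evenly
    print(f"nbits={nbits}: {xb.shape[0] / 2**nbits} vectors per bucket")
```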

Here we get around 62,000, again not useful; then around 4,000; then around 15. That's a lot better, but it's still not useful, because if we set k equal to 10 this isn't really going to give us anything particularly meaningful. We also need to consider the fact that not all of our buckets are going to get used; what we'll find is that a minority of buckets end up with the majority of vectors. So we need the average to go sub-one for this to work, and I'd say probably nbits of 24; that value actually looks pretty good at 0.05, so on average we should get 0.05 items within each bucket.

So let's test that. We're going to write similar code to what we wrote before, but one thing first: let's check the average similarity. Change k to 10, and what I want to do is calculate the average cosine similarity of the results. We'll use from sklearn.metrics.pairwise import cosine_similarity, which is just a super quick way for us to calculate it. Note that I just contains the indexes, so we need to pull the actual vectors out of xb with xb[I[0]], compare them to xq, and take the mean. We get a mean cosine similarity of 0.44, which is pretty low. Compare that to the same calculation over the whole dataset, and our returned vectors are only slightly more similar than the dataset-wide average; so even with that terrible bucketing we do get slightly better results than random.

What I want to do now is repeat that process for different nbits values. We take our index initialization and bring it down into a loop: for each nbits value we initialize the index, add our data with index.add, run D, I = index.search as we usually do with k = 10, compute cos = cosine_similarity(xb[I[0]], xq) to score the returned vectors against the query, and then print the nbits value and, on a new line, the mean similarity.
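A sketch of that loop, again with my guess at the nbits values tested:

```python
from sklearn.metrics.pairwise import cosine_similarity

k = 10
for nbits in [2, 4, 8, 16, 24, 32]:
    index = faiss.IndexLSH(d, nbits)
    index.add(xb)
    D, I = index.search(xq, k)
    # pull the returned vectors out of xb and compare them to the query
    cos = cosine_similarity(xb[I[0]], xq)
    print(f"nbits={nbits}\nsimilarity: {cos.mean():.2f}")
```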
Looking at the results, nbits of 2 is terrible; it gets better and better, and then at 24, the point where we had that 0.05 average, we get the maximum value, around 0.73 or 0.74, which is the best we get; after that it tails off a little. I've visualized that here, going a bit higher with nbits as well, and this is essentially the structure we get. The part we just saw is the section where increasing nbits heavily reduces the average number of vectors within each bucket, and the cosine similarity shoots up really quickly. Afterwards we still get a slow increase, and that's because the vectors are already spread very nicely between different buckets; as we increase the resolution even further, all it means is that we're adding more of the original positional information of those vectors, so the binary codes represent the original dense vectors more faithfully. That's why we see that slight increase continue even after we get to nbits of 32.

Now, one final thing I wanted to quickly cover: say we do have issues and we want to visualize the actual buckets themselves. How can we extract those buckets and view the number of vectors in each one? There isn't any built-in way of doing this in Faiss, so we have to put our own code together. I'm going to create another index first. If I visualized it with nbits of 32, how many buckets would we have? It would be 2^32, so we can't visualize that many; instead we'll go with nbits of 4, which gives us 16 buckets: faiss.IndexLSH(d, 4). We add our vectors, and now we want to extract the binary vectors themselves. To do that we store them in an array variable using faiss.vector_to_array(index.codes). The word "codes" just means the binary vectors we've built; it's used a lot in similarity search, so it's worth remembering, as is nbits, and you'll see both quite often in future videos in this series.

Let's have a look at what we have. We wanted to pull out the binary codes, and instead we're getting plain integers, which is not exactly what we expect. Or is it? The min value is 0 and the max is 15, so we get 16 distinct values, and 2^nbits = 16, which means we have 16 unique buckets that all of our vectors are spread between. We have 1 million vectors spread between 16 buckets, so that actually sounds about right. These values are just the integer versions of our binary vectors, so we can use a little script to convert them into the actual binary vectors; the first value is a 5 and the next is a 12, and those are our binary codes. Using this array, we can visualize how the vectors are spread, i.e. how many items are within each one of those buckets.
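Putting that together (np.binary_repr and np.bincount here are my stand-ins for the "little script" mentioned in the video):

```python
index = faiss.IndexLSH(d, 4)  # nbits=4 -> 2**4 = 16 possible buckets
index.add(xb)

# the packed binary codes; with nbits=4 each code fits in one uint8
arr = faiss.vector_to_array(index.codes)
print(arr.min(), arr.max())  # 0 15 -> 16 unique bucket ids

# integer codes -> binary vectors, e.g. 5 -> '0101', 12 -> '1100'
print([np.binary_repr(v, width=4) for v in arr[:2]])

# count how many vectors landed in each of the 16 buckets
print(np.bincount(arr, minlength=16))
```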
What we get looks like this: we have the 16 buckets, which you can see on the left, and not all of them contain any vectors at all; there are four that are completely empty. Meanwhile, the vast majority of vectors are contained within just four buckets, with counts running into the hundreds of thousands. Those four buckets hold far too many vectors for the index to be useful, and that's why, earlier, we increased our nbits value up to 24.

I think that's basically everything I wanted to cover, but quickly, in terms of performance: I've visualized this using the Faiss LSH index, and as we increase nbits we increase the recall performance. That's one thing about LSH: to get good performance you really have to increase the nbits value a lot, and at that point it can get really inefficient. This chart shows the time for each query compared to a flat index. So you do want to be careful with an LSH index. It can be useful, but high dimensionality is difficult to balance well with speed and accuracy. It can be done; it just depends on how much accuracy you're willing to sacrifice, because in most cases with LSH you will be sacrificing a reasonable amount. The recall here is about 50 percent, meaning it manages to identify around half of the same items that a flat index would identify, while being only slightly faster. So you have to weigh up what is most important for your use case.

Anyway, that's it for LSH. This is the last video we'll cover on LSH for now; in the next few videos and articles we're going to be covering a few different indexes and how to use them effectively, so it should be pretty interesting, and we're definitely getting a bit more advanced than LSH now. But that's it for this video. Thank you very much for watching, and I'll see you in the next one.