LSH for Fast Similarity Search in Faiss
00:00:00.000 |
Hi, and welcome to the video. We're going to be covering how we can implement LSH in Faiss. 00:00:07.440 |
Now, we covered LSH and how it works using random projection in a previous video, which I'm going to 00:00:15.360 |
release at the same time as this one, so I'll link to that in the description if you want to 00:00:20.480 |
understand how LSH works. But in this video, we're just going to cover how we implement it in Faiss. 00:00:25.840 |
So, we're going to be using the SIFT1M dataset, which you can download using a script (I'll put a link 00:00:34.960 |
to it in the description); it's just a big dataset of dense vectors. From this we'll 00:00:43.440 |
get our 1 million samples, xb, and we also have our queries, xq, although we're just going 00:00:52.480 |
to use one query at a time. So we'll have one query and 1 million samples, and we perform our search using that. 00:00:59.360 |
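As a rough sketch of that setup (the actual download script isn't shown here, so the file paths and the read_fvecs helper below are assumptions based on the standard SIFT1M .fvecs format):

```python
import numpy as np

def read_fvecs(path):
    # .fvecs stores each vector as an int32 dimension followed by
    # that many float32 components
    data = np.fromfile(path, dtype=np.float32)
    d = data[:1].view(np.int32)[0]
    return data.reshape(-1, d + 1)[:, 1:].copy()

xb = read_fvecs("sift/sift_base.fvecs")       # 1M database vectors (assumed path)
xq = read_fvecs("sift/sift_query.fvecs")[:1]  # a single query vector (assumed path)
```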
So, what we first want to do is import Faiss. I'm going to get my dimensionality, which 00:01:07.520 |
is xb.shape[1]: d, yep, 128. So that's the dimensionality of each vector 00:01:17.440 |
within the SIFT1M dataset. Initially, let's use an nbits value of 4, and if we 00:01:24.880 |
think about it, how many buckets does that give us? We can calculate it as 2^nbits, which gives us 00:01:31.840 |
a total of 16 buckets. These are potential buckets: our vectors aren't necessarily 00:01:39.840 |
going to be separated between all of them; we might end up only using, say, 10 of those 00:01:45.680 |
buckets, for example. Then what we want to do is initialize our index, faiss.IndexLSH, 00:01:53.200 |
and in here we need to pass our dimensionality and also our nbits value, i.e. how many binary values 00:02:02.320 |
we're going to include in the hashed code of each item. The first thing it's always good to check is whether 00:02:12.240 |
we need to train the index. index.is_trained is True, so that means we don't need to train it; 00:02:17.920 |
there are no optimizations or anything we need to do here, so that's good, we don't need to 00:02:21.520 |
worry about that. Without needing training, all we do is index.add, and we add our data. 00:02:31.360 |
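Putting those steps together, a minimal sketch using the standard Faiss API:

```python
import faiss

d = xb.shape[1]  # 128 for SIFT1M
nbits = 4        # hash length in bits -> 2**nbits = 16 potential buckets

index = faiss.IndexLSH(d, nbits)
print(index.is_trained)  # True: LSH has no training step

index.add(xb)  # index all 1M database vectors
```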
Now we're actually ready to go ahead and search, so we do D, I = index.search, 00:02:41.680 |
passing xq and k. Let's set k up here, actually: let's set k equal to 10. 00:02:53.680 |
And then I contains what are, supposedly, the indexes of the items most similar to our query. 00:03:05.200 |
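In code, that search step looks something like this:

```python
k = 10
D, I = index.search(xq, k)  # D: Hamming distances, I: indexes of the nearest neighbors
print(I)
```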
There's one good thing to check here, which makes it really easy to figure out whether we have 00:03:10.640 |
enough buckets or not, and that is D, the distances between those vectors. Obviously, if we have 00:03:18.080 |
all of our vectors in a single bucket, and we search and our query ends up in that same bucket, 00:03:26.080 |
all of the distances here are going to be zero, because the vectors are all at exactly the same 00:03:32.800 |
Hamming distance, which is what we're returning. So let's look, and that's what we get. We have 00:03:39.120 |
very few values here, though, so that's not that surprising. Let's push k up to 10,000; you'd 00:03:47.200 |
think at this point surely we'd get some ones, but no. So at least the first 10,000 00:03:55.200 |
nearest neighbors are all within one bucket, and clearly there's an issue here: we can't 00:04:02.400 |
differentiate between any of those. But if we increase this to a very high number, let's say 00:04:10.240 |
500,000 (so we're returning half of the index at this point), we should, I imagine, see a few 00:04:17.840 |
ones. So at some point within the first 500,000 items we do actually move from a single bucket into at least one other bucket. 00:04:27.520 |
Now, if we think about it, we can actually estimate roughly 00:04:36.320 |
what nbits value we need in order to get a better distribution between our buckets, 00:04:45.840 |
so as to not have everything in a single bucket. So let's try a few: for nbits in these 00:04:52.880 |
values, let's give it a go and see what we get. And we see that nbits equals 2 gives us an 00:05:00.320 |
average of 250,000 items within each bucket, so obviously that's not useful. Here we get 62k, 00:05:10.000 |
again not useful, then 15k, then 4k. That's a lot better, right, but it's still not useful, because if 00:05:19.840 |
we set k equal to 10, this isn't really going to give us anything particularly 00:05:27.040 |
useful. And then we also need to consider the fact that not all of our buckets are 00:05:34.320 |
going to get used; what we will find is that 00:05:39.120 |
a minority of buckets end up with the majority of vectors. So we 00:05:48.880 |
need to go below an average of one vector per bucket for this to work, so I would say something like an nbits of 24. 00:05:58.560 |
And that value actually looks pretty good: about 0.06, so on average we should get 00:06:04.080 |
0.06 items within each bucket. Okay, so let's test that. 00:06:14.640 |
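The estimate itself is pure arithmetic: the expected average bucket occupancy is the dataset size divided by the number of possible buckets, 2^nbits. The exact list of nbits values tried isn't shown on screen, so the ones below are illustrative:

```python
for nbits in [2, 4, 6, 8, 12, 16, 20, 24, 32]:
    print(nbits, xb.shape[0] / 2**nbits)  # average vectors per bucket if spread evenly
```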
So, we're going to copy this, and we'll write similar code to what we wrote before. Although, 00:06:20.960 |
one thing before we do that: let's just check the average similarity. So, change k to 00:06:28.480 |
something like 100, or 10; let's go with 10. And what I want to do, given those results, is 00:06:39.440 |
calculate the average cosine similarity. To do that, we use from 00:06:46.960 |
sklearn.metrics.pairwise import cosine_similarity; this is just a super quick way for us to calculate it. 00:06:56.320 |
Then we call cosine_similarity, and we pass I[0] and 00:07:15.760 |
xq. Oh, sorry: I just contains the indexes, 00:07:29.200 |
so we need to use them to pull the actual vectors out of xb. Okay, we have this, and we also need to take the mean 00:07:37.360 |
of that, cos.mean(), and we get 0.44, which is just the average similarity of those results, 00:07:48.960 |
so it's pretty low. Let's compare that to what we'd get if we did the same for the whole dataset, 00:07:56.480 |
like this, and see what we get. Okay, so our retrieved results are actually slightly more similar than the dataset average, 00:08:06.960 |
so we do get slightly better results than random, even with that terrible bucketing. 00:08:12.400 |
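A minimal sketch of that check, covering both the retrieved top-k and the whole-dataset baseline:

```python
from sklearn.metrics.pairwise import cosine_similarity

k = 10
D, I = index.search(xq, k)

cos = cosine_similarity(xb[I[0]], xq)  # similarity of the k retrieved vectors to the query
print(cos.mean())                      # ~0.44 in the video

baseline = cosine_similarity(xb, xq)   # similarity of every vector to the query
print(baseline.mean())                 # dataset-wide average, for comparison
```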
So, what I want to do now is repeat the process we just did there, but I want 00:08:20.720 |
to do it for different nbits values. So we take our index setup, bring it down here, 00:08:31.120 |
do index.add with our data, and then D, I = index.search, as we 00:08:46.480 |
usually do, with k of 10. Then we do cos = cosine_similarity; again we pass 00:08:56.400 |
xb[I[0]], pulling out the vectors at those indexes (we're doing this for different nbits values, 00:09:03.360 |
hence the loop), and xq. And then what we're going to do is just print 00:09:11.200 |
the nbits value, whatever it's equal to at that point, 00:09:20.960 |
and then, on a new line, the similarity, which is equal to 00:09:29.520 |
cos.mean(). Let's see what we get. 00:09:39.840 |
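That loop might look like the following sketch; the particular nbits values are again assumptions:

```python
for nbits in [2, 4, 8, 12, 16, 24, 32]:
    index = faiss.IndexLSH(d, nbits)
    index.add(xb)
    D, I = index.search(xq, k)
    cos = cosine_similarity(xb[I[0]], xq)
    print(f"nbits={nbits}\nsimilarity: {cos.mean():.2f}")
```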
Okay, cool. So we look here: nbits of 2 is terrible; it gets better and better; and then at 24, the point where 00:09:48.720 |
we had the 0.06 average, we get the maximum value, this 0.73 here, then 0.74, which is the best we 00:09:56.960 |
get, and then it tails off a little bit after that. I've visualized that here; 00:10:02.320 |
this is the same structure, and I've gone a little bit higher with nbits as well, 00:10:08.160 |
and this is essentially the sort of curve we get. The bit that we just saw 00:10:15.040 |
was this section here, where increasing nbits heavily reduces 00:10:22.640 |
the number of vectors within each bucket on average, and the cosine 00:10:30.320 |
similarity just shoots up really quickly. But afterwards we still get a kind of slow increase, 00:10:39.360 |
and that's because the vectors are by then spread 00:10:47.040 |
very nicely between different buckets, but as we increase the resolution even further and further, 00:10:53.520 |
all it means is that we're encoding more of the original positional 00:10:59.120 |
information of those vectors, essentially improving the resolution of the hashed vectors, so 00:11:03.600 |
they're slightly more accurate about whether a hashed bucket sits close to other 00:11:12.080 |
very similar, or not so similar, vectors. Essentially, the binary vectors represent 00:11:20.160 |
the original dense vectors more faithfully, and that's why we see that slight increase continue even after we 00:11:28.960 |
get to an nbits of 32. Now, one final thing I just wanted to quickly cover: say we do have 00:11:38.640 |
issues, and we want to visualize the actual buckets themselves. How can 00:11:44.800 |
we extract those buckets and view the number of vectors that we have in each one? 00:11:51.440 |
Well, there isn't any built-in way of doing this in Faiss, so we kind of have to put our own 00:12:00.400 |
code together. What we need to do first is create another index, 00:12:06.080 |
because I want to visualize these, and if I visualized with an nbits 00:12:13.600 |
of 32, well, how many buckets would we have? It would be 00:12:20.320 |
2 to the power of 32, so we can't visualize that many buckets. We're going to go with 00:12:25.920 |
an nbits of 4 instead, which gives us 16 buckets. So, faiss.IndexLSH: we pass d, and we use 00:12:36.320 |
an nbits value of 4, so nbits = 4, and just put that there. 00:12:44.160 |
Okay, we add our vectors. 00:12:55.200 |
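As a sketch, this smaller index for visualization could be built like so:

```python
nbits = 4  # 2**4 = 16 buckets, few enough to visualize
index = faiss.IndexLSH(d, nbits)
index.add(xb)
```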
Now, what we want to do is actually extract those binary vectors themselves. To do that, we're going to store them in 00:13:02.480 |
the array variable: we write array = faiss.vector_to_array, and in here we pass index.codes. 00:13:14.560 |
So, 'codes': that word essentially means the binary vectors that we've 00:13:22.720 |
built. They are called codes, that's all it means, but we use this word a lot in similarity 00:13:28.800 |
search, so it's worth remembering. nbits is used a lot as well, so in future videos 00:13:35.840 |
in this series you will see both of them quite a lot. So, let's have a look at what we have. 00:13:43.120 |
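That extraction, concretely:

```python
# index.codes is a C++ vector of uint8 values; vector_to_array copies it into numpy
array = faiss.vector_to_array(index.codes)
print(array[:10])  # one uint8 code per vector here, since nbits <= 8
```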
Okay, so we wanted to pull out the binary codes, and we're getting these numbers instead, which is 00:13:51.360 |
not exactly what we'd expect. Or is it? Well, let's have a look at the min value 00:13:58.400 |
here: zero. And array.max()? 15. So we get values from 0 to 15, which means we get 16 values, and if we do 00:14:09.520 |
2 to the power of nbits, we get 16. So that means we have 16 unique buckets that all of our vectors 00:14:16.720 |
are spread between, which is interesting. Let's see how many vectors we have: 1 million, 00:14:25.200 |
so we have 1 million vectors spread between 16 buckets. So actually that sounds about right. But how do we read these values? 00:14:30.480 |
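A quick sanity check along those lines:

```python
print(array.min(), array.max())  # 0 and 15
print(len(np.unique(array)))     # 16 unique bucket ids, matching 2**nbits
print(array.shape)               # (1000000,): one code per indexed vector
```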
Well, these values here are just the integer versions of our binary vectors, so one 00:14:39.920 |
integer represents, say, 0101 and another represents 1100. What we can do is use 00:14:53.360 |
a little script to convert them back into the actual binary vectors. So here, the first 00:15:00.880 |
value is a 5, which becomes 0101, and the next one is a 12, which becomes 1100, as you see here. 00:15:08.240 |
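The conversion script itself isn't shown in full, but something like this would do it, using numpy's binary_repr (with the standard integer bit ordering assumed):

```python
# map each integer code to an nbits-wide binary string, e.g. 5 -> '0101', 12 -> '1100'
binary = [np.binary_repr(v, width=nbits) for v in array]
print(binary[:2])
```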
So those are our binary vectors, and we can visualize how they are 00:15:16.480 |
spread, i.e. how many items are within each one of those buckets, using this array, and what we'll 00:15:23.520 |
get is something that looks like this. This is what I mean: we have 16 buckets, 00:15:30.000 |
which you can see on the left here, and not all of them even 00:15:38.800 |
contain any vectors; we have one, two, three, four that don't contain any vectors at all, 00:15:45.600 |
while the vast majority of vectors are contained within these and these. So these 00:15:53.760 |
four buckets contain almost everything, and they have loads: the counts 00:16:01.920 |
run into the thousands there, and for that number, 00:16:06.400 |
add a few more zeros onto the end. So yeah, these four buckets have far 00:16:16.240 |
too many vectors in them for this to be useful, and that's why, earlier, we increased our nbits value up to 24. 00:16:22.480 |
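Here's a sketch of how that per-bucket count could be produced from the codes array (matplotlib assumed for the chart):

```python
import matplotlib.pyplot as plt

counts = np.bincount(array, minlength=2**nbits)  # vectors per bucket id, 0..15
plt.barh(range(2**nbits), counts)
plt.ylabel("bucket id")
plt.xlabel("number of vectors")
plt.show()
```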
Now, I think that's basically everything I wanted to 00:16:28.640 |
cover, but just quickly, in terms of performance: I visualized this using the Faiss 00:16:35.600 |
IndexLSH index, and obviously, as we increase nbits a lot, 00:16:43.360 |
we increase the recall performance. So that's one thing about LSH: to get good performance 00:16:51.680 |
you really have to increase the nbits value a lot, and at that point it can 00:16:57.120 |
get really inefficient. This is the time for each query as compared to a flat index. 00:17:04.560 |
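If you want to reproduce that comparison yourself, a rough timing sketch could look like this, with IndexFlatL2 as the exhaustive baseline:

```python
import time

flat = faiss.IndexFlatL2(d)
flat.add(xb)

for name, idx in [("lsh", index), ("flat", flat)]:
    t0 = time.time()
    idx.search(xq, k)
    print(name, time.time() - t0)  # per-query wall-clock time for each index
```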
So obviously, you kind of want to be careful with an LSH index. It can be useful, 00:17:15.600 |
but with high dimensionality it's difficult to balance speed and accuracy well. 00:17:24.720 |
Obviously it can be done; it just depends on how much accuracy you're willing to 00:17:31.840 |
sacrifice, because in most cases with LSH you will be sacrificing a reasonable amount of accuracy. 00:17:38.080 |
The recall here is about 50 percent, so around here it's managing to identify 50 percent 00:17:46.800 |
of the same items that a flat index would identify, and it's only slightly faster. 00:17:54.320 |
So you just have to weigh up what is most important for your use case. 00:18:03.680 |
But anyway, that's it for LSH. This is the last video we'll cover on LSH for now. 00:18:13.040 |
In the next few videos and articles we're going to be covering a few different indexes and 00:18:22.800 |
how we use them effectively, so it should be pretty interesting; we're definitely getting a 00:18:30.320 |
bit more advanced than LSH now, so it should be good. But that's it for this video. Thank 00:18:36.160 |
you very much for watching, and I'll see you in the next one.