How LSH Random Projection works in search (+Python)

Okay, so we're on to LSH, or locality sensitive hashing, with random projection. In the previous video in the series, if you're following it, we covered LSH, but we covered the more traditional version of it. What we're covering here is, I suppose, more of a modern implementation of it. This is what you'll see in libraries like Faiss, and we'll work through later on how to actually implement this in Faiss. So what we're going to cover in this video is specifically LSH with random projection. We're going to work through a few visualizations to try and get a grasp of what it is actually doing, and whilst we're doing that we'll also work through how to implement it in very simple Python code. The Faiss implementation of this, which is obviously much more efficient than what we'll be covering in this video, we'll cover in a separate video, because otherwise this one would just be huge. So if you would like to see that, I'll make sure there's a link to it in the description.
So at the top here we have this first row, if you like. Here we're minimizing collisions: this is a typical hash function, this blue bit in the middle. We pass items through the hash function and it makes sure that we put each of them into separate buckets. LSH tries to do the opposite: it maximizes collisions, but only for similar vectors. In short, it's a hashing function that tries to bucket similar vectors together. Imagine our database vectors have been indexed already, so they've all been bucketed. Then we take a new vector, which will be our query vector, process it through the same bucketing mechanism, and see where it lands. Here it has landed in this bucket, and that means that the most similar vectors to it are the ones that are either in the same bucket or in the neighboring buckets.
LSH with random projection compresses our vectors into low-resolution vectors, which makes them much quicker to compare. And that is what you can see here: we have our dense vectors at the top (typically we're using dense vectors for this), and they can contain hundreds or thousands of values, typically floating point numbers. What we do is convert them into a very small binary vector, which is what you can see at the bottom. It's a lot more memory efficient, and usually it should be faster to search, although sometimes it can actually go the other way and become slower to search than just using a flat index. Now, at the same time, because we are compressing the vectors, accuracy is obviously going to decrease. What we really want is to maintain decent accuracy whilst speeding up the search. Our aim, using xq, which is our query vector, is to find the most similar indexed vectors as accurately as possible. Obviously we're approximating, so it won't be perfect, but that's fine as long as we get a decent speed-up.
To bucket our vectors we split up the vector space, which is obviously a highly dimensional vector space, using hyperplanes, and that's what you can see right here: a single hyperplane. If your vector is on the positive side of that hyperplane, then when we process it, that position in our binary vector is assigned a one. On the other side you have the negative side of the hyperplane, and if your vector is on that side, it will be assigned a negative value by the dot product, and that position in our binary vector is assigned a zero.

The reason that works is that we're using the dot product with the hyperplane's normal vector. So imagine this green line down here is the hyperplane that I just showed you; we have its normal vector. If the dot product finds that the vector and the normal vector point in the same direction, it comes out positive. If the vector is on the other side, so anywhere over here, the dot product comes out negative, which is what you can see.

Now, a single binary value isn't going to tell us much about the direction or position of our vector, so what we do is just add more hyperplanes, and that gives us more binary values within our vector.
So what you can see here is that we have two hyperplanes; in reality we would essentially just use loads of hyperplanes. The big arrows with the points are the normal vectors, which is where we're actually calculating the dot product, and the lines are our hyperplanes. So here, for example, the blue vector gets a value against the first hyperplane, and then here, for example, it gets a value of one against this one.
Now let's start building this out in code so we can actually see it working. So what I'm going to do is set the number of hyperplanes that we would like, and we set that using an nbits variable. I'm going to say we have four hyperplanes just for this example; in reality you would use more, but for now we're going to go with four, so we have four binary values here. And all I'm going to do is create our vector dimensionality as well. We're just going to use 2D vectors, to make it easier and so we can visualize things as well. So all we need to do is create the plane norms, and they are going to be numpy random values with dimensionality nbits by d: the number of hyperplanes that we want, nbits, and the dimensionality that we're actually using, d. And we're just going to subtract 0.5, because we want them centered around the origin. You don't need to do this, by the way; it's just so we can see the effect of it a little bit better.
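As a sketch of the setup just described (the names nbits, d, and plane_norms follow the video; everything else here is a minimal assumption):

```python
import numpy as np

nbits = 4  # number of hyperplanes, and binary values per hashed vector
d = 2      # vector dimensionality, kept at 2 so we can visualize it

# random hyperplane normal vectors; subtracting 0.5 shifts the values
# from [0, 1) to [-0.5, 0.5) so the planes center around the origin
plane_norms = np.random.rand(nbits, d) - 0.5
print(plane_norms.shape)  # (4, 2)
```

Each row of plane_norms is the normal vector of one hyperplane through the origin.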
It won't match up exactly to this visualization, but it's essentially the same thing. What we have done is create four 2D plane norms. Next we define three vectors, a, b, and c, and what we're going to do is calculate the dot product for each one of those against the plane norms, to see whether each lands positive or negative, in front of or behind each of our plane norms. To do that, we just use np.dot with our vector and the plane norms. Okay, so we see that we get negative, negative, positive, positive, so when we convert this into our binary vector, that will be 0 0 1 1. If it's negative it's a zero; if it's positive it's a one. To do that, we just check whether each dot product is greater than zero, and now we get False, False, True, True. The final thing to do there, although you don't strictly need to (in fact, we do need to, purely to create our hash strings later), is convert them to integers with astype(int), which also makes them easier for us to visualize.
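The dot-product and binarization steps can be sketched like this. The vectors a, b, and c below are illustrative values, not necessarily the ones used in the video, and the seed is an assumption added only to make the run reproducible:

```python
import numpy as np

np.random.seed(0)  # assumption: seeded only for reproducibility
nbits, d = 4, 2
plane_norms = np.random.rand(nbits, d) - 0.5

# illustrative 2D vectors
a = np.asarray([1, 2])
b = np.asarray([2, 1])
c = np.asarray([3, 1])

# dot product of vector a against every plane normal:
# positive -> same side as the normal, negative -> opposite side
a_dot = np.dot(a, plane_norms.T)

# positive dot products become 1, negatives become 0
a_hash = (a_dot > 0).astype(int)
print(a_dot)   # four signed values, one per hyperplane
print(a_hash)  # four binary values
```

Repeating the last two steps for b and c gives each vector its own nbits-long binary hash.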
Okay, and we should get something that looks more like this here. You can see the values, the positions, are slightly different, but a is on the positive side of the teal hyperplane, so of course it gets a one, and c and b are both on the other side. b and c are in a similar position, so we would hope that the values they get align with each other a bit better than with a. Okay, so they're the same, and that's good, because they are very much in a similar place, which you can see from here. So a, b, and c in this case match up to what we're getting in the code; it's just the hyperplanes that are different.
Now, it's these binary vectors that we use to create our LSH buckets. So what we're going to do is actually implement that, just using a Python dictionary. What we'll do is put each of our vectors into a vector list, just so we can iterate through them a little more easily. I'm going to initialize our buckets, which, like I said, is going to be a Python dictionary, and here I'm just going to set i equal to zero, so we can loop through each of those vectors.

The first thing we want to do is create a hash string using the binary vectors that we have up here. What we want to do is join all of those numbers together, but we need to join them as strings. Then we say: if the hash string is not already within our buckets' keys, we initialize a new list there, essentially initializing a bucket to put our vectors in; and after that, we just append the vector to that bucket. Okay, it should be fine, so let's print the buckets and see what we get. Vectors one and two have both been put into the same bucket, and that's essentially how the bucketing works.
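A minimal sketch of that bucketing loop. The hash values below are hard-coded stand-ins for the hashes computed earlier, chosen so that vectors 1 and 2 collide, as in the video:

```python
import numpy as np

# stand-in binary hashes for vectors 0, 1, and 2
vectors = [np.array([0, 0, 1, 1]),
           np.array([0, 1, 1, 1]),
           np.array([0, 1, 1, 1])]

buckets = {}
for i, v in enumerate(vectors):
    # join the 0/1 values into a hash string, e.g. "0011"
    hash_str = "".join(str(x) for x in v)
    # initialize a new bucket if we haven't seen this key before
    if hash_str not in buckets:
        buckets[hash_str] = []
    # store the vector's index (rather than the vector itself) in the bucket
    buckets[hash_str].append(i)

print(buckets)  # {'0011': [0], '0111': [1, 2]}
```

Storing indices rather than full vectors is what makes the index compact.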
Now, let's say we have our buckets, and we have a new query vector, xq, that I want to hash and search with. So we've got two examples here, where we're comparing our query vector against two bucket keys. xq has been hashed to 0 1 1 1. We compare it to this first key, and we're using Hamming distance here, which essentially asks: are these two positions equivalent? If they are, the distance there is zero; if they do not match, the distance is one, and then we add those up across all positions. With this first one, none of the positions match, so we get 1 + 1 + 1 + 1, which is obviously four. So the Hamming distance between those two is four, which is the biggest you're going to get with this dimensionality of binary vectors. Then we have the second one, and these ones match a bit better: zero is equal to zero, one to one, one to one, so that's 0 + 0 + 0, and then this final position differs, so the Hamming distance here is one.
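The Hamming comparison can be sketched as follows; the two bucket keys here are chosen to reproduce the distances in the example (all four positions differing, then only the last one):

```python
import numpy as np

def hamming(x, y):
    # count the positions where the two binary vectors differ
    return int(np.sum(x != y))

xq_hash = np.array([0, 1, 1, 1])   # the hashed query vector
key_far = np.array([1, 0, 0, 0])   # no positions match
key_near = np.array([0, 1, 1, 0])  # only the final position differs

print(hamming(xq_hash, key_far))   # 4
print(hamming(xq_hash, key_near))  # 1
```

To answer a query, we would compute this distance from xq's hash to each bucket key and look inside the nearest buckets.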
Now, we also have to consider that there's a degree of information loss, because this is how we're storing the vectors; we don't store the original vectors anymore. Say we have our query vector and it comes through as 0 0 1 0: great, we get a perfect match with a bucket, but in reality, how close is it to each of the vectors inside that bucket? We don't actually know. So we have to be careful when we're building our buckets that there are not too many items within each bucket. We need them to be reasonably spread out, but not spread too thin, because if we spread them too thin across too many buckets, the index becomes absolutely huge. So we want enough granularity in there to differentiate between vectors, but not so much granularity that we just make the index bigger, because then it's slower than just doing a flat search.
What we see here is what we just discussed. These two vectors, a and b, are reasonably far from each other, but if we use an nbits value of two, our binary vectors are not big enough: the two vectors get bucketed into the same place and we can't differentiate them. What we can do, obviously, is increase the number of hyperplanes, i.e. increase nbits, and then we can differentiate them. So for these two here, we now have differences here and here in the bucket keys, so we can differentiate between them, which is obviously what we need. But at the same time we are increasing the size of our index, so we're getting more accurate, but we're also getting slower. It's about finding the middle ground between the two.
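One way to see this trade-off is that each extra hyperplane contributes one bit and so doubles the number of possible bucket keys, meaning nbits directly controls how finely (and how expensively) the space is carved up:

```python
# each hyperplane contributes one bit, so nbits bits give 2**nbits
# possible buckets: finer resolution, but a potentially larger index
bucket_counts = {nbits: 2 ** nbits for nbits in [2, 4, 8, 16]}
for nbits, n_buckets in bucket_counts.items():
    print(f"nbits={nbits}: up to {n_buckets} buckets")
```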
Now, that's it for the implementation details. We're going to leave this video here, and we'll cover the Faiss implementation in the next video, which I will leave a link to in the description so you can find it easily. But for now, thank you very much for watching, and I'll see you in the next one.