How LSH Random Projection works in search (+Python)
00:00:00.000 |
Okay, so we're on to LSH, or locality sensitive hashing, with random projection. In the 00:00:07.360 |
previous video in the series, if you're following it, we covered LSH, 00:00:12.800 |
but we covered the more traditional version of it. 00:00:19.840 |
What we're covering here is, I suppose, more of a modern implementation of it. 00:00:27.200 |
This is what you'll see in libraries like Faiss, and we'll work through later on how to actually implement this in Faiss. 00:00:34.240 |
So what we're going to cover in this video is specifically LSH random projection. 00:00:41.120 |
We're going to work through a few visualizations to try and get a grasp of what it is actually doing, 00:00:47.520 |
and whilst we're doing that, we'll also work through how to implement it in very simple Python code. The Faiss 00:00:56.620 |
implementation of this, which is obviously much more efficient than what we'll be covering in this video, 00:01:04.220 |
we'll cover in a separate video, because otherwise this one would just be huge. 00:01:11.660 |
So if you would like to see that, I'll make sure there's a link to it in the description. 00:01:22.780 |
So at the top here we have this first kind of row, if you like. 00:01:28.540 |
Here we're minimizing collisions. This blue bit in the middle is a hash function, 00:01:38.380 |
and we're passing values through it, and it's making sure that we put both of those into separate buckets. 00:01:49.260 |
An LSH function tries to do the opposite: it maximizes collisions, but it only tries to maximize collisions for similar vectors. 00:02:03.500 |
In short, it is a hashing function that tries to bucket similar vectors together. 00:02:16.620 |
Imagine our database vectors are indexed already, so they've all been bucketed. 00:02:25.500 |
Then we take our query vector, process it through the same bucketing mechanism, and see where it lands. 00:02:35.900 |
Say it has landed in this bucket; that means that the most similar 00:02:40.780 |
other vectors to it are the ones that are either in the same bucket or in the neighboring buckets. 00:02:51.660 |
To do this, LSH converts our vectors into low-resolution vectors, which makes them much cheaper to store and compare. 00:02:58.380 |
And that is what you can see here: we have our dense vectors at the top. Typically we're using dense vectors for this, 00:03:07.580 |
and they can contain hundreds or thousands of values, typically all floating point numbers. 00:03:18.860 |
What we do is convert them into a very small 00:03:23.340 |
binary vector, which is what you can see at the bottom. So it's a lot more memory efficient, and 00:03:30.300 |
usually it should be faster to search, although sometimes it can actually go the other way 00:03:37.580 |
and become slower to search 00:03:40.700 |
than just using a flat index. Now, at the same time, because we are compressing our vectors, 00:03:50.060 |
accuracy is obviously going to decrease. What we want to do really is to maintain decent accuracy 00:03:57.660 |
whilst speeding up the search. Our aim is, using xq, which is our query vector, to find its nearest neighbors 00:04:07.980 |
as accurately as possible. Obviously we're approximating, so it won't be perfect, but that's fine as long as we get a decent speed-up. 00:04:24.540 |
To do this, LSH random projection splits our highly dimensional vector space using 00:04:29.340 |
hyperplanes, so, I mean, it's what you can see right here. 00:04:36.300 |
Here we have a single hyperplane. If your vector is on the positive side of that hyperplane, 00:04:50.940 |
then when we process it, in our binary vector it would be assigned a one. 00:04:55.660 |
On the other side you have the negative side of the 00:05:00.440 |
hyperplane, and if your vector is on that side of it, it will be assigned a negative value by the dot product, and 00:05:10.840 |
in our binary vector it would be assigned a zero. 00:05:14.600 |
The reason that works is we're using the dot product value. So imagine this green line 00:05:20.360 |
down here is the hyperplane that I just showed you, and we have its normal vector. 00:05:31.240 |
If the dot product finds that our vector and that normal vector point in the same direction, 00:05:45.080 |
then it will come out positive; if our vector is on the other side, so 00:05:54.120 |
anywhere over here, it would come out as a negative dot product, which is what you can see. 00:06:04.760 |
Now, a single binary value isn't going to tell us much about the 00:06:08.040 |
direction of our vector or the position of our vector, 00:06:12.680 |
so what we do is just add more hyperplanes in there. We 00:06:18.440 |
add more hyperplanes, and that gives us more binary values within our vector. 00:06:23.080 |
So what you can see here is we have two hyperplanes. 00:06:45.800 |
In practice, what we would do is essentially just use loads of hyperplanes. 00:06:52.600 |
The big arrows with the points that you see there are the normal vectors, which is where we're actually calculating the dot product, 00:07:01.880 |
and these lines are our hyperplanes. Okay, so here, for blue, we would get a value of zero, 00:07:10.520 |
and then here, for example, for blue we would get a value of one. 00:07:17.400 |
Now let's start building this out in code so we can actually see how it works. 00:07:23.800 |
So what I'm going to do is set the number of hyperplanes that we would like, and we set that using a variable, nbits. 00:07:33.080 |
I'm going to say we have four hyperplanes, just for this example; in reality you would use more. 00:07:38.280 |
But for now we're going to go with four, so we have four binary values here. 00:07:44.680 |
And all I'm going to do is create our vector dimensionality as well. So we're just going to use 2D vectors 00:07:51.400 |
to make it easier, and so we can sort of visualize stuff as well. 00:07:55.400 |
So all we need to do is create the plane norms, and they are going to be numpy random values, 00:08:05.320 |
and the shape there will be nbits by d: 00:08:09.080 |
the number of hyperplanes that we want, nbits, 00:08:14.200 |
and the dimensionality that we're actually using, d. 00:08:16.760 |
And we're just going to subtract 0.5 because we want them to center 00:08:22.200 |
around the origin. You don't need to do this, by the way; 00:08:26.760 |
it's just so we can kind of see the effect of it a little bit better. 00:08:36.920 |
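Put together, the setup described above looks something like this; the names nbits, d, and plane_norms follow the walkthrough:

```python
import numpy as np

nbits = 4  # number of hyperplanes, and binary values per hashed vector
d = 2      # vector dimensionality, kept at 2 so it is easy to visualize

# one row per hyperplane: each row is that hyperplane's normal vector;
# subtracting 0.5 shifts the values from [0, 1) to [-0.5, 0.5) so the
# planes center around the origin (optional, just for visualization)
plane_norms = np.random.rand(nbits, d) - 0.5
print(plane_norms.shape)  # (4, 2)
```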
It won't look identical to this visualization, but it's essentially the same thing. What we have done is we've created four 2D plane norms. 00:08:56.280 |
Next we take three vectors, a, b, and c, and what we're going to do is calculate the dot product for each one of those, 00:09:06.360 |
to see whether it lands positive or negative against each of our plane norms. 00:09:09.960 |
So to do that, we're just going to do np.dot with our vector and the plane norms. 00:09:23.480 |
Okay, so we see that we get negative, negative, positive, positive. So 00:09:28.440 |
when we convert this into our binary vector, that will be zero, zero, one, one. 00:09:46.040 |
Okay, if it's negative it's a zero; if it's positive, it's a one. To do that, 00:10:10.760 |
we check whether each value is greater than zero. Okay, so now we get false, false, true, true. The final thing to do there, 00:10:16.680 |
although you don't necessarily need to, is to convert them to integers. In fact, we do need to, purely to create our bucket keys later on. 00:10:34.360 |
We use astype(int), so it's essentially easier for us to visualize as well. 00:10:41.240 |
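As a sketch, hashing a single vector through these steps might look like this; the example vectors a, b, and c (and the random seed) are assumed values for illustration, not the exact ones used on screen:

```python
import numpy as np

np.random.seed(0)  # assumed seed, purely for reproducibility
nbits, d = 4, 2
plane_norms = np.random.rand(nbits, d) - 0.5

# example 2D vectors (illustrative values)
a = np.asarray([1, 2])
b = np.asarray([2, 1])
c = np.asarray([3, 1])

# dot product of the vector against every hyperplane normal
a_dot = np.dot(a, plane_norms.T)

# positive side of a hyperplane -> True, negative side -> False
a_dot = a_dot > 0

# convert the booleans to 0/1 integers for readability (and for
# building the bucket keys later)
a_dot = a_dot.astype(int)
print(a_dot)
```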
Okay, and we should get something that looks more like this here. So 00:10:48.600 |
you see here, obviously, the values, the positions, are slightly different, 00:10:53.400 |
but we have a on the positive side of the teal hyperplane, so that value is of course 00:10:59.800 |
a one, and then c and b are both on the other side. 00:11:13.240 |
c and b are in a similar position, so we would hope that they kind of align 00:11:18.200 |
in the values that they have a bit better than they do with a. 00:11:25.800 |
Okay, so they're the same; that's good, because they are 00:11:33.480 |
very much in a similar place, which you can see from here. 00:11:37.560 |
So a, b, and c in this case match up to what we're writing in the code; it's just the hyperplanes that are different. 00:11:43.160 |
Now, it's these binary vectors that we use to create our LSH buckets. 00:11:50.360 |
So what we're going to do is actually implement that using just a Python dictionary. 00:11:58.760 |
What we'll do is I'm going to put each of our vectors into a 00:12:09.420 |
vector list; it's just so we can iterate through them a little bit easier. 00:12:13.020 |
I'm going to initialize our buckets, which is going to be, like I said, a Python dictionary, 00:12:17.680 |
and here I'm just going to set i equal to zero. This 00:12:21.420 |
is just so we can loop through each of those vectors. 00:12:34.380 |
So the first thing we want to do is create a hash string using the 00:12:40.540 |
hashed vectors that we have up here. So I'm going to build hash_str, 00:12:48.380 |
and what we want to do is join all of those numbers together, but we need to join them as strings. 00:13:15.340 |
Then what we want to do is say: if the hash string is not already within our buckets' 00:13:21.900 |
keys, we want to initialize a new list for it. 00:13:35.820 |
We're essentially initializing a bucket to put our vectors in. 00:13:43.420 |
We initialize it, and then after that we just append the vector to that bucket. 00:13:57.260 |
Okay, it should be fine, so let's print the buckets and see what we get. 00:14:13.980 |
Vectors one and two have both been put into the same bucket. All right, and that's essentially how we build our buckets. 00:14:29.420 |
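The bucketing loop described above can be sketched end to end like this; the database vectors here are assumed example values:

```python
import numpy as np

np.random.seed(0)  # assumed seed, for reproducibility
nbits, d = 4, 2
plane_norms = np.random.rand(nbits, d) - 0.5

# example database vectors (illustrative values)
vectors = [np.asarray([1, 2]), np.asarray([2, 1]), np.asarray([3, 1])]

buckets = {}
for i, v in enumerate(vectors):
    # hash: one bit per hyperplane, from the sign of the dot product
    hash_vec = (np.dot(v, plane_norms.T) > 0).astype(int)
    # join the bits into a string key, e.g. '0011'
    hash_str = ''.join(hash_vec.astype(str))
    # initialize a new bucket (list) if we haven't seen this key yet
    if hash_str not in buckets:
        buckets[hash_str] = []
    # store the vector's position so we can look it up later
    buckets[hash_str].append(i)

print(buckets)
```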
Now, let's say we have our buckets, 00:14:39.180 |
and we have a query vector xq that we want to search with; we hash it and search using that. 00:14:49.020 |
So we've got two examples here: 00:14:53.820 |
we're comparing our query vector against two already-bucketed vectors. 00:15:02.920 |
So xq here has been hashed to zero one one one. 00:15:15.100 |
We compare it to this vector, and we're using Hamming distance here, which essentially asks: are these two values equivalent? 00:15:24.140 |
If they are, it's zero; there's no distance between them. 00:15:28.780 |
If they do not match, then the distance is one, and then we add those up across all positions. 00:15:38.540 |
With this one, none of them match, so we get one plus one plus one plus one, which is obviously four. 00:15:45.180 |
So the Hamming distance between those two is four, which is 00:15:48.060 |
the biggest you're going to get with this dimensionality of binary vectors. 00:15:53.180 |
And then we have the second one, and these ones match a bit better: zero is equal to zero, one to one, one to one, 00:16:03.340 |
so that's zero, zero, zero, and then this final pair differs, so the Hamming distance there is one. 00:16:26.060 |
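The Hamming distance comparison can be checked directly in code. The query hash 0111 is from the example; the first comparison vector is an assumed pattern with no matching bits, and the second follows the match/mismatch pattern described above:

```python
import numpy as np

def hamming(a, b):
    # count the positions where the two binary vectors disagree
    return int(np.sum(np.asarray(a) != np.asarray(b)))

xq_hash = [0, 1, 1, 1]  # hashed query vector
vec_1 = [1, 0, 0, 0]    # assumed: no positions match the query
vec_2 = [0, 1, 1, 0]    # only the last position differs

print(hamming(xq_hash, vec_1))  # 4
print(hamming(xq_hash, vec_2))  # 1
```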
We also have to consider that there's a degree of information loss here. 00:16:33.740 |
These binary vectors are how we're storing the vectors; we don't store the original vectors anymore. 00:16:45.980 |
Say our query vector comes through as zero zero one zero. Great, we get a perfect match, but 00:16:52.380 |
in reality, is it closer to one vector in that bucket or to another? We don't actually know. 00:16:58.300 |
So we have to be careful when we're building our buckets 00:17:05.020 |
that there are not too many items within each bucket. 00:17:08.620 |
We need them to be reasonably spread out, but not spread too thin, because if we spread them too thin across 00:17:14.460 |
too many buckets, the index becomes absolutely huge. So we want 00:17:24.860 |
enough granularity in there to differentiate between vectors, 00:17:30.540 |
but not so much granularity that we just make the index bigger, 00:17:36.860 |
because then it's slower than just doing a flat search. 00:17:41.740 |
What we see here is what we just discussed, right? So 00:17:48.380 |
these two vectors, a and b, are reasonably far from each other. If we use an nbits value of two, 00:17:53.900 |
our hashed vectors are not big enough: a and b get bucketed into the same place, and we can't differentiate them. 00:18:01.900 |
But what we can do, obviously, is increase the number of hyperplanes, i.e. increase nbits, 00:18:06.700 |
and then we can differentiate them. So for, you know, these two here, 00:18:14.220 |
the buckets are not the same: we have differences here 00:18:20.620 |
and here, so we can differentiate between them, which obviously is what we need. 00:18:26.620 |
But at the same time, we are increasing the size of our index, so it means we're getting 00:18:35.580 |
more accurate, but we're also getting slower. So 00:18:37.900 |
it's about finding the middle ground between them both. 00:18:42.540 |
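One way to see this granularity trade-off is to hash two fairly distant vectors at different nbits values: with only a couple of hyperplanes they can easily collide into the same bucket, while more hyperplanes tend to separate them. The vectors and seed below are assumed, so the exact codes will vary with the random hyperplanes:

```python
import numpy as np

np.random.seed(0)  # assumed seed; results depend on the random planes

def hash_vector(vec, plane_norms):
    # one bit per hyperplane: 1 on the positive side, 0 on the negative
    bits = (np.dot(vec, plane_norms.T) > 0).astype(int)
    return ''.join(bits.astype(str))

a = np.asarray([1.0, 1.0])
b = np.asarray([-1.0, 1.5])  # reasonably far from a

for nbits in (2, 8):
    planes = np.random.rand(nbits, 2) - 0.5
    print(nbits, hash_vector(a, planes), hash_vector(b, planes))
```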
Now, that's it for the implementation details behind Faiss. What we're going to do 00:18:51.820 |
is leave this video here, and we'll cover 00:18:56.140 |
the Faiss implementation in the next video, which I will leave a link to in the description so you can find it easily. 00:19:02.540 |
But for now, thank you very much for watching, and I'll see you in the next one.