How LSH Random Projection works in search (+Python)
00:00:00.000 |
Okay, so we're on to LSH, or locality sensitive hashing, with random projection. In the 00:00:07.360 |
previous video in the series, if you're following it, we covered LSH, 00:00:12.800 |
but we covered the more traditional version of it. 00:00:19.840 |
What we're covering here is, I suppose, more of a modern implementation of it. 00:00:27.200 |
This is what you'll see in libraries like Faiss, and we'll work through later on how to actually implement this in Faiss. 00:00:34.240 |
So what we're going to cover in this video is specifically LSH random projection. 00:00:41.120 |
We're going to work through a few visualizations to try and get a grasp of what it is actually doing, 00:00:47.520 |
and whilst we're doing that, we'll also work through how to implement it in very simple Python code. The Faiss 00:00:56.620 |
implementation of this, which is obviously much more efficient than what we'll be covering in this video, 00:01:04.220 |
we'll cover in a separate video, because otherwise this one would just be huge. 00:01:11.660 |
So if you would like to see that, I'll make sure there's a link to it in the description. 00:01:22.780 |
So at the top here we have this first kind of row, if you like. 00:01:28.540 |
Here we're minimizing collisions. This blue bit in the middle is a hash function, 00:01:38.380 |
and we're passing values through it, and it's making sure that we put both of those into separate buckets. 00:01:49.260 |
An LSH function tries to do the opposite: it maximizes collisions, but it only tries to maximize collisions for similar vectors. 00:02:03.500 |
In short, it is a hashing function that tries to bucket similar vectors together. 00:02:16.620 |
Imagine our database vectors are indexed already, so they've all been bucketed. 00:02:25.500 |
Then we take our query vector, process it through the same bucketing mechanism, and see where it lands. 00:02:35.900 |
Say it has landed in this bucket; that means that the most similar 00:02:40.780 |
other vectors to it are the ones that are either in the same bucket or in the neighboring buckets. 00:02:51.660 |
To do this, LSH converts our vectors into low-resolution vectors, which makes them much cheaper to store and compare. 00:02:58.380 |
And that is what you can see here: we have our dense vectors at the top. Typically we're using dense vectors for this, 00:03:07.580 |
and they can contain hundreds or thousands of values, typically all floating point numbers. 00:03:18.860 |
What we do is convert them into a very small 00:03:23.340 |
binary vector, which is what you can see at the bottom. So it's a lot more memory efficient, and 00:03:30.300 |
usually it should be faster to search, although sometimes it can actually go the other way 00:03:37.580 |
and become slower to search 00:03:40.700 |
than just using a flat index. Now, at the same time, because we are compressing our vectors, 00:03:50.060 |
accuracy is obviously going to decrease. What we want to do really is to maintain decent accuracy 00:03:57.660 |
whilst speeding up the search. Our aim is, using xq, which is our query vector, to find its nearest neighbors 00:04:07.980 |
as accurately as possible. Obviously we're approximating, so it won't be perfect, but that's fine as long as we get a decent speed-up. 00:04:24.540 |
To do this, LSH random projection splits our highly dimensional vector space using 00:04:29.340 |
hyperplanes, so, I mean, it's what you can see right here. 00:04:36.300 |
Here we have a single hyperplane. If your vector is on the positive side of that hyperplane, 00:04:50.940 |
then when we process it, in our binary vector it would be assigned a one. 00:04:55.660 |
On the other side you have the negative side of the 00:05:00.440 |
hyperplane, and if your vector is on that side of it, it will be assigned a negative value by the dot product, and 00:05:10.840 |
in our binary vector it would be assigned a zero. 00:05:14.600 |
The reason that works is we're using the dot product value. So imagine this green line 00:05:20.360 |
down here is the hyperplane that I just showed you, and we have its normal vector. 00:05:31.240 |
If the dot product finds that our vector and that normal vector point in the same direction, 00:05:45.080 |
then it will come out positive; if our vector is on the other side, so 00:05:54.120 |
anywhere over here, it would come out as a negative dot product, which is what you can see. 00:06:04.760 |
Now, a single binary value isn't going to tell us much about the 00:06:08.040 |
direction of our vector or the position of our vector, 00:06:12.680 |
so what we do is just add more hyperplanes in there. We 00:06:18.440 |
add more hyperplanes, and that gives us more binary values within our vector. 00:06:23.080 |
So what you can see here is we have two hyperplanes. 00:06:45.800 |
In practice, what we would do is essentially just use loads of hyperplanes. 00:06:52.600 |
The big arrows with the points that you see there are the normal vectors, which is where we're actually calculating the dot product, 00:07:01.880 |
and these lines are our hyperplanes. Okay, so here, for blue, we would get a value of zero, 00:07:10.520 |
and then here, for example, for blue we would get a value of one. 00:07:17.400 |
Now let's start building this out in code so we can actually see how it works. 00:07:23.800 |
So what I'm going to do is set the number of hyperplanes that we would like, and we set that using a variable, nbits. 00:07:33.080 |
I'm going to say we have four hyperplanes, just for this example; in reality you would use more. 00:07:38.280 |
But for now we're going to go with four, so we have four binary values here. 00:07:44.680 |
And all I'm going to do is create our vector dimensionality as well. So we're just going to use 2D vectors 00:07:51.400 |
to make it easier, and so we can sort of visualize stuff as well. 00:07:55.400 |
So all we need to do is create the plane norms, and they are going to be numpy random values, 00:08:05.320 |
and the shape there will be nbits by d: 00:08:09.080 |
the number of hyperplanes that we want, nbits, 00:08:14.200 |
and the dimensionality that we're actually using, d. 00:08:16.760 |
And we're just going to subtract 0.5 because we want them to center 00:08:22.200 |
around the origin. You don't need to do this, by the way; 00:08:26.760 |
it's just so we can kind of see the effect of it a little bit better. 00:08:36.920 |
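Put together, the setup described above looks something like this; the names nbits, d, and plane_norms follow the walkthrough:

```python
import numpy as np

nbits = 4  # number of hyperplanes, and binary values per hashed vector
d = 2      # vector dimensionality, kept at 2 so it is easy to visualize

# one row per hyperplane: each row is that hyperplane's normal vector;
# subtracting 0.5 shifts the values from [0, 1) to [-0.5, 0.5) so the
# planes center around the origin (optional, just for visualization)
plane_norms = np.random.rand(nbits, d) - 0.5
print(plane_norms.shape)  # (4, 2)
```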
It won't look identical to this visualization, but it's essentially the same thing. What we have done is we've created four 2D plane norms. 00:08:56.280 |
Next we take three vectors, a, b, and c, and what we're going to do is calculate the dot product for each one of those, 00:09:06.360 |
to see whether it lands positive or negative against each of our plane norms. 00:09:09.960 |
So to do that, we're just going to do np.dot with our vector and the plane norms. 00:09:23.480 |
Okay, so we see that we get negative, negative, positive, positive. So 00:09:28.440 |
when we convert this into our binary vector, that will be zero, zero, one, one. 00:09:46.040 |
Okay, if it's negative it's a zero; if it's positive, it's a one. To do that, 00:10:10.760 |
we check whether each value is greater than zero. Okay, so now we get false, false, true, true. The final thing to do there, 00:10:16.680 |
although you don't necessarily need to, is to convert them to integers. In fact, we do need to, purely to create our bucket keys later on. 00:10:34.360 |
We use astype(int), so it's essentially easier for us to visualize as well. 00:10:41.240 |
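As a sketch, hashing a single vector through these steps might look like this; the example vectors a, b, and c (and the random seed) are assumed values for illustration, not the exact ones used on screen:

```python
import numpy as np

np.random.seed(0)  # assumed seed, purely for reproducibility
nbits, d = 4, 2
plane_norms = np.random.rand(nbits, d) - 0.5

# example 2D vectors (illustrative values)
a = np.asarray([1, 2])
b = np.asarray([2, 1])
c = np.asarray([3, 1])

# dot product of the vector against every hyperplane normal
a_dot = np.dot(a, plane_norms.T)

# positive side of a hyperplane -> True, negative side -> False
a_dot = a_dot > 0

# convert the booleans to 0/1 integers for readability (and for
# building the bucket keys later)
a_dot = a_dot.astype(int)
print(a_dot)
```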
Okay, and we should get something that looks more like this here. So 00:10:48.600 |
you see here, obviously, the values, the positions, are slightly different, 00:10:53.400 |
but we have a on the positive side of the teal hyperplane, so that value is of course 00:10:59.800 |
a one, and then c and b are both on the other side. 00:11:13.240 |
c and b are in a similar position, so we would hope that they kind of align 00:11:18.200 |
in the values that they have a bit better than they do with a. 00:11:25.800 |
Okay, so they're the same; that's good, because they are 00:11:33.480 |
very much in a similar place, which you can see from here. 00:11:37.560 |
So a, b, and c in this case match up to what we're writing in the code; it's just the hyperplanes that are different. 00:11:43.160 |
Now, it's these binary vectors that we use to create our LSH buckets. 00:11:50.360 |
So what we're going to do is actually implement that using just a Python dictionary. 00:11:58.760 |
What we'll do is I'm going to put each of our vectors into a 00:12:09.420 |
vector list; it's just so we can iterate through them a little bit easier. 00:12:13.020 |
I'm going to initialize our buckets, which is going to be, like I said, a Python dictionary, 00:12:17.680 |
and here I'm just going to set i equal to zero. This 00:12:21.420 |
is just so we can loop through each of those vectors. 00:12:34.380 |
So the first thing we want to do is create a hash string using the 00:12:40.540 |
hashed vectors that we have up here. So I'm going to build hash_str, 00:12:48.380 |
and what we want to do is join all of those numbers together, but we need to join them as strings. 00:13:15.340 |
Then what we want to do is say: if the hash string is not already within our buckets' 00:13:21.900 |
keys, we want to initialize a new list for it. 00:13:35.820 |
We're essentially initializing a bucket to put our vectors in. 00:13:43.420 |
We initialize it, and then after that we just append the vector to that bucket. 00:13:57.260 |
Okay, it should be fine, so let's print the buckets and see what we get. 00:14:13.980 |
Vectors one and two have both been put into the same bucket. All right, and that's essentially how we build our buckets. 00:14:29.420 |
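The bucketing loop described above can be sketched end to end like this; the database vectors here are assumed example values:

```python
import numpy as np

np.random.seed(0)  # assumed seed, for reproducibility
nbits, d = 4, 2
plane_norms = np.random.rand(nbits, d) - 0.5

# example database vectors (illustrative values)
vectors = [np.asarray([1, 2]), np.asarray([2, 1]), np.asarray([3, 1])]

buckets = {}
for i, v in enumerate(vectors):
    # hash: one bit per hyperplane, from the sign of the dot product
    hash_vec = (np.dot(v, plane_norms.T) > 0).astype(int)
    # join the bits into a string key, e.g. '0011'
    hash_str = ''.join(hash_vec.astype(str))
    # initialize a new bucket (list) if we haven't seen this key yet
    if hash_str not in buckets:
        buckets[hash_str] = []
    # store the vector's position so we can look it up later
    buckets[hash_str].append(i)

print(buckets)
```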
Now, let's say we have our buckets, 00:14:39.180 |
and we have a query vector xq that we want to search with; we hash it and search using that. 00:14:49.020 |
So we've got two examples here: 00:14:53.820 |
we're comparing our query vector against two already-bucketed vectors. 00:15:02.920 |
So xq here has been hashed to zero one one one. 00:15:15.100 |
We compare it to this vector, and we're using Hamming distance here, which essentially asks: are these two values equivalent? 00:15:24.140 |
If they are, it's zero; there's no distance between them. 00:15:28.780 |
If they do not match, then the distance is one, and then we add those up across all positions. 00:15:38.540 |
With this one, none of them match, so we get one plus one plus one plus one, which is obviously four. 00:15:45.180 |
So the Hamming distance between those two is four, which is 00:15:48.060 |
the biggest you're going to get with this dimensionality of binary vectors. 00:15:53.180 |
And then we have the second one, and these ones match a bit better: zero is equal to zero, one to one, one to one, 00:16:03.340 |
so that's zero, zero, zero, and then this final pair differs, so the Hamming distance there is one. 00:16:26.060 |
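The Hamming distance comparison can be checked directly in code. The query hash 0111 is from the example; the first comparison vector is an assumed pattern with no matching bits, and the second follows the match/mismatch pattern described above:

```python
import numpy as np

def hamming(a, b):
    # count the positions where the two binary vectors disagree
    return int(np.sum(np.asarray(a) != np.asarray(b)))

xq_hash = [0, 1, 1, 1]  # hashed query vector
vec_1 = [1, 0, 0, 0]    # assumed: no positions match the query
vec_2 = [0, 1, 1, 0]    # only the last position differs

print(hamming(xq_hash, vec_1))  # 4
print(hamming(xq_hash, vec_2))  # 1
```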
We also have to consider that there's a degree of information loss here. 00:16:33.740 |
These binary vectors are how we're storing the vectors; we don't store the original vectors anymore. 00:16:45.980 |
Say our query vector comes through as zero zero one zero. Great, we get a perfect match, but 00:16:52.380 |
in reality, is it closer to one vector in that bucket or to another? We don't actually know. 00:16:58.300 |
So we have to be careful when we're building our buckets 00:17:05.020 |
that there are not too many items within each bucket. 00:17:08.620 |
We need them to be reasonably spread out, but not spread too thin, because if we spread them too thin across 00:17:14.460 |
too many buckets, the index becomes absolutely huge. So we want 00:17:24.860 |
enough granularity in there to differentiate between vectors, 00:17:30.540 |
but not so much granularity that we just make the index bigger, 00:17:36.860 |
because then it's slower than just doing a flat search. 00:17:41.740 |
What we see here is what we just discussed, right? So 00:17:48.380 |
these two vectors, a and b, are reasonably far from each other. If we use an nbits value of two, 00:17:53.900 |
our hashed vectors are not big enough: a and b get bucketed into the same place, and we can't differentiate them. 00:18:01.900 |
But what we can do, obviously, is increase the number of hyperplanes, i.e. increase nbits, 00:18:06.700 |
and then we can differentiate them. So for, you know, these two here, 00:18:14.220 |
the buckets are not the same: we have differences here 00:18:20.620 |
and here, so we can differentiate between them, which obviously is what we need. 00:18:26.620 |
But at the same time, we are increasing the size of our index, so it means we're getting 00:18:35.580 |
more accurate, but we're also getting slower. So 00:18:37.900 |
it's about finding the middle ground between them both. 00:18:42.540 |
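One way to see this granularity trade-off is to hash two fairly distant vectors at different nbits values: with only a couple of hyperplanes they can easily collide into the same bucket, while more hyperplanes tend to separate them. The vectors and seed below are assumed, so the exact codes will vary with the random hyperplanes:

```python
import numpy as np

np.random.seed(0)  # assumed seed; results depend on the random planes

def hash_vector(vec, plane_norms):
    # one bit per hyperplane: 1 on the positive side, 0 on the negative
    bits = (np.dot(vec, plane_norms.T) > 0).astype(int)
    return ''.join(bits.astype(str))

a = np.asarray([1.0, 1.0])
b = np.asarray([-1.0, 1.5])  # reasonably far from a

for nbits in (2, 8):
    planes = np.random.rand(nbits, 2) - 0.5
    print(nbits, hash_vector(a, planes), hash_vector(b, planes))
```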
Now, that's it for the implementation details behind Faiss. What we're going to do 00:18:51.820 |
is leave this video here, and we'll cover 00:18:56.140 |
the Faiss implementation in the next video, which I will leave a link to in the description so you can find it easily. 00:19:02.540 |
But for now, thank you very much for watching, and I'll see you in the next one.