Product Quantization for Vector Similarity Search (+ Python)
Chapters
0:00 Introduction
2:42 Dimensionality Reduction
4:50 Product Quantization
9:09 Python Code
18:31 Coding
21:16 Visualizing
22:26 Quantizing
00:00:00.000 |
Welcome to the next video in our series on similarity search. We're going to be covering 00:00:05.760 |
product quantization, which is a very effective method for reducing the memory usage of our 00:00:14.800 |
vectors. Now, I'll just show you this very quickly. So we, in the first column over here, 00:00:22.560 |
we have the recall, speed, and memory for a dataset of size 1 million. So we're using 00:00:30.480 |
the SIFT 1M dataset here. And the dimensionality of the vectors in there is 128. That's very small 00:00:36.240 |
compared to a lot of vector sets out there. Now, the memory usage of that, when we just store them 00:00:44.000 |
as they are, is a quarter of a gigabyte, just over, which is pretty big. And you think this 00:00:50.800 |
is a small dataset, small dimensionality, so this is already pretty significant. And when you start 00:01:00.320 |
getting into high dimensionality or just larger datasets, it quickly becomes very unmanageable. 00:01:07.600 |
So product quantization, or PQ, allows us to reduce that significantly. So in this example, 00:01:15.920 |
so the tests that we run here, we reduce the memory usage by 97%, which is pretty significant. 00:01:23.360 |
So from a quarter of a gigabyte to 6.5 megabytes, it's like nothing. And then the speed increase is 00:01:29.280 |
pretty significant as well. So just under a six-fold speed increase there. So pretty good. 00:01:36.240 |
Obviously, the recall also decreases a fair bit. So that's just something that we need to be 00:01:42.000 |
careful with. But at the same time, there's different parameters that we can use in order to 00:01:48.880 |
fine-tune that. So what we're just going to cover in this video is the logic behind product 00:01:57.120 |
quantization. There will also be another video, which will be following this one, where we'll 00:02:02.960 |
have a look at product quantization in Faiss. And we'll also have a look at a composite 00:02:08.320 |
index of product quantization and inverted files, so IVF. And with that, we can actually make the 00:02:16.800 |
speed even faster. The memory goes up a little bit to, I think it's like nine megabytes, so pretty 00:02:22.960 |
minimal. But the speed increase is very significant. It's like a 92 times speed increase over a flat 00:02:32.480 |
index. So it's super fast. But we'll be covering that in the next video. For this one, I'm just 00:02:38.000 |
going to work through what product quantization is and how it works. So we'll start with the 00:02:43.440 |
example of dimensionality reduction, which is what you can see here. So we have this S value up here, 00:02:49.040 |
which is kind of like the scope of possible vectors that we can have, so all along here. 00:02:56.000 |
In the middle here, I've highlighted a single vector. So this would be X, our vector. And D 00:03:03.440 |
over here is, of course, our dimensionality of that vector. So typically, well, in this SIFT 1M 00:03:10.240 |
data set, for example, we'd use a dimensionality of 128. Dimensionality reduction is something 00:03:17.200 |
like PCA, for example, where we would just reduce the dimensionality there. And we would try to 00:03:24.320 |
maintain the geometric or spatial properties of that vector. Even though it has less dimensions, 00:03:31.040 |
we would still try and maintain that. So vectors that were close to each other before with a high 00:03:35.920 |
dimensionality should be close to each other again in low dimensionality. Now, what we have here is 00:03:43.760 |
not dimensionality reduction. This is quantization. Now, quantization is a very generic term. And it 00:03:50.800 |
focuses on any method where we're trying to reduce the scope of our vectors. Now, in this case, 00:03:57.600 |
so beforehand, the scope is typically either a very large number or technically infinite. 00:04:04.400 |
So if we're using floats, for example, the number of possible vectors that you can have within that 00:04:10.480 |
space is usually pretty big. You may as well say it's infinite. Now, in this case, what we do is 00:04:18.400 |
reduce that. So we reduce the possible scope into a more finite set. Then we may be going from 00:04:27.680 |
something like a practically infinite scope to something like 256 possible vectors. That would 00:04:35.680 |
be a very small number, but it's just an example. Then notice that the dimensionality does not 00:04:40.960 |
necessarily change. It can change with quantization, but that's not the point of 00:04:46.000 |
quantization. The point is to reduce the scope of our possible vectors. Now, at a very high level, 00:04:52.640 |
let's just kind of cover what product quantization is doing. So what we have here, 00:04:59.440 |
this big blue block at the top here, this is our initial vector x. Now, we'll say dimensionality, 00:05:09.040 |
this is 128. So D equals 128. Now, the number of bits required to store this vector, let's say it's 00:05:18.800 |
every value is a 32-bit float. So we have 32 bits multiplied by the dimensionality is 128. 00:05:27.440 |
So in total, that gives us 4,096 bits that we need to store a single vector. Now, that is pretty 00:05:35.520 |
significant. So what we are going to do with PQ is try our best to reduce that. First thing we do 00:05:41.200 |
with product quantization is split our vector x into subvectors, each one represented by u. So u 00:05:51.040 |
with a subscript of j, where j is a value from 0 up to 7 in this case, representing the position of 00:05:57.280 |
that subvector. Now, the next step is the quantization of product quantization. So the 00:06:04.240 |
next step, we process each one of these subvectors through a clustering algorithm. Each of these 00:06:09.200 |
clustering algorithms is specific to each subspace. So it's specific to J equals 0, 00:06:15.600 |
1, 2, 3, or so on. And within each one of those clustering sets, we have a set of centroids. Now, 00:06:22.880 |
the number of centroids, we set that beforehand. So when we're building our index, we set the 00:06:28.880 |
number of centroids that we want. And the more centroids we have, the more memory we use, but 00:06:34.480 |
also the more accurate the index becomes. Because you think you have, say you have two centroids 00:06:41.520 |
in one vector subspace. Your subvector will have to be assigned to one of those two centroids. 00:06:46.880 |
And in the case that we only have two centroids, neither of them could be particularly close 00:06:51.520 |
to our vector or subvector. So that average distance between subvectors and the clusters 00:06:58.080 |
that they're assigned to is called the quantization error. And for good accuracy, good recall, good 00:07:05.040 |
results, we want to minimize that quantization error as much as possible. And we do that by 00:07:10.000 |
adding more centroids. So if we compare our cluster of two centroids to a set of clusters 00:07:17.280 |
where we have, let's say, 200 centroids, the chances are we're going to find a centroid that 00:07:22.880 |
is more closely fitting to our subvector in the 200 centroid algorithm than the one with just two 00:07:29.360 |
centroids. So that's the logic behind it, sort of the memory usage and accuracy trade-off there. 00:07:34.480 |
Obviously, the more centroids we have, the more data we store in our index to cover the full 00:07:38.960 |
scope of all those centroids. Now, once we've assigned a centroid to our subvector, we map 00:07:45.840 |
that centroid to its unique ID. So every centroid within this index will have a unique ID. That 00:07:52.480 |
unique ID is an 8-bit integer, typically. So we've now gone from-- so originally, we had a vector x 00:08:00.960 |
of dimensionality 128. And each value was a 32-bit float. So the number of bits we use there, 00:08:09.520 |
4,096. Now we only have an ID vector, which contains eight 8-bit integers. So that's 8 00:08:20.240 |
multiplied by 8. In this case, we have 64 bits. So we've compressed our 4,096-bit vector into 00:08:30.080 |
a 64-bit vector. And that's what gets stored in our index. 00:08:36.800 |
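To make that saving concrete, here is a quick back-of-the-envelope calculation of the numbers mentioned above, assuming m = 8 subvectors and 8-bit centroid IDs as in the example (this is just arithmetic, not code from the video's notebook):

```python
# rough arithmetic for the example above, not code from the video's notebook
d = 128                 # original dimensionality
bits_original = 32 * d  # one 32-bit float per dimension -> 4096 bits
m = 8                   # number of subvectors, so 8 centroid IDs per vector
bits_pq = 8 * m         # one 8-bit ID per subvector -> 64 bits
print(bits_original, bits_pq, bits_original // bits_pq)  # 4096 64 64
```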
Now, there's also something called a codebook, which maps each one of those IDs back to the 00:08:43.440 |
reconstruction or reproduction values, which are those centroid vectors. But of course, 00:08:50.240 |
because we've minimized the scope there, we don't need to store that many of those mappings. So say 00:08:57.360 |
our scope is 256, we only need to store 256 of those mappings. So this is how product quantization 00:09:07.280 |
can be so effective. Now, we'll go through some more visuals. But I want to actually write this 00:09:15.360 |
out in code as we go along. For me, at least, writing out in code makes it much easier to 00:09:21.200 |
at least logically understand the process. So there are a few variables that we need to define 00:09:30.320 |
here. So we have this vector up here. It's very small. It's using integer values. The data types 00:09:37.680 |
here don't matter. We just want to build out the process of product quantization. This is not an 00:09:44.640 |
effective implementation or anything like that. We just want to understand the process and logic 00:09:48.720 |
behind it. So the first thing is, how many subvectors are we going to convert our x full 00:09:56.080 |
vector into? In this case, I want to create four subvectors. And we also need to get the 00:10:03.040 |
dimensionality of x as well, which is just the length of x. Now, we have those two. And the 00:10:09.280 |
first thing we need to do is make sure that d is divisible by m, because we need equally sized 00:10:17.600 |
subvectors. So what we do is write: assert that d is, in fact, divisible by m. So we just write 00:10:26.640 |
this. And if that is the case, we don't get an error. So if I just add a 1 in there, I'm going 00:10:32.880 |
to throw it off. We get this assertion error. So we just add that in there to say, OK, d is 00:10:38.960 |
definitely divisible by m. So that's good. We can continue. And what we want to do is say, OK, 00:10:46.160 |
what is the length of each subvector? It's going to be equal to d divided by m. And we'll have to 00:10:52.960 |
convert that over to an int as well. So we'll just do that. And let's see what we get. So each 00:11:01.520 |
subvector will be of size 3. Now, let's build that. So our subvector, or our set of subvectors, 00:11:11.200 |
will be assigned to u. And what we're going to do is go x from row to row plus d underscore, 00:11:21.520 |
for row in range 0 to d. And we're going to take it in steps of d underscore. OK. And then let's 00:11:30.880 |
see what we get. And see that we now have our subvectors. Now, what does that look like? Let's 00:11:37.520 |
visualize that. So here we have that vector. So d up here is our 12. And we're splitting that vector 00:11:46.240 |
x of dimension 12 into four subvectors. So m equals 4 down here. We'll create four subvectors. 00:11:54.480 |
Now, d over m, which is what we calculated before, produces this d star. Now, in the code, 00:12:00.880 |
we can't write the star. So I've changed it to d underscore. And the value of that is 3. So each of 00:12:08.800 |
our subvectors has a dimensionality of 3. So here we have that d star dimensionality of 3 here. 00:12:17.120 |
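Written out, a minimal sketch of those first steps might look like the following (the values in x are illustrative, not necessarily the ones used in the video's notebook):

```python
# A minimal sketch of the steps described above; the exact values in x here
# are illustrative, not necessarily the ones used in the video's notebook.
x = [1, 8, 3, 9, 1, 2, 9, 4, 5, 4, 6, 2]  # toy 12-dimensional vector

m = 4              # number of subvectors we want
d = len(x)         # full dimensionality, 12
assert d % m == 0  # d must split evenly into m subvectors
d_ = int(d / m)    # dimensionality of each subvector ("d*" in the visuals), 3

# split x into m subvectors of length d_
u = [x[row:row + d_] for row in range(0, d, d_)]
print(u)  # [[1, 8, 3], [9, 1, 2], [9, 4, 5], [4, 6, 2]]
```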
These are our subvectors. We have 4 in total. So that is equal to m. Now, next thing I want to do 00:12:24.400 |
is to set k. So k is the total number of possible centroids that we're going to have. So we 00:12:34.880 |
are going to do 2 to the power of 5. And we print it out. And we get 32. So k is going to be equal 00:12:42.160 |
to 32. Now, the one thing we need to make sure of here is this k value, so the number of possible 00:12:51.200 |
centroids, that is going to be shared across our entire set of subspaces or subvectors. 00:13:00.320 |
So we need to make sure that that is divisible by m again. So we do the same thing again. We say 00:13:06.480 |
assert that k is, in fact, divisible by m. If it is, we can go on to get k star or k underscore, 00:13:16.560 |
which is going to be k divided by m. OK. And let's have a look at what that is. 00:13:22.960 |
So this means that with that, we should also make sure that's an integer, 00:13:27.840 |
like so. OK. So what that means is that we are going to have 8 centroids per subspace. So each 00:13:40.560 |
one of these subvectors here will have the possibility of being assigned to one of the 00:13:47.200 |
nearest 8 centroids within its subspace. And in total, across all of our subvectors, of course, 00:13:55.040 |
we have a total of 32 possible centroids. Now, what that 32 there means is that we will have 00:14:03.680 |
a codebook. So a codebook is simply a mapping of our centroid vectors. Or, in fact, let's do the 00:14:11.840 |
other way around. It's a mapping of centroid vector IDs. So each one of the centroids that 00:14:17.360 |
we haven't created yet but we will create will be assigned a unique ID. And it maps us from the 00:14:23.280 |
unique ID to the actual centroid subvector. So that's where later on we call those centroids 00:14:32.800 |
reproduction values. Because later on, when we're searching through our index, our index is storing 00:14:39.440 |
those 8-bit integers. But obviously, we can't compare those 8-bit integers to an actual vector 00:14:45.760 |
that we're searching with, our query vector. So what we do is we map each one of those 8-bit 00:14:53.200 |
integers, which are the IDs, back to their original centroid values. Not to the original subvector 00:15:00.160 |
values, but to the original centroid values. And that's what our codebook is for. So this k value, 00:15:07.600 |
this 32, means that we will have a total of 32 mappings in our codebook. Now, I know that can 00:15:15.680 |
maybe seem confusing. So let's have a look at what that actually looks like. So over here, we have 00:15:21.040 |
u and our subvectors. Here, we would have, let's say, j equals 0, j equals 1, 2, 3. And over here, 00:15:30.480 |
what we have is our reproduction value. So those centroid or those cluster centroids. Now, in this 00:15:38.000 |
case, we have these three centroids. In reality, we set k equal to 8. So we should actually see 00:15:45.360 |
not 3 centroids here, but we would actually see 8. So we have 1, 2, 3, 4, 5, 6, 7, 8. So in reality, 00:15:54.960 |
there would actually be that many centroids in there. And each one of those centroids can be 00:16:00.000 |
referred to as c, j, i. So in this case, j would be 0. And for i, we would have, let's say, this centroid 00:16:09.920 |
is i equals 0, this centroid is i equals 1, this centroid is i equals 2, and so on, all the way up 00:16:16.960 |
to, in this case, 7. So i goes from 0 to 7, giving us the 8 centroids there. So to get to this centroid here 00:16:25.280 |
in our codebook, so this c represents our codebook, we would go c, 0, 0. Or this one over here, this 00:16:34.480 |
number 2, would be c, 0, 2. Let's say down here, this green centroid, let's say that is i equals 00:16:44.160 |
4. So that means that to get to that in our codebook, we would say c, 3, 4. Okay, so that's 00:16:52.240 |
just the notation behind it. So our quantized subvector is one of these centroid subvectors. 00:16:59.760 |
And each one of our reproduction values, or those subvectors, the quantized subvectors, 00:17:06.640 |
will be assigned a reproduction value id, which is simply what we have here. So this c, j, i. 00:17:13.360 |
So in this case, this value here may be the 8-bit integer value 0. And if we started 00:17:21.440 |
counting through all of our codebook, that would relate to c, 0, 0. Now if we were to go, let's say 00:17:29.840 |
down here, this is number 10, the integer value of 10. Now, if we count through, i is going 00:17:38.080 |
to go up to 7, and that's our integer value of 7. And then it's going to get reset because we're going to 00:17:44.960 |
take j from 0 to 1, which is going to reset our i counter. So it will go from c, 0, 7. And the 00:17:56.160 |
next one in our codebook will be down here, which will be c, 1, 0. So this actual integer value in 00:18:05.920 |
our id vector, that would be represented by the value 8. So this value here would be 8, 00:18:13.840 |
so we add one more. We go to 9, so that would be c, 1, 1. And 10 would be c, 1, 2. So c, 1, 2. Okay, 00:18:22.560 |
and that's how we represent those. And we refer back to those original centroid subvectors within 00:18:29.120 |
the reproduction values codebook. But for now, we don't have those centroids, so we need to 00:18:34.880 |
create some. So what I'm going to do is from random, import randint. C is going to be our 00:18:42.800 |
overall codebook or list of reproduction values. And we're going to say for j in range m, 00:18:50.960 |
so we're looping through each of our subspaces here, we are going to initialize cj. Okay, 00:18:59.840 |
and then in here, we need to store eight different centroid vectors or reproduction values. 00:19:07.280 |
And of course, we refer to those positions as i, so we write: for i in range. 00:19:13.760 |
And here, we are looping through k_. So in our case, that would be equal to 8. So we 00:19:20.480 |
run through that 8 times. Here we have m, which is 4. We loop through 4 j's, 00:19:29.680 |
and we loop through 8 i's for each one of those. So in total, we get that 32 value, 00:19:36.000 |
which is our k value. And what we want to do is say we're going to say cji is going to be equal 00:19:43.360 |
to, we want to say randint from 0 up to, it's going to be our vector space, our cluster vector 00:19:51.360 |
space. So we're just going to set it to the maximum value that we have in here. So we have a 9. 00:19:59.440 |
So we set that. So these are the centroid values, remember, not the IDs. Then for just an underscore in 00:20:08.320 |
range. Here we have our dimensionality of each subvector. And what we want to do is just append 00:20:16.160 |
that to cj, cji. And then outside that loop, we want to append cj to c. 00:20:26.000 |
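A rough sketch of that codebook-building loop, continuing from the split sketch above (so m and d_ are already defined); as in the video, random centroids stand in for trained ones here:

```python
from random import randint

# continuing from the split sketch above: m = 4 subspaces, d_ = 3 dims per subvector
k = 2 ** 5         # total number of centroids shared across all subspaces, 32
assert k % m == 0  # k must split evenly across the m subspaces
k_ = int(k / m)    # centroids per subspace, 8

# build a toy codebook: for each subspace j, k_ random centroid vectors of length d_
# (a real index would learn these centroids with k-means rather than sampling them)
c = []
for j in range(m):
    c_j = [[randint(0, 9) for _ in range(d_)] for _ in range(k_)]
    c.append(c_j)

print(len(c), len(c[0]), len(c[0][0]))  # 4 subspaces, 8 centroids each, dimension 3
```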
Okay, so we've now just created our clusters, which we can see in here. So each subspace, 00:20:37.600 |
so each j has a total of 8 possible centroids. So we have 1, 2, and so on up to 8 here in each one. And 00:20:49.600 |
we don't forget we have m of those subspaces, so we should see 4 of them, although it does 00:20:56.000 |
cut off here, but this is in fact just the fourth one here. Now I think it's pretty cool to be able 00:21:03.120 |
to visualize that. So I'm just going to copy and paste this code across. You can see here, 00:21:08.960 |
you can copy if you want. It will also be in the notebooks that will be linked in the description 00:21:14.640 |
of the video. And we're just going to visualize each of our centroids within each of those 00:21:24.320 |
subspaces. So this is c, where j is equal to 0, j is equal to 1, 2, and 3. And these here 00:21:33.360 |
are our centroids. Now typically, of course, we would train these, but we are not going to train 00:21:39.920 |
them. We just want to create a very simple example here. So we're not going to go that far into it. 00:21:47.920 |
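The plotting code isn't reproduced in this transcript, but a rough matplotlib sketch along the same lines (one panel per subspace, scattering the first two dimensions of each centroid; not the notebook's exact code, and it assumes matplotlib is available plus m and c from the sketches above) might be:

```python
import matplotlib.pyplot as plt

# one panel per subspace j, scattering the first two dimensions of its centroids
fig, axes = plt.subplots(1, m, figsize=(4 * m, 3))
for j, ax in enumerate(axes):
    xs = [centroid[0] for centroid in c[j]]
    ys = [centroid[1] for centroid in c[j]]
    ax.scatter(xs, ys)
    ax.set_title(f"subspace j = {j}")
plt.show()
```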
Okay, so I mean that's everything we see there. And now what we need to do, so we've produced 00:21:54.160 |
those reproduction values, all those centroids, and now what we need to do is actually take our 00:22:01.520 |
subvector and assign it to the nearest reproduction value or centroid. Now how do we do that? We 00:22:08.160 |
simply calculate the distance between our subvector and each of its equivalent reproduction values or 00:22:16.640 |
centroids. I'm sure it's going to get annoying if I keep saying reproduction values or centroids, 00:22:22.720 |
so I'm just going to call them centroids from now on. It's so much easier. So we are going to 00:22:27.920 |
calculate, we're going to identify the nearest of our centroids to our specific subvector. So 00:22:36.560 |
I'm just going to copy these functions in. The top here, we're just calculating the Euclidean distance. 00:22:44.640 |
It's pretty straightforward. If you need to, just Google Euclidean distance; the formula is pretty 00:22:52.720 |
straightforward. And then here we're just using that Euclidean distance and looping through each 00:22:58.480 |
of our values, so the K_, so each of the centroids within our specific J subspace. So all we need to 00:23:07.920 |
do there is pass a specific J subspace and our specific subvector to that function. And to do 00:23:18.480 |
that, all we want to do, I'll just also note that what we're doing here, these are the index 00:23:26.240 |
positions, so the i positions, and we're looping through and saying if the new distance that we 00:23:32.240 |
calculated between, you see here we're accessing that specific centroid, if that is less than the 00:23:38.800 |
previous one between our subvector and that centroid, then we assign the nearest ID to that 00:23:46.880 |
vector, not necessarily distance. We don't care about distance, we just want to see the nearest ID 00:23:52.720 |
or nearest_idx, which is what we return. So what we're going to return there are in fact 00:23:59.040 |
the i values for those centroids. So first thing we're going to do is initialize our IDs vector, 00:24:05.520 |
so here we're actually going to build that vector of 8-bit integer IDs. In this case, there's not going to be 00:24:12.240 |
eight of the values, there will be four, because the length of this vector is always going to be 00:24:17.760 |
equal to m. So we have: for j in range m, i is equal to the nearest between c j and u j. I want to say 00:24:36.800 |
ids append i. And then let's see what we have. Okay, so that is our quantized subvector, so each 00:24:46.960 |
one of these IDs here represents one of our centroid positions. So if we wanted to get those 00:24:56.480 |
centroid positions, so in this case I'm going to call them the reproduction or reconstruction values, 00:25:01.520 |
we would do this. So this is going to be our reconstructed vector. 00:25:07.040 |
I'm going to go between, so we're going to go for each J in range M, again remember, 00:25:15.840 |
we're going to say CJI, so the reconstruction value, is equal to C for J. And then the I 00:25:27.680 |
position is whichever value we have in here, so we need to say ids of j. Okay, and then we're going to 00:25:35.440 |
do Q, extend, CJI, and now let's have a look at what we get for Q. So this is our fully reconstructed 00:25:46.160 |
vector. Now earlier on I said that we get the quantization error, and we typically measure that 00:25:53.920 |
using the mean squared error, so this function here, you can see it there. So what we can do 00:26:01.920 |
is we can measure the error, or mean squared error, between our quantized vector and the 00:26:08.400 |
original vector. So all we do here is we go mean squared error, X, and Q. And here we get the mean 00:26:17.040 |
squared error between our original vector and the quantized version. 00:26:22.960 |
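Gathering those steps into one place, a minimal sketch could look like this (continuing from the earlier snippets, so x, u, c, m, and k_ are already defined; these are not the video's exact functions):

```python
def euclidean(v, w):
    # plain Euclidean distance between two equal-length lists
    return sum((a - b) ** 2 for a, b in zip(v, w)) ** 0.5

def nearest(c_j, u_j):
    # index i of the centroid in subspace j that is closest to subvector u_j
    distances = [euclidean(c_j[i], u_j) for i in range(k_)]
    return distances.index(min(distances))

# quantize: one small integer ID per subvector
ids = [nearest(c[j], u[j]) for j in range(m)]

# reconstruct: map each ID back to its centroid (reproduction value) and concatenate
q = []
for j in range(m):
    q.extend(c[j][ids[j]])

def mse(v, w):
    # mean squared error between the original vector and its reconstruction
    return sum((a - b) ** 2 for a, b in zip(v, w)) / len(v)

print(ids)        # e.g. [3, 0, 5, 1]: which centroid each subvector was mapped to
print(mse(x, q))  # the quantization error; it shrinks as k (and memory) grows
```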
Through sort of increasing the M value, or increasing K, we should be able to minimize 00:26:30.800 |
this, and therefore improve the performance of our index. So I have one final thing to show you, 00:26:37.760 |
which is how the search works in PQ. So on the right we have just a single ID vector. 00:26:47.360 |
We would use our codebook, C, so that they would each go into this, like so. 00:26:59.360 |
And from there we would output our quantized subvectors. So these are their reproduction 00:27:07.280 |
values. Now on the left we have our query vector, and we would split that, of course, 00:27:17.600 |
like we did before with our initial vectors. So we would split that, and we would get this. 00:27:28.080 |
Now, what we would do is we calculate the Euclidean distance between these, 00:27:34.800 |
and what we would get is we would simply add all of these up. So the total distance would be all of 00:27:43.520 |
those per-subspace distances added together. (The product in product quantization actually refers 00:27:50.800 |
to the Cartesian product of the subspaces.) Okay, let's say this. Now we would do that for all of our vectors. So this is a 00:28:01.680 |
single, so this is like QU. That's a single quantized vector, or set of quantized subvectors. 00:28:11.120 |
We need to do that for all of them. And then we just find the one that produces the lowest or the 00:28:16.720 |
smallest distance. Now, as well, it's worth noting that because we're comparing against the 00:28:22.560 |
quantized version of u rather than u itself, this isn't really the true distance between u and 00:28:28.240 |
xq, but it's almost like a proxy value for that. And then that's 00:28:38.800 |
what we use to find, to identify the nearest vector. Now, I'm not going to go into the code 00:28:45.280 |
behind this because it's quite long, but I am going to, I'm definitely going to record that 00:28:50.560 |
and just leave it as like an extra or almost bonus video if you do want to go through that. 00:28:55.920 |
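The search code itself is left for that bonus video, but as a rough sketch of the idea just described, reusing the euclidean helper and the codebook notation from the snippets above (the names pq_distance and pq_search are placeholders, not from the video):

```python
def pq_distance(xq, ids_row, c, m, d_):
    # approximate distance between a query xq and one stored ID vector:
    # compare each query subvector against the centroid its ID points to, then sum
    total = 0.0
    for j in range(m):
        xq_j = xq[j * d_:(j + 1) * d_]              # query subvector for subspace j
        total += euclidean(c[j][ids_row[j]], xq_j)  # distance to the stored centroid
    return total

def pq_search(xq, database_ids, c, m, d_):
    # database_ids holds one ID vector per stored database vector
    distances = [pq_distance(xq, ids_row, c, m, d_) for ids_row in database_ids]
    return distances.index(min(distances))  # position of the approximate nearest vector
```

In practice, libraries like Faiss speed this up by precomputing a table of query-subvector-to-centroid distances per subspace, so each stored vector only needs m table lookups and a sum rather than fresh distance calculations.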
But it is, it's pretty long, so I wouldn't say you necessarily need to. But in the next video, 00:29:02.240 |
anyway, we're going to have a look at how we implement all this in Faiss, which is obviously 00:29:07.200 |
much more efficient. And it is in that video that we will also introduce the IVF and PQ 00:29:15.440 |
composite index, which just allows us, if you have a look at the speed here, 00:29:19.200 |
to really reduce the search time an insane amount. It's like a 90 times increase in the speed 00:29:27.760 |
of our search. So it's pretty cool. But yeah, that's it for this video. 00:29:31.840 |
So thank you very much for watching, and I will see you in the next one.