back to indexComposite Indexes and the Faiss Index Factory
Chapters
0:0 Intro
1:54 Composite Indexes
6:43 Faiss Index Factory
11:34 Why we use Index Factory
17:11 Outro
00:00:00.000 |
And welcome to the next video in our similarity search series. So far we've worked on quite a few 00:00:08.400 |
different indexes that we can use in FISE for similarity search and through the past 00:00:15.280 |
few videos and articles we've tested a lot of different indexes and we've sort of tested 00:00:22.960 |
to see what works and what doesn't work and the sort of pros and cons of each of those indexes. 00:00:30.400 |
Now that's very good and we've had some I think pretty interesting and cool results but in reality 00:00:38.320 |
the best or the highest performing indexes are all composite indexes. Now composite simply means an 00:00:47.520 |
index which is built of multiple parts and all state-of-the-art performances are produced by 00:00:55.600 |
these composite indexes. So what we'll cover in this video is composite indexes at a high level, 00:01:02.960 |
we'll just introduce them. We'll also learn how we can use the FISE index factory function. Now 00:01:08.720 |
when we're building composite indexes this function is incredibly useful and I would say 00:01:15.280 |
almost essential. You can build composite indexes without it but it's very difficult and the code is 00:01:23.680 |
pretty horrific to actually work with. So I think index factory is pretty essential in building 00:01:31.760 |
these sort of indexes and we'll also explore how we can build a few of the more popular 00:01:38.400 |
composite indexes as well. So IVF-ADC, Multi-D-ADC and IVF-HMSW indexes and let's explain what all 00:01:50.960 |
of those are because I know it's just a load of words and letters. So first we want to have a look 00:01:56.400 |
at the composite components. So the components that we can have within a composite index. So 00:02:01.920 |
there are two sort of main components that we're pretty much always going to have 00:02:08.480 |
and that is so in the middle we have what is typically a coarse quantizer. So I'll just put 00:02:16.560 |
coarse here. This is stuff like IVF and we can also use HMSW here and then we pretty much always 00:02:27.520 |
have another component which is our fine quantizer. Now in terms of fine quantization you're looking 00:02:35.920 |
at things like product quantization but in this sort of space we would also just include flat 00:02:43.600 |
vectors. Although that's not really a form of quantization it's just using flat vectors they 00:02:49.520 |
would be sort of interchangeable. So flat vectors get put into this box not because they are a form 00:02:56.560 |
of fine quantization but simply because flat vectors would be placed into that position 00:03:02.480 |
within the index structure. Then the most common addition to these two would be the pre-processing 00:03:12.320 |
step. So the pre-processing of our vectors and here you have a few things you can have 00:03:19.200 |
PCA which is dimensionality reduction or we can also have OPQ. Now OPQ is optimized product 00:03:27.840 |
quantization and in order to integrate both of these we would have to have PQ down here as well 00:03:34.720 |
and all this OPQ step would do is rotate the vectors that are being placed into our index 00:03:42.320 |
in such a way that it optimizes them for product quantization later on. These three are probably 00:03:50.080 |
the most common ones but we also have another block as well and that comes at the end here. 00:03:56.400 |
This is a sort of like a refinement step. All throughout here we've sort of built our 00:04:03.600 |
approximate search and then we return top 10k or top 10,000 most relevant results according 00:04:12.880 |
to our approximate search and then we perform an exhaustive search at the end which will 00:04:17.920 |
re-rank them based on a more definitive, I would say, search. Here's just a few examples of that. 00:04:27.600 |
So we have the vector transform, coarse quantizer, fine quantizer and refinement steps there. 00:04:34.240 |
So for example what we might do is when the vectors are incoming we transform them, 00:04:41.920 |
we rotate them using OPQ, then we perform coarse quantization and assign all of those 00:04:51.360 |
incoming vectors, those rotated vectors now, into different lists within a IVF index. 00:04:58.480 |
Once they've been assigned to their different clusters or cells we use product quantization 00:05:06.800 |
to compress them and reduce the memory usage of our index. And then when it comes to search time 00:05:13.920 |
what we might want to do is use a refinement step which would take the full vectors which in this 00:05:22.080 |
case would not allow us to reduce the memory usage of our index significantly. But we could 00:05:30.560 |
add a PQ refinement step here so those vectors are stored as their quantized forms or compressed 00:05:38.880 |
forms rather than flat vectors, whichever way works for your index. During our search we would 00:05:45.920 |
go through each of these steps and turn the top 10,000 items, so the top 10k, and using our final 00:05:56.160 |
refinement step rather than using IVF we would just take the Euclidean distance or the L2 distance 00:06:03.440 |
between all of them and find the ones that are most similar. And this is an example of that, 00:06:09.840 |
so this is IVF PQ, this is an example of the two middle parts of that, so IVF and PQ, 00:06:18.240 |
we are excluding the re-ranking and excluding the OPQ part. And what we're doing here is we 00:06:27.440 |
take our original vector, we assign it to an IVF cell, and then we also take that original vector 00:06:33.280 |
in a separate almost stream, we quantize it, convert them into the codes, and place that 00:06:42.000 |
under the IVF cell assignment. Now I think that's plenty for an introduction to CompCert Index, 00:06:51.200 |
let's move on to how we start building them using the Index Factory. Okay, so here I'm just 00:06:59.680 |
importing our data, we're using the CIFT1M data set as we usually do. If you don't have that, 00:07:07.440 |
there will be a link in the description to the download script for that. And I've also 00:07:13.520 |
also just imported FICE up here. So the first thing I'm going to do is show you the difference 00:07:20.560 |
between building an index with the non-method that we've been using in every previous article and 00:07:28.960 |
video and how we can do the same using the Index Factory. So what we need to do is we write 00:07:36.080 |
Quantizer, FICE, and in here I'm going to write Index Flat L2. So this is because we're using the 00:07:45.360 |
flat vectors. Again, like I said before, the flat vectors are in the place of where we would put a 00:07:52.800 |
fine quantizer usually, but they themselves are not actually a fine quantizer. And then to build 00:08:00.480 |
our index from that, we would write FICE, and this is our Index Flat IVF, or IVF Flat. 00:08:09.760 |
And in here we first pass our quantizer, then our dimensionality, and then we pass NList, 00:08:19.840 |
which is the number of Voronoi cells or oculus that we want to include in our IVF list. Okay, 00:08:27.040 |
after that we add our vectors. Actually, we train them first, so we train on those vectors, 00:08:34.720 |
and then we add them. Okay, do I need to, oh, I need to rerun this. Okay, 00:08:45.200 |
run that and run this again. Once we have those, we can do search like we usually do, 00:08:56.080 |
and that's how we build our index and then add data to it. So that's the usual approach. Using 00:09:03.680 |
the Index Factory approach is slightly different. So what we do is instead we write, I'm going to 00:09:09.840 |
write Index F just so we can compare the two indexes later on, and all we need to write is 00:09:15.600 |
FICE. We call it Index Factory Function, and in here we need to pass the dimensionality of our 00:09:22.960 |
vectors like we usually do, but then we pass this, it's called an Index Factory String, 00:09:29.120 |
and in here we specify the structure of the index that we'd like to build. 00:09:36.640 |
So we're going to say IVF256, so an IVF index with 256 cells, followed by, or which contains 00:09:48.400 |
flat vectors, and that's it. So we can run that, and then we do the same thing again. We do Index 00:09:54.960 |
F train XB, and Index F add XB, like so. And that should work. So once we've done that, 00:10:07.680 |
what I want to show you is that they do in fact output the exact same results. So we need to do 00:10:15.920 |
DI, so what we normally do, index.search, XQ, K. K I haven't defined, let's go with 100, 00:10:23.520 |
or let's go 10, because so we can actually see the results. 00:10:29.360 |
Okay, and then let's do the same, but for Index F, which is our Index Factory Index. 00:10:44.080 |
and you should see the same thing again. I'm going to just replace these as well. 00:10:49.600 |
Okay, and I mean, we can see they're exactly the same. There's no difference between them. So we 00:10:59.840 |
were returning the same nearest neighbors using both. So they are the same indexes is what I'm 00:11:06.480 |
trying to point out here. The search time is pretty much the same. I always find it's like 00:11:13.040 |
very slightly faster when you use Index Factory, but I mean very slightly. It's on like micro 00:11:20.880 |
scale value, like maybe five microseconds faster, which is obviously not that much. 00:11:26.800 |
And in terms of memory uses, they're both exactly the same as well. So that's a very simple example, 00:11:35.440 |
but obviously when we start building more complex indexes, the Index Factory becomes a lot more 00:11:42.160 |
useful. So let me show you a good example of why we'd use Index Factory, e.g. let's put together 00:11:50.880 |
a more complex, a more complex index, and we'll see the difference between the two approaches. 00:11:56.960 |
Okay, so I've just pulled this code in here, and this is, I said before, I have those kind of like 00:12:03.920 |
four different components that we can have. We have pre-processing, and we have like the coarse 00:12:09.200 |
and fine quantizer. So the fine quantizer here, I'll just rename this as Vex because it's not 00:12:17.280 |
really a quantizer, so it's just a flat vectors. And then we also have the set that comes after 00:12:24.640 |
where we're re-ranking the vectors as well. And I also need to add that in. So it's just 00:12:30.080 |
index equals vice index refine flat, and in here we have to pass all, well we have to pass this. 00:12:38.880 |
So I'm going to rename that to Q and add it in here. So here we're building our index like we 00:12:45.040 |
usually do. So we've got our rotation, OPQ rotation, which is a pre-processing set, and then 00:12:51.280 |
we go on to our IVF. So we're putting, sorting those out into different cells, and we are using 00:12:59.360 |
the fine quantizer, it's not really a quantizer, it's the flat vectors, and then we're merging all 00:13:05.120 |
of those together into this sort of merged almost proto-index. We could use this as an index, but 00:13:15.280 |
then we want to add another post-processing set, which is a re-ranking, which is what we've done 00:13:21.520 |
there. Okay, and I mean all this code is quite complicated, I mean at least to me it's not that 00:13:29.680 |
easy to read, there's a lot going on. So running up, it can take a little bit of time to build, 00:13:36.160 |
to be honest, and maybe it's better if I use less data. I think it's fine for now. If it takes too 00:13:47.520 |
long I'll reduce how much data we have. And if we wanted to rebuild this in using the index factory, 00:13:55.680 |
it's super simple. All we do is we write index equals FICE index factory, so literally we're 00:14:05.040 |
going to do this on one line and pretty much all of this. Index factory, and we have our dimensionality 00:14:11.920 |
D, and then we have our string. So first we have the OPQ set, so the pre-processing set. 00:14:19.120 |
In there we're using an M value of 32. Then that goes on to our IVF set, so we're using IVF there. 00:14:31.840 |
We are using an endless value of 256. That's flat. Okay, I've just added these, 00:14:41.120 |
so it's not training for too long, I've added these in here. So here we're not using flat, 00:14:49.840 |
so we're using flat vectors here, but then we're using IVF PQ. So this is actually PQ, and we're 00:14:58.800 |
using M here, so M is 32 up here. And then the next step after that is our re-indexing or 00:15:10.800 |
refinement step, and that's just R flat like this. And that's it. That's the whole, what we wrote, 00:15:21.680 |
all of this, we compress it into the single line with the index factory, which is why when you're 00:15:28.800 |
doing the more complicated stuff, or at least when you start using composite indexes, I think 00:15:36.320 |
index factory is practically an essential. Missing it is, or try and do it without, 00:15:44.160 |
is you can do it of course, like we showed here, you can, but it's difficult. And it gets a lot 00:15:50.880 |
more complicated than this as well. So I think it's probably the best approach, in my opinion, 00:15:57.200 |
at least. Now, if we train and do the same again, where I'm just doing the first 50k, 00:16:04.880 |
which is still taking forever, maybe because we're recording at the same time. 00:16:14.960 |
Okay. And what I'm just going to do here is di equals index search. And then we can, 00:16:22.400 |
again, we just see what we return and make sure they match. Okay. And let's have a look at what 00:16:31.840 |
we have. Let's print out i before we lose it. And we do the same here. So, and we'll just do 00:16:40.560 |
the search as well. Index search, xqk. And again, we just want to see i. Okay. And that's finished. 00:16:52.160 |
And we can see, so we compare those results and they are exactly the same, which is what we'd 00:16:58.640 |
expect because we are building the same index. So from this to this using the index factory. So I 00:17:05.840 |
think it's fair to say there is good reason to use it when you're building these composite indexes. 00:17:12.640 |
Now in the next video, we're going to be covering the composite indexes and how we can build them 00:17:20.240 |
using the index factory. Now, if you are watching this on YouTube, it's the next video in the 00:17:27.920 |
playlist. Otherwise, if you're on, if you're over on Pinecone reading the article, it's below. So 00:17:35.200 |
you can read about it or watch it there. So thank you very much for watching and I'll see you in