
Composite Indexes and the Faiss Index Factory


Chapters

0:00 Intro
1:54 Composite Indexes
6:43 Faiss Index Factory
11:34 Why we use Index Factory
17:11 Outro

Whisper Transcript

00:00:00.000 | And welcome to the next video in our similarity search series. So far we've worked on quite a few
00:00:08.400 | different indexes that we can use in Faiss for similarity search, and through the past
00:00:15.280 | few videos and articles we've tested a lot of different indexes to see what works,
00:00:22.960 | what doesn't, and the pros and cons of each of those indexes.
00:00:30.400 | Now that's very good and we've had some I think pretty interesting and cool results but in reality
00:00:38.320 | the best or the highest performing indexes are all composite indexes. Now composite simply means an
00:00:47.520 | index which is built of multiple parts and all state-of-the-art performances are produced by
00:00:55.600 | these composite indexes. So what we'll cover in this video is composite indexes at a high level,
00:01:02.960 | we'll just introduce them. We'll also learn how we can use the Faiss index factory function. Now
00:01:08.720 | when we're building composite indexes this function is incredibly useful and I would say
00:01:15.280 | almost essential. You can build composite indexes without it but it's very difficult and the code is
00:01:23.680 | pretty horrific to actually work with. So I think index factory is pretty essential in building
00:01:31.760 | these sort of indexes and we'll also explore how we can build a few of the more popular
00:01:38.400 | composite indexes as well: IVF-ADC, Multi-D-ADC and IVF-HNSW indexes. And let's explain what all
00:01:50.960 | of those are because I know it's just a load of words and letters. So first we want to have a look
00:01:56.400 | at the composite components. So the components that we can have within a composite index. So
00:02:01.920 | there are two sort of main components that we're pretty much always going to have
00:02:08.480 | and that is so in the middle we have what is typically a coarse quantizer. So I'll just put
00:02:16.560 | coarse here. This is stuff like IVF and we can also use HNSW here and then we pretty much always
00:02:27.520 | have another component which is our fine quantizer. Now in terms of fine quantization you're looking
00:02:35.920 | at things like product quantization but in this sort of space we would also just include flat
00:02:43.600 | vectors. Although that's not really a form of quantization it's just using flat vectors they
00:02:49.520 | would be sort of interchangeable. So flat vectors get put into this box not because they are a form
00:02:56.560 | of fine quantization but simply because flat vectors would be placed into that position
00:03:02.480 | within the index structure. Then the most common addition to these two would be the pre-processing
00:03:12.320 | step. So the pre-processing of our vectors and here you have a few things you can have
00:03:19.200 | PCA which is dimensionality reduction or we can also have OPQ. Now OPQ is optimized product
00:03:27.840 | quantization and in order to integrate both of these we would have to have PQ down here as well
00:03:34.720 | and all this OPQ step would do is rotate the vectors that are being placed into our index
00:03:42.320 | in such a way that it optimizes them for product quantization later on. These three are probably
00:03:50.080 | the most common ones but we also have another block as well and that comes at the end here.
00:03:56.400 | This is sort of like a refinement step. All throughout here we've sort of built our
00:04:03.600 | approximate search and then we return top 10k or top 10,000 most relevant results according
00:04:12.880 | to our approximate search and then we perform an exhaustive search at the end which will
00:04:17.920 | re-rank them based on a more definitive, I would say, search. Here's just a few examples of that.
00:04:27.600 | So we have the vector transform, coarse quantizer, fine quantizer and refinement steps there.
00:04:34.240 | So for example what we might do is when the vectors are incoming we transform them,
00:04:41.920 | we rotate them using OPQ, then we perform coarse quantization and assign all of those
00:04:51.360 | incoming vectors, those rotated vectors now, into different lists within an IVF index.
00:04:58.480 | Once they've been assigned to their different clusters or cells we use product quantization
00:05:06.800 | to compress them and reduce the memory usage of our index. And then when it comes to search time
00:05:13.920 | what we might want to do is use a refinement step which would take the full vectors which in this
00:05:22.080 | case would not allow us to reduce the memory usage of our index significantly. But we could
00:05:30.560 | add a PQ refinement step here so those vectors are stored as their quantized forms or compressed
00:05:38.880 | forms rather than flat vectors, whichever way works for your index. During our search we would
00:05:45.920 | go through each of these steps and return the top 10,000 items, so the top 10k, and in our final
00:05:56.160 | refinement step, rather than using IVF, we would just take the Euclidean distance or the L2 distance
00:06:03.440 | between all of them and find the ones that are most similar. And this is an example of that,
00:06:09.840 | so this is IVF PQ, this is an example of the two middle parts of that, so IVF and PQ,
00:06:18.240 | we are excluding the re-ranking and excluding the OPQ part. And what we're doing here is we
00:06:27.440 | take our original vector, we assign it to an IVF cell, and then we also take that original vector
00:06:33.280 | in a separate almost stream, we quantize it, convert them into the codes, and place that
00:06:42.000 | under the IVF cell assignment. Now I think that's plenty for an introduction to composite indexes,
00:06:51.200 | let's move on to how we start building them using the Index Factory. Okay, so here I'm just
00:06:59.680 | importing our data, we're using the Sift1M dataset as we usually do. If you don't have that,
00:07:07.440 | there will be a link in the description to the download script for that. And I've also
00:07:13.520 | just imported Faiss up here. So the first thing I'm going to do is show you the difference
00:07:20.560 | between building an index with the normal method that we've been using in every previous article and
00:07:28.960 | video and how we can do the same using the Index Factory. So what we need to do is we write
00:07:36.080 | quantizer equals faiss, and in here I'm going to write IndexFlatL2. So this is because we're using the
00:07:45.360 | flat vectors. Again, like I said before, the flat vectors are in the place of where we would put a
00:07:52.800 | fine quantizer usually, but they themselves are not actually a fine quantizer. And then to build
00:08:00.480 | our index from that, we would write faiss, and this is our IndexIVFFlat, or IVF Flat.
00:08:09.760 | And in here we first pass our quantizer, then our dimensionality, and then we pass nlist,
00:08:19.840 | which is the number of Voronoi cells or clusters that we want to include in our IVF index. Okay,
00:08:27.040 | after that we add our vectors. Actually, we train them first, so we train on those vectors,
00:08:34.720 | and then we add them. Okay, do I need to, oh, I need to rerun this. Okay,
00:08:45.200 | run that and run this again. Once we have those, we can do search like we usually do,
00:08:56.080 | and that's how we build our index and then add data to it. So that's the usual approach. Using
00:09:03.680 | the Index Factory approach is slightly different. So what we do is instead we write, I'm going to
00:09:09.840 | write Index F just so we can compare the two indexes later on, and all we need to write is
00:09:15.600 | faiss. We call the index_factory function, and in here we need to pass the dimensionality of our
00:09:22.960 | vectors like we usually do, but then we pass this, it's called an Index Factory String,
00:09:29.120 | and in here we specify the structure of the index that we'd like to build.
00:09:36.640 | So we're going to say IVF256, so an IVF index with 256 cells, followed by, or which contains
00:09:48.400 | flat vectors, and that's it. So we can run that, and then we do the same thing again. We do Index
00:09:54.960 | F train XB, and Index F add XB, like so. And that should work. So once we've done that,
00:10:07.680 | what I want to show you is that they do in fact output the exact same results. So we need to do
00:10:15.920 | D, I equals, as we normally do, index.search, xq, k. K I haven't defined, let's go with 100,
00:10:23.520 | or let's go 10, so we can actually see the results.
00:10:29.360 | Okay, and then let's do the same, but for Index F, which is our Index Factory Index.
00:10:40.880 | So here we just replace index with Index F,
00:10:44.080 | and you should see the same thing again. I'm going to just replace these as well.
00:10:49.600 | Okay, and I mean, we can see they're exactly the same. There's no difference between them. So we
00:10:59.840 | were returning the same nearest neighbors using both. So they are the same indexes is what I'm
00:11:06.480 | trying to point out here. The search time is pretty much the same. I always find it's like
00:11:13.040 | very slightly faster when you use Index Factory, but I mean very slightly. It's on like micro
00:11:20.880 | scale value, like maybe five microseconds faster, which is obviously not that much.
00:11:26.800 | And in terms of memory usage, they're both exactly the same as well. So that's a very simple example,
00:11:35.440 | but obviously when we start building more complex indexes, the Index Factory becomes a lot more
00:11:42.160 | useful. So let me show you a good example of why we'd use Index Factory. Let's put together
00:11:50.880 | a more complex index, and we'll see the difference between the two approaches.
00:11:56.960 | Okay, so I've just pulled this code in here, and this is, as I said before, those kind of
00:12:03.920 | four different components that we can have. We have pre-processing, and we have the coarse
00:12:09.200 | and fine quantizer. So the fine quantizer here, I'll just rename this as vecs because it's not
00:12:17.280 | really a quantizer, it's just the flat vectors. And then we also have the step that comes after
00:12:24.640 | where we're re-ranking the vectors as well. And I also need to add that in. So it's just
00:12:30.080 | index equals faiss.IndexRefineFlat, and in here we have to pass, well we have to pass this.
00:12:38.880 | So I'm going to rename that to q and add it in here. So here we're building our index like we
00:12:45.040 | usually do. So we've got our rotation, the OPQ rotation, which is a pre-processing step, and then
00:12:51.280 | we go on to our IVF. So we're sorting those vectors out into different cells, and we are using
00:12:59.360 | the fine quantizer, it's not really a quantizer, it's the flat vectors, and then we're merging all
00:13:05.120 | of those together into this sort of merged almost proto-index. We could use this as an index, but
00:13:15.280 | then we want to add another post-processing step, which is the re-ranking, which is what we've done
00:13:21.520 | there. Okay, and I mean all this code is quite complicated, I mean at least to me it's not that
00:13:29.680 | easy to read, there's a lot going on. So running it, it can take a little bit of time to build,
00:13:36.160 | to be honest, and maybe it's better if I use less data. I think it's fine for now. If it takes too
00:13:47.520 | long I'll reduce how much data we have. And if we wanted to rebuild this using the index factory,
00:13:55.680 | it's super simple. All we do is we write index equals faiss.index_factory, so literally we're
00:14:05.040 | going to do all of this on one line. index_factory, and we have our dimensionality
00:14:11.920 | D, and then we have our string. So first we have the OPQ step, so the pre-processing step.
00:14:19.120 | In there we're using an M value of 32. Then that goes on to our IVF step, so we're using IVF there.
00:14:31.840 | We are using an nlist value of 256. That's flat. Okay, I've just added these,
00:14:41.120 | so it's not training for too long, I've added these in here. So here we're not using Flat,
00:14:49.840 | we're using flat vectors up here, but then we're using IVFPQ. So this is actually PQ, and we're
00:14:58.800 | using M here, so M is 32 up here. And then the next step after that is our re-indexing or
00:15:10.800 | refinement step, and that's just RFlat, like this. And that's it. The whole of what we wrote,
00:15:21.680 | all of this, we compress into a single line with the index factory, which is why when you're
00:15:28.800 | doing the more complicated stuff, or at least when you start using composite indexes, I think
00:15:36.320 | index factory is practically essential. Trying to do without it,
00:15:44.160 | you can of course, like we showed here, but it's difficult. And it gets a lot
00:15:50.880 | more complicated than this as well. So I think it's probably the best approach, in my opinion,
00:15:57.200 | at least. Now, if we train and do the same again, where I'm just doing the first 50k,
00:16:04.880 | which is still taking forever, maybe because we're recording at the same time.
00:16:14.960 | Okay. And what I'm just going to do here is D, I equals index.search. And then we can,
00:16:22.400 | again, we just see what we return and make sure they match. Okay. And let's have a look at what
00:16:31.840 | we have. Let's print out I before we lose it. And we do the same here. So, we'll just do
00:16:40.560 | the search as well: index.search, xq, k. And again, we just want to see I. Okay. And that's finished.
00:16:52.160 | And we can see, so we compare those results and they are exactly the same, which is what we'd
00:16:58.640 | expect because we are building the same index. So from this to this using the index factory. So I
00:17:05.840 | think it's fair to say there is good reason to use it when you're building these composite indexes.
00:17:12.640 | Now in the next video, we're going to be covering the composite indexes and how we can build them
00:17:20.240 | using the index factory. Now, if you are watching this on YouTube, it's the next video in the
00:17:27.920 | playlist. Otherwise, if you're over on Pinecone reading the article, it's below. So
00:17:35.200 | you can read about it or watch it there. So thank you very much for watching and I'll see you in the next one.