And welcome to the next video in our similarity search series. So far we've worked through quite a few different indexes that we can use in Faiss for similarity search, and through the past few videos and articles we've tested a lot of them to see what works, what doesn't, and the pros and cons of each.
Now that's all very good, and we've had some pretty interesting and cool results, but in reality the best or highest-performing indexes are all composite indexes. Composite simply means an index built from multiple parts, and all state-of-the-art performance is produced by these composite indexes.
So what we'll cover in this video is composite indexes at a high level; we'll just introduce them. We'll also learn how we can use the Faiss index_factory function. When we're building composite indexes this function is incredibly useful, I would say almost essential. You can build composite indexes without it, but it's very difficult and the code is pretty horrific to actually work with.
So I think index_factory is pretty essential for building these sorts of indexes, and we'll also explore how we can build a few of the more popular composite indexes: IVFADC, Multi-D-ADC, and IVF-HNSW. And let's explain what all of those are, because I know it's just a load of words and letters.
So first we want to have a look at the composite components, the components we can have within a composite index. There are two main components that we're pretty much always going to have. In the middle we have what is typically a coarse quantizer, so I'll just put coarse here. This is stuff like IVF, and we can also use HNSW here. Then we pretty much always have another component, which is our fine quantizer. In terms of fine quantization you're looking at things like product quantization (PQ), but in this space we would also include flat vectors. Now that's not really a form of quantization, it's just using flat vectors, but they're sort of interchangeable: flat vectors get put into this box not because they're a form of fine quantization, but simply because flat vectors would be placed into that position within the index structure.
Then the most common addition to these two would be a pre-processing step for our vectors. Here you have a few options: PCA, which is dimensionality reduction, or OPQ. Now OPQ is optimized product quantization, and to use it we would also need PQ as the fine quantizer down here. All the OPQ step does is rotate the vectors being placed into our index in such a way that it optimizes them for product quantization later on.
Those three are probably the most common components, but we also have another block that comes at the end: a sort of refinement step. Everything up to this point builds our approximate search, which returns, say, the top 10k most relevant results, and then at the end we perform an exhaustive search which re-ranks them based on a more definitive search.
Here are just a few examples of that. We have the vector transform, coarse quantizer, fine quantizer, and refinement steps. For example, when vectors come in we might transform them, rotating them using OPQ, then perform coarse quantization and assign all of those now-rotated vectors into different lists within an IVF index.
Once they've been assigned to their different clusters or cells, we use product quantization to compress them and reduce the memory usage of our index. Then at search time we might want to use a refinement step. If that step keeps the full flat vectors, it would not allow us to reduce the memory usage of our index significantly.
But we could make it a PQ refinement step instead, so those vectors are stored in their quantized or compressed forms rather than as flat vectors, whichever works for your index. During search we would go through each of these steps, return the top 10k items, and in our final refinement step, rather than using IVF, we would just take the Euclidean distance, the L2 distance, to all of them and find the ones that are most similar.
And this is an example of that: IVFPQ, which is just the two middle parts, IVF and PQ, excluding the re-ranking and excluding the OPQ part. What we're doing here is taking our original vector and assigning it to an IVF cell, and then, in a separate almost parallel stream, we also quantize that original vector, convert it into PQ codes, and store those codes under the IVF cell assignment.
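As a rough sketch of that structure in code, an IVFPQ index in Faiss can be built like this; the parameter values here are placeholders for illustration, not necessarily the ones used later in the video:

```python
import faiss

d = 128      # vector dimensionality (Sift1M vectors are 128-d)
nlist = 256  # number of IVF cells
m = 8        # number of PQ subquantizers (must divide d)
nbits = 8    # bits per subquantizer code

# the flat index handles the coarse IVF cell assignment
coarse = faiss.IndexFlatL2(d)
# PQ then compresses the vectors stored inside each cell
index = faiss.IndexIVFPQ(coarse, d, nlist, m, nbits)
```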
Now I think that's plenty for an introduction to composite indexes; let's move on to how we start building them using the index_factory. Okay, so here I'm just importing our data; we're using the Sift1M dataset as we usually do. If you don't have that, there will be a link in the description to the download script for it.
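For anyone following along without the script, reading the Sift1M .fvecs files with NumPy looks roughly like this; the file paths are placeholders for wherever you downloaded the data:

```python
import numpy as np

def read_fvecs(fp):
    # each .fvecs record is: an int32 dimension d, then d float32 values
    a = np.fromfile(fp, dtype='int32')
    d = a[0]
    return a.reshape(-1, d + 1)[:, 1:].copy().view('float32')

xb = read_fvecs('sift/sift_base.fvecs')   # database vectors to index
xq = read_fvecs('sift/sift_query.fvecs')  # query vectors to search with
```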
And I've also just imported Faiss up here. So the first thing I'm going to do is show you the difference between building an index with the normal method we've used in every previous article and video, and doing the same with the index_factory. So what we do is write our quantizer, faiss, and in here I'm going to write IndexFlatL2.
This is because we're using flat vectors. Again, like I said before, the flat vectors sit in the place where we would usually put a fine quantizer, but they themselves are not actually a fine quantizer. Then to build our index from that, we write faiss, and this is our IndexIVFFlat.
In here we first pass our quantizer, then our dimensionality, and then we pass nlist, which is the number of Voronoi cells, or clusters, that we want in our IVF index. After that we add our vectors. Actually, we train on those vectors first, and then we add them.
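Altogether, a minimal version of that usual approach looks like this:

```python
import faiss

d = xb.shape[1]  # 128 for Sift1M
nlist = 256      # number of Voronoi cells

quantizer = faiss.IndexFlatL2(d)  # flat index used for cell assignment
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)  # learn the nlist cell centroids from the data
index.add(xb)    # assign the vectors to their nearest cells
```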
Okay, I need to rerun this; run that, and run this again. Once we have those, we can search like we usually do, and that's how we build our index and add data to it. So that's the usual approach. Using the index_factory approach is slightly different.
So what we do instead is write index_f, just so we can compare the two indexes later on, and all we need is faiss. We call the index_factory function, and in here we pass the dimensionality of our vectors like we usually do, but then we pass what's called an index factory string, in which we specify the structure of the index we'd like to build.
So we're going to say IVF256, an IVF index with 256 cells, followed by Flat, meaning it contains flat vectors, so the string is "IVF256,Flat", and that's it. We can run that, and then we do the same thing again: index_f.train(xb) and index_f.add(xb), like so. And that should work.
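So the whole index_factory version is just:

```python
index_f = faiss.index_factory(d, "IVF256,Flat")

index_f.train(xb)
index_f.add(xb)
```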
So once we've done that, what I want to show you is that they do in fact output exactly the same results. We do what we normally do: D, I = index.search(xq, k). I haven't defined k; let's go with 10 rather than 100, so we can actually see the results. Okay, and then let's do the same for index_f, our index_factory index. Here we just replace index with index_f, and you should see the same thing again; I'll replace these as well. And we can see they're exactly the same.
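Put together, the comparison looks something like this (a minimal sketch, assuming the indexes and imports from above):

```python
k = 10  # number of nearest neighbors to return

D, I = index.search(xq, k)        # the manually built index
D_f, I_f = index_f.search(xq, k)  # the index_factory version

# both should return identical neighbor IDs
print(np.array_equal(I, I_f))
```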
There's no difference between them; we return the same nearest neighbors using both. They are the same index, which is what I'm trying to point out here. The search time is pretty much the same too. I always find it's very slightly faster when you use index_factory, but only very slightly.
It's on the microsecond scale, maybe five microseconds faster, which is obviously not that much. And in terms of memory usage, they're both exactly the same as well. So that's a very simple example, but obviously when we start building more complex indexes, the index_factory becomes a lot more useful.
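If you want to sanity-check the timing yourself, a rough single-query comparison might look like this; one-off wall-clock timings are noisy, so don't read too much into the exact numbers:

```python
import time

t0 = time.perf_counter()
index.search(xq[:1], k)    # one query against the manual index
t1 = time.perf_counter()
index_f.search(xq[:1], k)  # the same query against the factory index
t2 = time.perf_counter()

print(f"manual:  {(t1 - t0) * 1e6:.0f} us")
print(f"factory: {(t2 - t1) * 1e6:.0f} us")
```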
So let me show you a good example of why we'd use index_factory: let's put together a more complex index and see the difference between the two approaches. Okay, so I've just pulled this code in here, and as I said before, we have those four different components that we can have.
We have pre-processing, and we have the coarse and fine quantizers. The 'fine quantizer' here I'll just rename to vecs, because it's not really a quantizer, it's just the flat vectors. And then we also have the step that comes after, where we're re-ranking the vectors as well. I need to add that in too: it's just index = faiss.IndexRefineFlat, and in here we have to pass the merged index, so I'm going to rename that to q and pass it in here. So here we're building our index like we usually do. We've got our OPQ rotation, which is a pre-processing step, then we go on to our IVF step, sorting those vectors out into different cells, with the flat vectors in the quantizer slot (again, not really a quantizer), and then we're merging all of those together into this sort of merged, almost proto-index. We could use this as an index already, but we want to add another post-processing step, which is the re-ranking, and that's what we've done there. Okay, and all this code is quite complicated; at least to me it's not that easy to read, there's a lot going on.
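For reference, here's a sketch of roughly what that manual construction looks like, assuming m = 32, nbits = 8, and nlist = 256; treat these as placeholder values, since the exact parameters in the video may differ:

```python
m = 32       # number of OPQ/PQ subquantizers
nbits = 8    # bits per subquantizer code
nlist = 256  # number of IVF cells

opq = faiss.OPQMatrix(d, m)  # pre-processing: rotate vectors for PQ
vecs = faiss.IndexFlatL2(d)  # flat vectors in the quantizer slot
sub_index = faiss.IndexIVFPQ(vecs, d, nlist, m, nbits)  # coarse + fine quantization
q = faiss.IndexPreTransform(opq, sub_index)  # merge rotation and index: the proto-index
index = faiss.IndexRefineFlat(q)  # post-processing: re-rank against flat vectors

index.train(xb[:50_000])  # train on a 50k slice to keep it quick, as in the video
index.add(xb)
```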
So running it, it can take a little bit of time to build, to be honest, and maybe it's better if I use less data. I think it's fine for now; if it takes too long I'll reduce how much data we have. And if we wanted to rebuild this using the index_factory, it's super simple.
All we do is write index = faiss.index_factory, so literally we're going to do all of this on one line. We have our dimensionality d, and then we have our string. First we have the OPQ step, the pre-processing step, where we're using an m value of 32, so that's OPQ32. Then that goes on to our IVF step, where we're using an nlist value of 256, so IVF256. But this time we're not using Flat for the fine quantization; underneath this is an IVFPQ index, so we're using PQ32, with m matching the 32 up here. I've also cut the training data down here so it's not training for too long. Then the step after that is our re-ranking or refinement step, and that's just RFlat, like this. And that's it. Everything we wrote manually, all of this, we compress into a single line with the index_factory, which is why, when you're doing the more complicated stuff, or at least when you start using composite indexes, I think index_factory is practically essential.
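So, assuming the same parameters as the manual build above, the whole thing collapses to:

```python
index_f = faiss.index_factory(d, "OPQ32,IVF256,PQ32,RFlat")
```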
Trying to do it without, well, you can of course, as we showed here, but it's difficult, and it gets a lot more complicated than this as well. So I think it's probably the best approach, in my opinion at least. Now, if we train and do the same again, I'm just training on the first 50k vectors, and it's still taking a while, maybe because we're recording at the same time.
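Putting that together, training and the comparison look something like this; again a sketch, assuming the manual index above was trained on the same 50k slice so the two builds match:

```python
index_f.train(xb[:50_000])  # the same 50k training slice as the manual build
index_f.add(xb)

D, I = index.search(xq, k)
D_f, I_f = index_f.search(xq, k)
print(np.array_equal(I, I_f))  # the two builds should return the same neighbors
```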
Okay. And what I'm going to do here is D, I = index.search(xq, k), and then we can again just see what we return and make sure they match. Okay, let's have a look at what we have; let's print out I before we lose it. And we do the same here.
So we'll just do the search as well, index_f.search(xq, k), and again we just want to see I. Okay, that's finished. And comparing those results, they are exactly the same, which is what we'd expect, because we are building the same index.
We've gone from all of this to a single line using the index_factory, so I think it's fair to say there's good reason to use it when you're building these composite indexes. Now, in the next video we're going to cover those popular composite indexes themselves and how we can build them using the index_factory.
Now, if you're watching this on YouTube, it's the next video in the playlist. Otherwise, if you're over on Pinecone reading the article, it's just below, so you can read about it or watch it there. So thank you very much for watching, and I'll see you in the next one.