
Best Indexes for Similarity Search in Faiss


Chapters

0:00 Intro
0:30 IVFADC
3:30 IVFADC in Faiss
7:29 Multi-D-ADC
9:17 Multi-D-ADC in Faiss
14:43 IVF-HNSW
21:39 IVF-HNSW in Faiss
25:58 Outro

Transcript

Welcome to this video on composite indexes, the ones you can see on the screen now: IVFADC, Multi-D-ADC, and IVF-HNSW. In the last video in the series we introduced composite indexes and the index factory in Faiss, and what we're going to do now is go through each of these and build them using what we learned there.

Let's start with the first one, IVFADC. That stands for IVF, as in the inverted file index, followed by asymmetric distance computation. If you watched the videos or read the articles on product quantization, you might remember ADC; it's part of product quantization. During indexing, when we first take our vectors and add them to the index, there are two main steps. First, vectors are assigned to lists, or cells. That's the IVF part: our vector goes through the IVF assignment step, which places it into one of the Voronoi cells we usually visualize, and that gives us a cell ID. The second step is that the same vector goes through a product quantization compression step, which produces a PQ code, and that code is what gets stored in the IVF list. That is an IVFADC index at its core.

Search is where the ADC part comes in: asymmetric distance computation. When you quantize your vectors with product quantization, you have two alternative ways of searching. The first, symmetric distance computation, is not as popular: you take your query vector xq, quantize it, and compare it to all of your pre-quantized, pre-indexed vectors. Both sides of the comparison are quantized, which is why it's called symmetric. The alternative, which tends to work better, is asymmetric distance computation: you don't quantize your query vector; you take xq as it is and compare it directly against your pre-quantized, pre-indexed vectors. So that's where the IVF side and the ADC side of this index come from.

In Faiss we can use the index factory string "IVF256,PQ32x8" to pull that together: IVF on the left, product quantization on the right. So let's go to Python and put it together. I've already imported the data, again using the SIFT1M dataset; there'll be a link in the description explaining how to get it. From that we have xb.
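Before building the index, the symmetric-versus-asymmetric idea from a moment ago can be sketched without Faiss at all. Here a simple rounding function stands in for the quantizer; this is purely illustrative and is not how PQ actually quantizes vectors:

```python
import numpy as np

def quantize(v):
    # toy "quantizer": snap each component to the nearest integer
    return np.round(v)

xq = np.array([0.4, 1.6])               # raw query vector
xb_q = quantize(np.array([0.5, 1.5]))   # database vector, stored quantized

# symmetric (SDC): quantize the query too, compare code against code
sdc = np.linalg.norm(quantize(xq) - xb_q)

# asymmetric (ADC): keep the raw query, compare it against the stored code
adc = np.linalg.norm(xq - xb_q)

print(sdc, adc)
```

With SDC both sides carry quantization error (here the query and database vector even collapse to the same code), while ADC keeps the query exact, which is why it tends to give better distance estimates and better recall.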

xb here is the set of index vectors, the vectors we're searching through, and xq is a single vector that we're using to search. To put the index together we write faiss.index_factory, as we have been doing, passing the number of dimensions, which is 128 for this dataset, and then the index factory string. The first part is IVF256, so we're splitting the index into 256 cells. The second part, the ADC part, is PQ, and here we use an m value of 32 and an nbits value of 8.

By default nbits is 8, so we can include the x8 in the string, but we don't need to. The only reason you'd write it is if you wanted a lower nbits value, say 4. And if you try to go higher, say 12, you won't be able to, unless Faiss has updated this since: for now you're not allowed to go over 8 bits with the PQ or IVFPQ index, although that restriction should be removed in a future version.

Then, as usual, we call index.train; we really do need to train here, because both the IVF assignment and the product quantization require training. After that we add our vectors with index.add, and from there we search as we normally do: D, I = index.search(xq, k). I won't dwell on what it returns, but let me explain the parameters while it runs: m is the number of sub-vectors we split each full vector into for the product quantization step, and nbits, 8 here, is the number of bits used by each subquantizer within that step.

You can increase nbits to improve recall, or accuracy. Okay, and then you see we have our nearest indexes returned. Now let me show you the actual performance metrics for this. There are a few things to compare here: the blue line is what we just built, the IVFADC index, and the magenta line is still IVFADC but with an OPQ pre-processing step added. For me there was no real performance increase from OPQ there.

I also want to compare that to a simple flat index. The memory usage of a flat index on this data is something like half a gigabyte, whereas the memory usage for our two IVFPQ indexes is, I think, around 40 megabytes, so there's a huge difference. That's exactly why you'd use PQ in the first place: to minimize the memory usage of the index. In terms of recall, of course, you can't really beat a flat index. So let's move on to the next one.

Next we have Multi-D-ADC, which is multi-index with asymmetric distance computation. The ADC part is the same, so we know there's product quantization in there; it's the multi-index part that's different. This is based on IMI, the inverted multi-index. IMI is similar to IVF, but where IVF clusters full vectors into Voronoi cells, IMI first splits your vectors into sub-vectors, product-quantization style, and then clusters each sub-vector space with its own set of Voronoi cells.

So it's like IVF but split across multiple vector subspaces, almost like a tiered Voronoi cell structure, which is what I'm trying to visualize here. With IVF we just have one layer, whereas with IMI, and this is just IMI, not yet the full Multi-D-ADC, our vectors come in, get split into two in this case, and get clustered within each half. When we introduce our query vector, we split it into two as well, find the cells each half belongs to, and take the crossover between those cells to find our nearest-neighbour candidates. That's using an nprobe value of just one in this case. We can put this together quite easily using the index factory string, with the IMI index on the left and, again, PQ on the right, because it's ADC.

So let's go ahead and put that together. I'm going to come up here and reuse the same script from before: I'll change IVF256 to IMI2x8, and remove the x8 from the PQ part. In this case it's actually better if we also include OPQ; unlike before, it does improve the performance of our index, so we'll add that to the front of the string.

What I also want to do is set up a recall function, so we can compare the matches between this index and a simple flat index and see how many of the results they return in common. All we're doing is creating a flat index and writing a function that counts how many of the returned ids match. I'm also going to change k to one hundred; remember we were using ten, which is a little low, and a hundred makes the comparison easier. The reason I want that is to show you the performance difference we get when we start modifying another variable, the nprobe value. So let me wait for this to finish training. Okay, that's finished now, and we've run the search.

Now I'm going to call recall on the results and see what we get. We get 5 percent; okay, that's pretty terrible. The way we can improve it is by increasing the nprobe value, the number of cells in our index that we search. Normally we'd just write index.nprobe and set it to a higher number, but we can't actually do that anymore, because our index now consists of multiple parts. If we were only using the IMI part it would be fine, but because we've added OPQ in front, we can't access nprobe in the normal way. So we have to use something else, and that is, if I write it out, faiss.extract_index_ivf. I know we're not using IVF here but IMI; we use this same function for both, IMI or IVF, whichever one you're using. We just pass our index in, and it extracts the IMI, or IVF, part of the index, which means we can now set the nprobe value on it.

So we set nprobe to 14, run the search again with D, I = index.search(xq, k), call recall on the result, and get 44 percent, so it's increasing. Now what happens if we raise it to 100? We get 69 percent, so you can see that as we keep increasing this number, the recall increases as well. 69 percent looks like roughly the max we're going to get here, although in reality, having tested this before, you can definitely get slightly higher; you should be able to reach about 75 percent.

So that is that index, and through it we get very fast search times. Let's time a search, and let's also reduce nprobe, say to 50, and see if we still get around 60. Okay, I'm going to settle on a value that gives a 66 percent recall and then time index.search; this loops through multiple runs to get the average speed, and we get 179 microseconds, which is pretty fast. So that's the second of our composite indexes done.

Before we move on to the next one, let me actually show you the performance you can expect from this index. This is what I got when testing it. The magenta line in the middle is what we just built; you also have the non-optimized version, without OPQ, which is the blue line; and I wanted to include the flat index in there as well for comparison. The flat index is slower, though, and again its memory requirements are through the roof, whereas this index is incredibly small, not as small as IVFPQ, or IVFADC, but still pretty small.

Okay, so this is our final composite index, and I think the best-performing one from what I've seen, as long as you don't mind increasing the memory usage quite a bit. As before, we can use product quantization with it and still get pretty good results, but the memory usage when using it with flat vectors is of course higher. What we're doing here is keeping the IVF component, so we're not using IMI this time, we're back to IVF like in the first composite index, and then we're adding HNSW, a Hierarchical Navigable Small World index, or graph. IVF splits our data into cells, each of which has a centroid, and during search, rather than exhaustively comparing all of those centroids to your query, we use HNSW to approximate that comparison and speed it up.

So let's go into HNSW very quickly first. It's based on small-world graph theory. In what you can see on the screen right now, the circles are the nodes, or vertices, and between them we have the edges, or links. The theory is that in very large networks, something like Facebook with its billions of users, the average degree of separation, the average number of links between any two people, is very small; for Facebook I think it's something incredibly small like four to six. That means any two random people on Facebook are connected through a chain of only about four to six friends. What we find in these small-world graphs is that we have both long-range and short-range links, and when we're traversing the graph, from one node or vertex to another, we obviously move across it more quickly when we follow a long-range link.

Say you're American, and your friend has a friend over in India, and that friend knows someone important in one of the villages there. Between you and that important person in an Indian village, even though you don't know anyone in India yourself, there are only three steps: you to your friend, your friend to their friend in India, and that friend to the important person. The link between you and your friend is short-range, you're in close proximity, whereas the link between your friend and their friend over in India is a long-range link.

HNSW takes this theory and applies it by splitting the links across different layers. In this case we define long-range as geometrically far apart, and on the highest layer of our HNSW graph we have only these long-range links. It's on that top layer that we enter our query vector and take our first few traversals across the links, and as we go down, one layer per step, the links become shorter and shorter range, the search becomes finer, and at the bottom layer we finally arrive at our approximate nearest neighbor. That's what you can see here.

So here there's the entry point on layer 2, our entry layer, which only has long-range links, so we can make a big jump from our entry point over here all the way over to here. Then we're onto the next step, so we move down a layer and ask: where is the nearest neighbor now? It's over here, so we go there, and then we move down another layer, onto the bottom layer. This is our final step: where is our nearest neighbor? It's over here. And that's what we assume: that this vertex, or node, is the nearest neighbor to our query vector.

The reason we do this is that up at the top we only have long-range links, which means there are very few nodes, whereas down at the bottom we probably have loads of nodes. If we searched only at the bottom, we'd be making loads of small steps, whereas doing it this way we make big steps early on and smaller steps later, which minimizes the number of steps we take, or the number of comparisons we make. Now, how does this apply to IVF?

We said this is an IVF and HNSW composite index, and this is what you can see now. Take that HNSW process we just went through, and imagine that all of those points, those vertices, are cell centroids from IVF. Rather than exhaustively comparing all of those cell centroids, we find the approximately closest centroid very quickly, and then we limit our exhaustive search scope to that single cell, if the nprobe value is one; usually it will be greater than one, but that's pretty much what we're doing with this IVF-HNSW index.

Now, because IVF-HNSW focuses on optimizing the centroid-search portion, it makes sense to increase the number of centroids, because if we increase the number of centroids we reduce the number of vectors within each cell. The optimized part of the search is the centroid comparison; once we're inside the selected cells, we compare all of the vectors in there, an exhaustive search over that scope. So we increase the number of centroids, because that's the optimized portion, and reduce the number of vectors in the final scope, because that's the unoptimized part of this search. So, to pull that together,

we're going to use this index factory string. In terms of increasing speed, like I said, we'll just increase the number of cells; I'll show you another graph in a minute. Okay, so let's go up here; I'm going to modify the string. We're changing it to IVF4096, followed by an underscore and HNSW32: this means we're creating 4096 cells, and the 32 is how many connections, or links, each node or vertex has in the HNSW graph. After that we're using Flat vectors here. As I said before, you can use PQ32, for example, in place of Flat; the recall won't be quite as good, but if you need to reduce the memory usage it's a very good option, you still get very good recall. We're keeping it Flat here, though, so let's run that.

Let's run that We will do the search again. We'll check the recall. See what we get Okay, so that has just finished and we get this recall of 25 cent And now again, we're using the the Emperor value of 1 by default. We probably want to increase this So come down here.

We don't need to change the variable name. I'm just doing it Because it bothers me leaving it as I am I am I and Let's go 60 again. I mean we do have a high number of cells here. So probably want to increase it a bit more that but Let's go with that.

See what we get. I'm getting nice 3% which okay fair enough. That's pretty good like straight out 93% That's nice so let's Time it see how fast that is and you see okay. So before we were getting with with the multi D ADC index we're getting I think 364 Microseconds now getting 5 foot 4 a bit more But it's hardly anything and I mean we're getting nice 2% before we were getting I think 6% so but of course at the same time the memory usage for this is it's like half a gigabyte But we can reduce that by Instead of using flat index.

We just change its peak you say - not like that. We change its peak you say - and That improves it a lot. It's but we're not doing that here. We're just gonna leave it flat And I mean, let's just so if we for example, we know that the maximum number cells here is 1496 let's do an insanely high number and this is the maximum Performance that we're going to get out of it.

So a Hundred percent right because we're using flats of vectors. We can get that hundred percent point So yeah, we can get very good performance 89% there and Yeah, it's super easy. So 432 microseconds. I mean for me, I think that's very good so For that reason, this is definitely one of my favorite indexes.

It's just, again, that the memory usage is high. So let's have a look at the actual performance, search time against recall. I've included a few different numbers of cells here for us to compare. Like I said before, the optimized part is the IVF part, the centroid search, so we can increase the number of cells, and that will decrease the search time; it also decreases the recall a little, but not by much. So you can tune it: you can increase the number of IVF Voronoi cells to improve the search time if you need to.

And that's it; I don't want to cover anything else in this video. We've covered these three indexes, IVFADC, Multi-D-ADC, and IVF-HNSW, and with those I think you can build some pretty cool indexes.

So, thank you very much for watching. I hope it's been useful, and I will see you in the next one.