Best Indexes for Similarity Search in Faiss
Chapters
0:00 Intro
0:30 IVFADC
3:30 IVFADC in Faiss
7:29 Multi-D-ADC
9:17 Multi-D-ADC in Faiss
14:43 IVF-HNSW
21:39 IVF-HNSW in Faiss
25:58 Outro
Well, these composite indexes that you can see on the screen now are IVFADC, Multi-D-ADC, and IVF-HNSW. In the last video in the series we introduced composite indexes and the index factory in Faiss, and what we're going to do now is go through each of these and build them with the index factory, using what we learned in the last video.
Starting with IVFADC: the IVF stands for the inverted file index, and the ADC stands for asymmetric distance computation, which you'll recognize if you've watched the earlier videos or read the articles on those topics.
During indexing, when we first take our vectors and add them to the index, that's where the IVF part comes in. So if I just point that out: we're taking our vector over here, and the IVF assignment is this part, so this is our IndexIVF assigning the vectors to IVF cells, which are the Voronoi cells we usually visualize. Then the second step is that those vectors we've just assigned go through product quantization, all the way over here, and we create a PQ code, a quantized code, and those codes get stored together in the index.
During search is where the ADC part comes in, the asymmetric distance computation. When you quantize your vectors, you have two alternative ways of searching. The first, symmetric distance computation, over here, is not as popular: you take your query vector xq, quantize it, and then compare it to all of your pre-quantized, pre-indexed vectors, so both sides of the comparison are quantized, which is why it's called symmetric distance computation. The alternative, which tends to work better, is asymmetric distance computation, where you don't quantize your query vector; you just take xq as it is and compare it to your pre-quantized, pre-indexed vectors, the quantized database vectors. So that's where you get the IVF side of things and the ADC side of things.
In Faiss we can use this index factory string to pull that together: we have IVF on the left and product quantization on the right. So let's go to Python and put that together.
So I've already imported the data, again using the SIFT1M dataset; there'll be a link in the description explaining how you can get that dataset. From that we have our index vectors, the vectors we're searching through, and then xq, which is just a single vector we're using to search. We just write index = faiss.index_factory(...) as we have been doing: we have our number of dimensions, 128 for this dataset, and then we pass in the index factory string. So we have IVF as the first part, and we're going to split into 256 cells, and we're using PQ for the second part, which is the ADC part,
and here we're going to use an M value of 32. I'll include the x8 there, but you don't need it. The only reason you'd really need it is if you want to use a lower nbits value, say 4. If you're trying to go higher, to 12, well, you're not going to be able to unless Faiss has updated that; basically you're not allowed to go over 8 for now with the PQ, or IVFPQ, index, but that restriction should be removed in a future version.
And then, like we usually do, we just do index.train, because we need to train both the IVF clustering and the product quantization, so we really do need to train here. Then we add our vectors, and from there we just do a search as we normally would, so D, I = index.search(...).
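A minimal sketch of that in code, assuming xb and xq are the SIFT1M database and query vectors already loaded as float32 arrays (the variable names here are mine):

```python
import faiss

d = 128  # SIFT1M dimensionality
# xb: database vectors, e.g. shape (1_000_000, 128); xq: query vectors, e.g. shape (1, 128)

# IVF with 256 Voronoi cells, then PQ with M=32 subquantizers at nbits=8 each
index = faiss.index_factory(d, "IVF256,PQ32x8")

index.train(xb)  # trains both the IVF clustering and the PQ codebooks
index.add(xb)    # encodes and stores the database vectors

k = 10
D, I = index.search(xq, k)  # D: distances, I: ids of approximate nearest neighbours
print(I)
```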
Now, I don't really need to show you what that returns, but I will show you the performance of it. Let me explain the parameters first: M here is the number of subvectors we split each vector into for the product quantization step, and the nbits value of 8 is the number of bits used by each subquantizer within the product quantization step. You can increase those to improve recall, or accuracy. Okay, and then you see we have our nearest indexes returned there.
Let me show you the actual performance metrics for this. What we have here are a few things to compare: the blue line is what we just built, the IVFADC index, and then we also have the magenta line, which is the same index with an optimization step added to it; for me there was no real performance increase there. And I just want to compare that to a simple flat index. Now, if we use a flat index, its memory usage is something like half a gigabyte, whereas the IVFPQ index is, I think, around 40 megabytes. That's why you'd use PQ in the first place: to minimize the memory usage of the index. But in terms of recall, of course, you can't really beat a flat index.
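Those memory figures are easy to check yourself by writing the index to disk and looking at the file size; a rough sketch (the helper name is mine):

```python
import os
import faiss

def index_size_mb(index, path="tmp.index"):
    """Write the index to disk and report its size, a rough proxy for memory usage."""
    faiss.write_index(index, path)
    size = os.path.getsize(path) / 1024 ** 2
    os.remove(path)
    return size

print(f"IVFADC index: {index_size_mb(index):.1f} MB")
```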
So let's move on to the next one, Multi-D-ADC, which is multi-index asymmetric distance computation. The ADC part is the same, so we know there's product quantization in there. This one is based on IMI, the inverted multi-index. Like IVF, IMI clusters vectors into Voronoi cells, but it first uses product quantization to split your vectors into subvectors and then clusters each subspace separately. So it's like IVF but split across multiple vector subspaces, somewhat like a tiered Voronoi cell structure, which is what I'm trying to visualize here.
This is just IMI, not the full Multi-D-ADC index. Our vectors come in, they get split into two subvectors in this case, and they get clustered based on that. Then when we introduce our query vector, we split it into two as well, find the cells that each half belongs to, and take the crossover between those cells. And this is using an nprobe value of just one in this case.
We can put that together quite easily using this index factory string: we have the IMI index on the left and then, again, PQ on the right, because it's ADC. So let's go ahead and put that together.
Okay, so I'm going to come up here and just reuse the same script we had. I'm going to change the string to IMI2x8, and let me remove the x8 from the PQ part. Now, in this case it's actually better if we include the OPQ pre-processing as well; for this index it does improve the performance.
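A sketch of the Multi-D-ADC build, reusing the same xb and xq as before; the commented-out OPQ variant is my reading of the optimization step mentioned here, not something confirmed by the video:

```python
import faiss

d = 128
# Inverted multi-index with 2 subspaces of 8 bits each, plus PQ with M=32
index = faiss.index_factory(d, "IMI2x8,PQ32")

# Possible variant with OPQ pre-processing (my assumption of the optimization
# step that improved results here):
# index = faiss.index_factory(d, "OPQ32,IMI2x8,PQ32")

index.train(xb)  # IMI training is typically slower than IVF256
index.add(xb)

k = 100
D, I = index.search(xq, k)
```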
I'm also going to add a recall function so we can compare the matches between this index and a flat index. All we're doing is creating a flat index and writing this function to count how many of the returned IDs match. What I'm also going to do is change k to 100; remember we were using 10, which is a little low, so let's change it to 100 and then we can compare results a little more easily.
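The recall function isn't shown in full in the transcript; a minimal version along those lines, using a flat index as ground truth (the function and variable names are mine):

```python
import faiss
import numpy as np

# Exhaustive flat index to produce ground-truth neighbours
flat_index = faiss.IndexFlatL2(d)
flat_index.add(xb)

k = 100
_, I_true = flat_index.search(xq, k)  # exact top-k ids for the query

def recall(I_pred):
    """Fraction of the approximate top-k ids that appear in the exact top-k."""
    return np.isin(I_pred[0], I_true[0]).sum() / I_true.shape[1]

_, I = index.search(xq, k)
print(f"recall@{k}: {recall(I):.2f}")
```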
The reason I want to do that is to show you the performance difference we get when we start modifying another parameter, the nprobe value. So let me wait for this to finish training and we'll jump into that.
Okay, so that's finished now, and we've run the search, so I can compare the results. I'm going to call recall on the returned IDs and we'll see what we get.
The number of cells in our index that we search is controlled by the nprobe value; normally we'd write index.nprobe and set it to a higher number. But we can't actually do that directly here, because our index is composed of multiple indexes. If we were just using the IMI and PQ part it would be fine, we could do that, but because we've added the OPQ step we can't access nprobe in the normal way. So we have to use something else, and that is, if I write it out, faiss.extract_index_ivf. I know we're not using IVF here, we're using IMI, but we use this same function for both of them, IMI or IVF, whichever one you're using. Then in here we just pass our index, and this extracts the IMI part, or the IVF part, of our index. That means we can now set the nprobe value, so let's set it to 14, and then we do D, I = index.search again.
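In code that looks roughly like this, reusing the recall helper from above:

```python
import faiss

# The OPQ pre-transform wraps the IVF/IMI index, so extract the inner index first
ivf = faiss.extract_index_ivf(index)
ivf.nprobe = 14  # number of cells to visit per query

D, I = index.search(xq, k)  # search still goes through the top-level index
print(f"recall@{k}: {recall(I):.2f}")
```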
Now, what happens if we increase that to 100? We get 69 percent, so you can see that as we keep increasing this number, the recall improves. Okay, so 69 looks like about the max we're going to get there, although in reality, and I've tested this before, you can definitely get slightly higher; you should be able to reach about 75 percent. With this index we do get very fast search times, so you see here, around 0.2 milliseconds. Okay, let's reduce nprobe, say to 50, and see if we still get a decent result. Okay, so I'm going to go with 60 for a 66 percent recall, and then time index.search; this loops through multiple times to get the average.
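In a notebook this is just %timeit index.search(xq, k); outside Jupyter, a rough equivalent that averages over a few runs might look like this (a sketch, not the exact code from the video):

```python
import time

def avg_search_time(index, xq, k=100, runs=20):
    """Average wall-clock time of a single query over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        index.search(xq, k)
    return (time.perf_counter() - start) / runs

print(f"avg search time: {avg_search_time(index, xq) * 1000:.3f} ms")
```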
So that's the second of our composite indexes. Before we move on to the next one, let me actually show you the performance you can expect from this one. This is what I got when I was testing this index: in the middle there, the magenta line, that's what we just built, alongside the IVFADC index from before, which is the blue line, and then I also wanted to include the flat index in there for comparison. And again, the memory requirements for that flat index are sort of through the roof, whereas this one is incredibly small, though not as small as the IVFPQ, or IVFADC, index.
Okay, so this is our final composite index, and I think the best-performing one from what I've seen, as long as you don't mind increasing the memory usage quite a bit. Before, we were using product quantization; we can use product quantization with this one too and still get pretty good results, but the memory usage when we use it with flat vectors is of course higher. So what we have here is that we still have the IVF component; we're not using IMI this time, we're using IVF like we were in the first index.
This time we pair it with HNSW, the hierarchical navigable small worlds graph, which is used to identify which cells are closest to your search query. So rather than an exhaustive search across all of those centroids, we use HNSW to approximate that and speed it up. Let's go into HNSW very quickly first.
HNSW is based on small-world graph theory. In the graph you can see on the screen right now, the circles are the nodes, or vertices, and then we have all the edges, or links, between them. The theory is that in very large networks, something like Facebook with billions of users, you find that the average degrees of separation, the average number of links between any two people, is very small; I think for Facebook it's something incredibly small, like four to six. That means any two random people on Facebook are connected by only around four to six links.
In these small-world graphs we find both long-range and short-range links. When we're traversing around the graph, from one node or vertex to another, we obviously move across the graph more quickly when it's a long-range link. So let's say you're American, your friend has a friend over in India, and that friend knows someone important in one of the villages in India. Between you and that important person in a village in India, even though you don't know anyone in India, you have your friend, their friend over in India, and then him, so there are three steps to reach him. Now, the step between you and your friend: you're pretty close, you have very high proximity, so it's a short-range link. Between your friend and their friend over in India,
that's a long-range link. Now, HNSW takes this theory, and what it does is take those long-range and short-range links and split them across different layers. In this case we define long range as geometrically far apart, and on the highest layer of our HNSW graph we have all of the long-range links. It's on that top layer that we enter with our query vector and first look for its nearest neighbors. We take our first few traversals across those links, and as we go down through the layers the links become shorter and shorter range, the search becomes finer, and at the bottom layer we finally arrive at our approximate nearest neighbor. That's what you can see here: there's the entry point on layer 2, which is our entry layer and only has long-range links, so we can make a big jump.
Then we're on to the next step: we move down a layer, down here, and ask, okay, where's the nearest neighbor now? It's over here, so we go there, and then we're on to the next step, where we again move down a layer. Now we're on the bottom layer, and this is our final step: where's our nearest neighbor? It's over here. And that's what we assume: we assume that this vertex, or node, is the nearest neighbor to our query vector. The reason we do this is that up here we only have a few nodes with long-range links, whereas down here we probably have loads of nodes. If we searched only down here we'd be making loads of small steps, whereas doing it like this we can make big steps early on and smaller steps later, which minimizes the number of steps we need to take, or the number of comparisons we make.
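As an aside, before combining it with IVF: a plain standalone HNSW index in Faiss looks roughly like this (the parameter values are just illustrative, not taken from the video):

```python
import faiss

d = 128
M = 32  # links per vertex in the HNSW graph
hnsw_index = faiss.IndexHNSWFlat(d, M)

hnsw_index.hnsw.efConstruction = 64  # build-time search breadth (higher = better graph, slower build)
hnsw_index.hnsw.efSearch = 32        # query-time search breadth (higher = better recall, slower)

hnsw_index.add(xb)  # HNSW with flat vectors needs no train() step
D, I = hnsw_index.search(xq, 10)
```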
Coming back to the composite index: we said this is an IVF and HNSW composite index, and that's what you can see now. Take that HNSW process we just went through and imagine that all of those points, those vertices, are our IVF cell centroids. Rather than exhaustively comparing all of those cell centroids, we're just comparing a few of them, finding the approximately nearest centroid very quickly, and then we limit our exhaustive search scope to that single cell, if the nprobe value is one; obviously it usually won't be one, it will be greater than that.
But that's pretty much what we're doing with this IVF-HNSW index. Now, because HNSW focuses on optimizing the centroid search portion, it makes more sense to increase the number of centroids: if we increase the number of centroids, we reduce the number of vectors within each cell. Our optimized, approximate search here is the centroid comparison, and once we're inside those cells we're comparing all of the vectors in there, so we're performing an exhaustive search within that scope. So what we do is increase the number of centroids, because that's the optimized portion, and reduce the number of vectors that end up in our final search scope, because that's the unoptimized part of
this search. So, to pull that together, we're going to use this index factory string. In terms of increasing speed, like I said, we'll just increase the number of cells, which we'll see in a moment. I'm going to modify the string: we have IVF with the number of cells, followed by _HNSW32. What this means is that the number after IVF is how many cell centroids, or cells, we're creating, and the 32 is the number of connections, or links, each node or vertex has in the HNSW graph. After that we're storing Flat vectors. Now, I said before that we can use, for example, PQ32 here instead; the accuracy, the recall, won't be as good, but you can use it if you need to reduce the memory usage, and it is very good with PQ by the way, you still get very good recall. So you can do it like that, but for now let's run it with Flat.
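Pulled together, the build looks roughly like this; the transcript doesn't make the exact cell count clear, so the 4096 below is only a placeholder, and the commented PQ line is the memory-saving alternative just mentioned:

```python
import faiss

d = 128
# IVF whose coarse quantizer is an HNSW graph with 32 links per node, flat vector storage.
# The number of cells (4096) is a placeholder, not necessarily the value used in the video.
index = faiss.index_factory(d, "IVF4096_HNSW32,Flat")

# Memory-saving alternative (lower recall):
# index = faiss.index_factory(d, "IVF4096_HNSW32,PQ32")

index.train(xb)
index.add(xb)

k = 100
D, I = index.search(xq, k)
print(f"recall@{k}: {recall(I):.2f}")
```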
We'll do the search again, check the recall, and see what we get. Okay, so that has just finished, and we get a recall of 25 percent. Again, we're using the default nprobe value of 1, so we probably want to increase that. So come down here; we don't need to change the variable name, I'm just doing it because it bothers me leaving it named after IMI. Let's go with 60 again, although we do have a high number of cells here, so we probably want to increase it a bit more than that. Let's go with that and see what we get. I'm getting 93 percent, which, okay, fair enough, that's pretty good straight out of the box.
Let's time it and see how fast that is. Compared to what we were getting before with the Multi-D-ADC index, the difference is hardly anything, and we're getting 92 percent recall where before we were getting, I think, 66 percent. But of course, at the same time, the memory usage for this index is something like half a gigabyte.
If we wanted to reduce that, instead of using flat vectors we could just swap the Flat for, say, PQ32. That improves the memory usage a lot, but we're not doing that here; we're just going to leave it flat.
And, for example, we know that the maximum number of cells here is 1496, so let's set nprobe to an insanely high number, and that gives us the maximum performance we're going to get out of this index, which is 100 percent, because we're using flat vectors, so we can reach that 100 percent point. So yeah, we can get very good performance: 89 percent recall there at 432 microseconds, which, for me, is very good.
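To see that recall versus search-time trade-off yourself, a small nprobe sweep, reusing the recall and avg_search_time helpers sketched earlier (the nprobe values are arbitrary examples):

```python
ivf = faiss.extract_index_ivf(index)  # works whether or not the index is wrapped

for nprobe in (1, 16, 64, 256, 1024):
    ivf.nprobe = nprobe
    _, I = index.search(xq, k)
    t = avg_search_time(index, xq, k)
    print(f"nprobe={nprobe:5d}  recall@{k}={recall(I):.2f}  time={t * 1000:.2f} ms")
```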
For that reason this is definitely one of my favorite indexes; it's just, again, that the memory usage is high. So let's have a look at the actual performance, search time, and recall. I've included a few different numbers of cells here for us to compare. Like I said before, the optimized part is the IVF part, the centroid search, so we can increase the number of cells and that will decrease the search time; it also decreases the recall a little, but not by much. So you can modify it: you can increase the number of IVF Voronoi cells to improve the search time if you need to. And I mean, that's it.
I don't want to cover anything else in this video. We've covered those three composite indexes, IVFADC, Multi-D-ADC, and IVF-HNSW, and with those I think you can build some pretty cool indexes. So, thank you very much for watching, I hope it's been useful, and I will see you in the next one.