back to index

Best Indexes for Similarity Search in Faiss


Chapters

0:0 Intro
0:30 IVFADC
3:30 IVFADC in Faiss
7:29 Multi-D-ADC
9:17 Multi-D-ADC in Faiss
14:43 IVF-HNSW
21:39 IVF-HNSW in Faiss
25:58 Outro

Whisper Transcript | Transcript Only Page

00:00:00.000 | I'm welcome to this video on
00:00:02.520 | Well these composite indexes that you can you can see on the screen now
00:00:06.440 | IVF ADC Monte de ADC and IVF HSW in the last video in the series
00:00:13.480 | we sort of introduce composite indexes and the index factory in FICE and
00:00:19.160 | What we're going to do now is obviously go through each of these and
00:00:22.840 | build them
00:00:25.120 | Using what we learned in the last video the index factory
00:00:30.660 | Let's move on to the first one IVF ADC
00:00:33.420 | This is it stands for IVF is in like the inverted file index
00:00:38.980 | followed by asymmetric distance of computation if you watched the video or the videos or read the articles on
00:00:47.780 | product quantization you might remember ADC
00:00:50.780 | ADC is
00:00:53.020 | Well, it's part of product quantization
00:00:58.060 | During indexing so when we first take our vectors, and we add them to the index
00:01:02.820 | There's the sort of two main steps
00:01:05.580 | Vectors are assigned to lists or the cells
00:01:08.220 | That's the IVF part of it, so if I just point that out
00:01:13.740 | So we're taking our vector over here, and IVF assignment is this part so this is our index IVF
00:01:20.960 | Assigning them to IVF cells which are the the voronoi cells that we usually
00:01:27.280 | visualize
00:01:28.940 | And from that you're assigned this cell ID
00:01:31.820 | So the cell ID is the IVF part of our index
00:01:36.140 | And then the second step is that those vectors that we've just assigned they go through a product quantization
00:01:44.100 | compression step so if I
00:01:46.860 | Goes all the way over here, and we create and we create a PQ code
00:01:51.300 | Do you quantize code and they get sold together so did the code to get stored in the?
00:01:57.400 | IVF list
00:01:59.160 | That is a an IVF
00:02:01.880 | ADC index at at its core
00:02:06.880 | During search this is where the the ADC part comes in so ADC is that that asymmetric?
00:02:12.840 | distance computation
00:02:15.360 | So we have to in product quantization
00:02:18.880 | Well when you when you quantize your vectors you have these two
00:02:23.400 | alternative ways of searching the first which is symmetric distance computation over here is not as popular and
00:02:31.080 | what you do with that is you take your query vector which is xq and you quantize it and
00:02:37.880 | Then you compare it to all of your pre quantized
00:02:41.260 | Pre-indexed vectors, so essentially both sides of what you're comparing are quantized, and that's why it's
00:02:49.620 | Asymmetrical or symmetric distance computation the alternative which tends to work better is
00:02:55.000 | Where you don't quantize your query vector, so you just take its cue as it is
00:03:01.180 | You compare that to your pre quantized pre indexed vectors, so the quantized speed vectors
00:03:09.020 | that's where that's where you get the IVF side of things, and then you also get the ADC side of things and
00:03:17.040 | When it can suffice we can we can use this index factory string to pull that together, so we have IVF on the left
00:03:24.020 | Product quantization on the right, so let's let's go to Python and put that together
00:03:30.100 | So I've already imported the data using again the SIFT 1M
00:03:35.420 | Dataset there'll be a link in the description which will explain how you can get that data set from that we have
00:03:43.360 | Show you we have XB. Which is the
00:03:46.420 | index vector the vectors are searching through and then XQ which is just a single vector that we're using to search and
00:03:54.480 | So we put that together
00:03:56.920 | We just write index vice index factory as we as we have been doing we have our number of dimensions
00:04:04.560 | 128 for this data set and then we want to input the the string the index factory string
00:04:11.680 | Which is so we have IVF. Which is the first part. We're gonna split into 256
00:04:17.720 | cells for ourselves and
00:04:20.320 | We are using PQ for the second part. Which is the ADC part
00:04:24.980 | and here we're going to use an M value 32 and
00:04:30.680 | a n bits value of
00:04:34.120 | 8 now
00:04:36.680 | By default this is 8 so we can
00:04:40.560 | Include the times 8 there, but you don't need it
00:04:44.000 | The only way you probably need this is if you want to use a lower n bits value
00:04:48.960 | So say you want to use 4 if trying to go for 12 you
00:04:51.760 | Well, you're not going to be able to unless vice has updated that yet. So basically you're not allowed to go over 8
00:05:00.240 | For now with the the PQ index or IVF PQ index, but that should be being removed in the future version
00:05:09.600 | And then like we usually do we just do index train so we need to train
00:05:15.200 | of course using IVF and
00:05:18.120 | And also put quantization. So we we really really do need to train
00:05:23.240 | And then now we add our vectors now from there we just do search as we normally do so the I equals
00:05:32.080 | index search
00:05:40.120 | Runner we'll see
00:05:43.040 | No, I don't really need to see what we return, but I'll show you the performance of that
00:05:48.400 | so I mean while with us the
00:05:51.080 | Mays or explain so we have M here M is the the number of sub vectors that we're splitting
00:05:57.960 | our full vectors into for the
00:06:00.880 | Product quantization set and the n bits value here is 8 which is the number of bits used by each
00:06:08.280 | Subquantizer within the product quantization set. So you just increase that to perform to increase
00:06:15.000 | Recall performance or accuracy. Okay, and then you see we have our nearest indexes there
00:06:26.000 | Let me show you the actual performance of the metrics for this
00:06:31.080 | So what we have here is a few things on to compare so we have the blue lines here
00:06:38.080 | Oh what we just built so the IVF you see then we also have the the magenta line
00:06:43.080 | that is still IVF ADC, but we've added a
00:06:47.200 | Optimization set to it now for me. There was no real performance increase there
00:06:53.240 | And I just want to compare that to a simple flat
00:06:56.840 | index
00:06:58.480 | Now a flat index if we use that the the memory usage of that is something like half a gigabyte
00:07:05.660 | Whereas the memory usage for our to?
00:07:08.560 | PQ or IVF PQ indexes is I think of around 40 megabytes
00:07:15.120 | So is a pretty yeah a huge difference there
00:07:18.200 | So that that's why you'd use PQ in the first place to minimize your memory usage of the index
00:07:25.760 | But in terms of recall, of course, you can't really be a flat index
00:07:28.940 | So let's move on to the next one. So we have multi D ADC, which is a
00:07:34.220 | Multi index asymmetric distance computation so that the ADC part is the same
00:07:40.440 | so we know okay, there's product quantization in there and
00:07:43.760 | The multi multi D part is what is different
00:07:47.440 | so this is based on IMI, which is a inverted multi index and
00:07:52.680 | IMI is similar to IVF, but it
00:07:56.440 | where we have our voronoi cells in
00:07:59.480 | IVF IMI uses product quantization to split your vectors into
00:08:06.120 | Subvectors and then process them through that
00:08:10.220 | Voronoi cell or cluster them using Voronoi cells. So it's like IVF but split across multiple
00:08:17.360 | Vector subspaces summit like a like a tiered Voronoi cell structure, which is why I'm trying to visualize here
00:08:24.880 | So with IVF, we just have one layer
00:08:27.280 | whereas with
00:08:29.560 | IMI so this is IMI not necessarily the multi D ADC. So this is just I
00:08:37.040 | So our vectors come in they get split into two in this case
00:08:40.960 | And then they get clustered based on that and then when we introduce our query vector
00:08:47.000 | We split that again into two and we find the cells that it belongs to and we find a crossover between those cells
00:08:54.320 | Let's find the nearest neighbor candidates
00:08:58.120 | Okay, and this is using just an n-probe value of one in this case now
00:09:03.600 | We can we can put that together quite easily using this index factor string
00:09:08.600 | So we have the IMI index on the left and then we have again PQ because it's a ADC
00:09:14.840 | On the right. So let's go ahead and put that together
00:09:17.720 | Okay, so I'm gonna come up here. I'm just going to use the same same
00:09:22.560 | Script we use here. So I'm gonna take IMI to multiply by eight there or two two by eight and
00:09:30.600 | Let me remove the by eight there now in this case. It's actually better if we include a PQ
00:09:40.960 | So this in this case, it does improve the performance of our of our index
00:09:47.600 | so we'll add that and
00:09:49.720 | What I want to do
00:09:52.560 | so I'm going to get a recall function so we can compare the matches between this and a
00:09:59.920 | simple flat index and see how many of the
00:10:04.760 | Responses it returns match or line
00:10:09.600 | Okay, so I'm gonna add that in here. All we're doing is creating a flat index and
00:10:15.200 | Creating this this function here to compare how many of them match and what I'm also gonna do is I'm gonna change K
00:10:23.720 | To a hundred and remember we're using ten. That's a little bit low
00:10:27.560 | So let's change it to a hundred and then we can compare those a little easier
00:10:32.560 | the reason I want to do that is to show you the
00:10:36.600 | Performance difference we'll get when we start modifying another variable called the Emperor value
00:10:41.200 | So let me wait for this to to finish training and we'll jump into that
00:10:48.560 | Okay, so that that's finished now and we so we've run this I can
00:10:55.720 | Compare or do it. Yeah. Yeah want to do this and I'm gonna call recall on here and we'll see what it is
00:11:04.320 | So we get 5%
00:11:06.160 | okay, that's pretty terrible and
00:11:08.920 | The way that we we can improve this is
00:11:12.960 | By increasing the Emperor value
00:11:16.480 | So the number of cells in our IVF index that we're searching the any problem
00:11:21.280 | So well, I'm sure what we normally do
00:11:23.760 | Is this so we'd write index and probe and we set this to a higher number
00:11:28.880 | but we can't actually do that anymore because
00:11:32.000 | our index consists of
00:11:35.440 | Let's say multiple indexes. So if we were just using this part would be fine
00:11:40.320 | We could do that but because we've added a BQ we can't access and probe in a normal way. So we have to
00:11:46.680 | We have to use something else and that is if I write it out. It's
00:11:54.080 | Extract index and we need to write IVF. I know we're not using IVF using IMI, but we use this
00:12:01.520 | This here for both of those. I am I or IVF whichever one you're using
00:12:05.920 | And then in here we just pass our index and then from here
00:12:10.920 | So this is kind of extracted the IMI part or the IVF part of our index
00:12:16.400 | And that means that we can now set the Emperor value. So we can set up to 14 and
00:12:21.480 | What we can do is DI equals and there's a search again
00:12:25.860 | New execute K
00:12:28.480 | And we do recall
00:12:31.960 | And get 44 percent. So is increasing
00:12:34.600 | Now what happens we say increase that to 100?
00:12:38.480 | And get 69 so you see is we keep increasing this number the
00:12:45.360 | Performance or the recall increases as well
00:12:48.380 | Okay, so 69 looks like the max we're going to get there
00:12:53.040 | Although in reality, so I've tested this before you you can definitely get higher
00:12:57.440 | than that
00:12:59.760 | You should in fact, well slightly higher. You should be able to go to about 75%
00:13:06.840 | Mean that is that is that index and
00:13:11.040 | Through this we do get very fast search times. So you see here 0.2
00:13:17.840 | Is is a faster I think so
00:13:21.800 | time it
00:13:24.520 | makes a search and
00:13:26.520 | Once it's okay, and let's reduce this. Let's say 50 see if we still get okay
00:13:37.880 | Okay, so I'm gonna go for 60 for a 66 percent recall and then index search
00:13:42.840 | And this this loops through multiple times to get the the average
00:13:48.200 | speed and we get
00:13:51.520 | 179 microseconds, which is pretty fast
00:13:54.240 | So that is that last two of our comps indexes
00:14:01.000 | Let me before we move on to next one. Let me actually show you the performance you can expect from from this one
00:14:08.000 | So this is what I got when when I was testing on on this index
00:14:11.880 | So in the middle there the magenta line, that's what we just built
00:14:17.520 | You also have the the non-optimized version
00:14:20.440 | Which is the the blue line and then I wanted to also include the the flat index in there as well for comparison
00:14:27.360 | The flat index is slower though
00:14:29.680 | And plus again the the memory requirements for that flat index are sort of through the roof
00:14:35.760 | This one is incredibly small not as small as IVF PQ or IVF ADC
00:14:41.880 | But so pretty small
00:14:44.400 | Okay, so this is our final concept index and I think the most
00:14:49.080 | But most performing one from what I've seen as long as you don't mind increasing the memory usage quite a bit
00:14:58.000 | So before we using product quantization, we can use product quantization with this and still get pretty good results actually
00:15:04.760 | But the the memory usage when we're using this with flat vectors is of course more
00:15:10.720 | So what we're doing here, we still have that IVF component
00:15:14.400 | So we're not using I am I this time using IVF like we were in the first
00:15:17.840 | concept index and then we're also adding a
00:15:21.160 | HN SW or
00:15:23.680 | hierarchical navigable small worlds in depth or graph and
00:15:30.120 | What it does is uses IVF
00:15:33.800 | Splits our data into all of its different
00:15:36.960 | Cells or cells which each have a centroid
00:15:40.040 | to identify which cells are closest to your query and
00:15:43.800 | during search time rather than
00:15:46.560 | comparing all of those centroids to your
00:15:49.760 | Search query. So like an exhaustive search across all those centroids. We use HN SW to
00:15:57.280 | Approximate that and optimize or speed it up. So let's just go into HN SW very quickly first
00:16:05.560 | so it's based on the small worlds graph theory and I mean this way you can see on on the
00:16:11.680 | on the screen right now, there's this circle or the nodes of
00:16:15.840 | vertices on the side and then we have all the edges or links between them and the theory is that in a
00:16:23.800 | Very large networks is something like Facebook where there's billions of users
00:16:28.600 | You will find the average degree separation the average number of links between any two people is is very small
00:16:35.520 | I think with the Facebook one is something like credibly small like four or six
00:16:39.160 | And that's from you know, it's between the average between anyone any two random people on Facebook are connected by only four to six
00:16:47.560 | friends and
00:16:49.600 | with this what we find is that in these small world graphs, we have long range and
00:16:55.560 | short range links now when we're traversing a
00:16:59.240 | Around that graph. So from one node or one vertex to another
00:17:03.620 | We obviously move more quickly across the graph when it's a long range link
00:17:08.640 | So let's say your friend you're American your friend has a friend over in India
00:17:16.240 | And they have friend who is someone important in one the villages in India
00:17:23.920 | For example, you know between you and that important person in a village in India, even though you don't know anyone in India
00:17:31.000 | You have your friend that friend over in India and him so there's three
00:17:37.780 | Sets to get him now the set between you and your friend. You're pretty close. You have a very high proximity
00:17:45.320 | So it's a short range link between your friend and their friend over in India
00:17:50.100 | that's a long range link now H&S W takes this theory and
00:17:56.240 | applies it so
00:17:58.400 | what it does is it takes long range links and
00:18:00.920 | Splits them across different layers. So in this case, we define long range as
00:18:07.240 | geometrically far apart and on the highest level of our H&S W graph
00:18:13.920 | We have all of these long-range links and it's on that first level that we input our query vector
00:18:20.360 | And we first find those nearest neighbors. So we take our first few
00:18:25.220 | Traversals across the links and as we go down each layer
00:18:29.540 | so each traversal we go down a layer the links become shorter and shorter range it becomes finer and
00:18:36.820 | At the bottom layer, we finally get to what is our approximate nearest neighbor now
00:18:43.360 | That's what you can see here. So we saw there's the entry point on
00:18:47.180 | Layer 2 which is our entry layer that only has long ranging so we can make it a big jump
00:18:54.220 | from our
00:18:56.820 | Entry point over here over to here
00:18:59.940 | Then we're on to the next step. So we move down the layer
00:19:05.260 | Down here and then we go. Okay, where's it? Where's the nearest neighbor now?
00:19:10.960 | It's over here. Okay, so so we go there and then we're on to the next step
00:19:15.800 | So we again want to move down a layer. So now we're now bottom layer and
00:19:20.880 | This is our final set. It's okay. Where's where's our nearest neighbor? It's over here. Okay, and
00:19:27.320 | Then that's what we assume we assume that this
00:19:30.600 | vertex or node is our nearest neighbor to our query vector and
00:19:35.760 | The reason we do this is if you think okay up here. We only have long range
00:19:41.320 | Links, that means there's very few
00:19:44.180 | Nodes, whereas down here. We probably have loads of nodes. So if we saw it down here, we'd be making loads of steps
00:19:50.720 | Whereas doing it like this we can make big steps early on and then smaller steps later
00:19:56.200 | Which just minimizes the number of steps we need to take or the number of comparisons that we make now
00:20:02.000 | How how does this apply to?
00:20:04.000 | IVF we said okay. This is a IVF and patients of you composite index and
00:20:10.240 | This is what you can see now. So that HHSW process that we just went through
00:20:15.280 | Imagine all those points or those those vertices. All of those are
00:20:20.320 | cell centroids from IVF
00:20:23.520 | So what we're doing is rather than comparing exhaustively comparing all of those cell centroids
00:20:30.200 | We're just comparing a few demo or what finding the approximate closest or nearest neighbor very quickly
00:20:36.080 | And then we limit our exhaustive search scope to that single
00:20:41.400 | So if the end pro value is is one obviously usually it's not going to be born. It shouldn't be greater
00:20:46.300 | Yeah, but that's pretty much what we're doing with this IVF HHSW index now because IVF
00:20:54.080 | HHSW optimizes is centroid or focuses on optimizing the centroid search portion
00:21:00.360 | It makes more sense if we increase the number of centuries because if we increase the number of centroids
00:21:05.720 | we reduce the number of vectors within each cell and
00:21:09.120 | Because our optimized search point here or the approximate search is the centroid comparison and then once we're in those centroids
00:21:17.880 | We're actually comparing all of the vectors in there
00:21:21.280 | So we're forming an exhaustive search on the scope that we do that great
00:21:25.840 | So what we do is we increase the number of centroids because that's the optimized
00:21:31.520 | portion and we reduce the number of vectors that will return in our in our final scope because that's the unoptimized part of
00:21:38.640 | This search so to pull that together. We're going to use this string this index factory string
00:21:44.960 | In terms of increasing speed, like I said, we'll just increase this number here, which we will see
00:21:51.520 | I'll show you another graph in a minute
00:21:54.160 | Okay, so let's go up here
00:21:58.560 | I'm going to compare. Sorry. I'm going to modify this. So we're going to IVF
00:22:03.800 | 4096 followed on with an underscore
00:22:08.820 | by HSW 32 and what this means is we're creating this many cell centroids or cells and
00:22:16.560 | This 32 is how many?
00:22:21.000 | Connections or links each node or vertex has in the HSW graph
00:22:28.040 | And then after that, we were using a flat in those here now. I said before we can use pq32 for example here
00:22:35.600 | But the the accuracy to recall won't be as good but you can use that
00:22:40.360 | Right. So if you if you need to reduce the memory usage with this and it is very good by the way using pq
00:22:46.840 | So you still get very good recall you can do like that now. Let's run that
00:22:53.200 | We will do the search again. We'll check the recall. See what we get
00:22:57.280 | Okay, so that has just finished and we get this recall of 25 cent
00:23:02.760 | And now again, we're using the the Emperor value of 1 by default. We probably want to increase this
00:23:09.120 | So come down here. We don't need to change the variable name. I'm just doing it
00:23:14.040 | Because it bothers me leaving it as I am I am I and
00:23:20.560 | Let's go 60 again. I mean we do have a high number of cells here. So probably want to increase it a bit more
00:23:30.080 | Let's go with that. See what we get. I'm getting nice 3% which okay fair enough. That's pretty good like straight out
00:23:39.880 | That's nice
00:23:41.720 | so let's
00:23:43.920 | Time it see how fast that is and you see okay. So before we were getting with with the multi D
00:23:50.960 | ADC index we're getting I think
00:23:56.400 | Microseconds now getting 5 foot 4 a bit more
00:23:59.560 | But it's hardly anything and I mean we're getting nice 2% before we were getting I think
00:24:07.920 | 6% so but of course at the same time the memory usage for this is it's like half a gigabyte
00:24:14.520 | But we can reduce that by
00:24:16.960 | Instead of using flat index. We just change its peak you say - not like that. We
00:24:23.780 | change its peak you say - and
00:24:27.280 | That improves it a lot. It's but we're not doing that here. We're just gonna leave it flat
00:24:33.320 | And I mean, let's just so if we for example, we know that the maximum number cells here is
00:24:39.500 | 1496 let's do an insanely high number and this is the maximum
00:24:45.640 | Performance that we're going to get out of it. So a
00:24:48.440 | Hundred percent right because we're using flats of vectors. We can get that hundred percent point
00:24:54.800 | So yeah, we can get very good performance 89% there
00:25:02.840 | Yeah, it's super easy. So
00:25:04.840 | 432 microseconds. I mean for me, I think that's very good
00:25:11.880 | For that reason, this is definitely one of my favorite indexes. It's just again the memory usage is high
00:25:18.960 | So let's have a look at the actual performance search time recall
00:25:24.260 | So included a few different number of cells here for us to compare
00:25:29.120 | so like I said before what wet the optimal the optimized part is the IVF part or the
00:25:36.080 | Centroid search part number of cells so we can increase the number of cells and that will decrease the search time
00:25:43.400 | But also decrease the recall a little bit as well, but not that much so
00:25:47.640 | You know, you can you can modify it. You can increase the number of cells IVF viral cells
00:25:54.000 | To improve the search time if you if you need to now, I mean, that's it
00:26:00.400 | I don't want to cover anything else in the video. We've covered those just three indexes
00:26:05.480 | the IVF ADC multi multi D ADC and
00:26:10.200 | IVF patient sub you with those I think you can build some pretty cool indexes. So
00:26:16.000 | Thank you very much for watching. I hope some useful and I will see you in the next one