Choosing Indexes for Similarity Search (Faiss in Python)


Chapters

0:00 Intro
0:53 Getting the data
2:33 Flat indexes
7:27 High-level indexes
16:01 Coding
21:22 Performance
22:40 Inverted File Index
26:08 Implementation
29:50 Summary

Transcript

Hi, welcome to the video. I'm going to take you through a few different indexes in Faiss today, so Faiss is a library for similarity search, and we're going to learn how we can decide which index to use based on our data. Now these indexes are reasonably complex, but we're going to just have a high-level look at each one of them.

At some point in the future we'll go into more depth for sure, but for now this is what we're going to do. So we're going to cover the indexes that you see on the screen at the moment. So we have the flat indexes, which are just plain and simple, nothing special going on there. Then we're going to have a look at LSH, or locality sensitive hashing, HNSW, which is Hierarchical Navigable Small Worlds, and then finally we're going to have a look at an IVF index as well.

So the first thing I'm going to show you is how to get some data for following along. So we're going to be using the SIFT1M dataset, which is 1 million vectors that we can use for testing similarity search. Now there's a little bit of code, so I'm just going to show it to you.

So here we're just downloading the data. There'll be a notebook for this in the description as well, so you can just use that and copy things across. But we're downloading it from here, and this will give us a tar file. So we download that, and then here all we're doing is extracting all the files from inside that tar file.
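For reference, a rough sketch of that download-and-extract step. The URL below is the usual TEXMEX source for SIFT1M, an assumption on my part; the notebook linked in the description may use a different one:

```python
import shutil
import tarfile
import urllib.request

# assumed source: the standard TEXMEX hosting for SIFT1M
url = 'ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz'

# download the tar file to disk
with urllib.request.urlopen(url) as response, open('sift.tar.gz', 'wb') as f:
    shutil.copyfileobj(response, f)

# extract all files from inside the tar file
with tarfile.open('sift.tar.gz', 'r:gz') as tar:
    tar.extractall()
```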

And then here I'm reading everything into the notebook. So inside that tar file we get these fvecs files, and we have to open them in a certain way, which is what we're doing here. So we're setting up a function to read them here. And then here I'm reading in two files.

So we get a few different files here (sorry, this should read sift). So we get the base data, which is going to be the data that we're going to search through, and then we also have query data here. And then what I'm doing here is just selecting a single query, a single vector to query with, rather than all of them, because we get quite a few in there.

And then here we can just see, so this is our query vector, xq. And then we also have wb here, which is going to be the data that we'll index and search through. And we can see some of it there as well. So that's how we get the data.
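As a sketch, reading fvecs files and pulling out the two arrays might look like this. Each fvecs record is an int32 dimension followed by that many float32 values; the file paths assume the standard layout inside the SIFT1M tar:

```python
import numpy as np

def read_fvecs(path):
    # read the raw bytes, then strip the leading dimension from each record
    data = np.fromfile(path, dtype='float32')
    d = data.view('int32')[0]              # vector dimensionality (128 here)
    return data.reshape(-1, d + 1)[:, 1:]  # drop the dimension column

# wb: the 1M base vectors we will index and search through
wb = read_fvecs('sift/sift_base.fvecs')
# xq: a single query vector (kept 2D, the shape faiss expects)
xq = read_fvecs('sift/sift_query.fvecs')[:1]
```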

Let's move on to some flat indexes. So what you can see at the moment is a visual representation of a flat L2 index. Now up here, this is what we're doing. So we have all of these points; these are all of the wb points that we saw before, and this is our query vector.

And we just calculate the distance between all of those. And then what we do is just take the top three. So the top k in reality, but in this case it's top three. Now we also have IP, so we have both L2 distance and IP (inner product) similarity as well. IP works in a different way.
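To make the two metrics concrete, here's a small NumPy sketch with toy stand-ins for wb and xq; this is just an illustration of what each one computes, not the video's code:

```python
import numpy as np

# toy stand-ins for wb and xq
pts = np.random.rand(1000, 128).astype('float32')
q = np.random.rand(128).astype('float32')

# L2 (Euclidean) distance: lower means more similar
l2_dist = np.linalg.norm(pts - q, axis=1)
top3_l2 = np.argsort(l2_dist)[:3]

# inner product (IP): higher means more similar
ip_score = pts @ q
top3_ip = np.argsort(-ip_score)[:3]
```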

So with IP we're using a different formula to calculate the distance, or rather the similarity, so it's not exactly as you see in the visualization. But before we write any code, I just want to say that flat indexes are 100% quality. And typically what we want to do with Faiss, and similar search libraries, is balance the search quality versus the search speed.

Higher search quality usually means slower search speed. And flat indexes are just pure search quality, because they are an exhaustive search: they check the distance between your query vector and every other vector in the index. That's fine if you don't have a particularly big dataset, or you don't care about time.

But if you do, then you probably don't want to use that, because it can take an incredibly long time. If you have a billion vectors in your dataset and you do 100 queries a minute, then as far as I know, that's not really feasible. And if you were going to run it, you'd need some pretty insane hardware.

So we can't use flat indexes and exhaustive search in most cases. But I will show you how to do it. So first I'm just going to define the dimensionality of our data, which is 128, as we can see up here. I'm also going to say how many results we want to return.

I'm going to say 10. Okay. We also need to import faiss before we do anything. And then we can initialize our index. So I said we have two: faiss.IndexFlatL2 or IndexFlatIP. I'm going to use IP because it's very slightly faster. From my testing it seems very slightly faster, but there's hardly any difference in reality.

So that initializes our index, and then we want to add our data to it. So we add wb, and then we perform a search. So let me create a new cell and let me just run this quickly. Okay. And what I'm going to do is just time it, so you can see how long this takes as well.

So I'm going to use the time magic, and we're going to do D, I = index.search. Then in here we have our query vector and how many samples we'd like to return, so I'm going to go with k. Okay. So that was reasonably quick, and that's because we don't have a huge dataset and we're just searching for one query.

So it's not really too much of a problem there. But what I do want to show you is that if we print out I, it returns all of the IDs, or the indexes, of the 10 most similar vectors. Now I'm going to use that as a baseline for each of our other indexes.

So this is, like I said, a hundred percent quality and we can use this accuracy to test out other indexes as well. So what I'm going to do is take that and convert it into a list. And if we just have a look at what we get, we'll see that we get a list like that.
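Putting those steps together, a minimal sketch of the flat index, assuming wb and xq are loaded as earlier:

```python
import faiss

d = 128  # dimensionality of the SIFT1M vectors
k = 10   # number of nearest neighbours to return

index = faiss.IndexFlatIP(d)  # exhaustive inner-product index
index.add(wb)                 # add the full dataset

D, I = index.search(xq, k)    # D: scores, I: ids of the top-k matches

# keep the exhaustive results as the 100%-quality baseline
baseline = I[0].tolist()
```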

And we're just going to use that, like I said, to see how our other indexes are performing. So we'll move on to the other indexes. And like I said before, we want to try and go from this, which is the flat indexes, where it's just a hundred percent search quality to something that's more 50/50.

But it depends on our use case as well. Sometimes we might want more speed, sometimes higher quality. So we will see a few of those through these indexes. So we start with LSH. At a very high level, LSH works by grouping vectors into different buckets. Now, what we can see on the screen now is a typical hashing function, like for a Python dictionary.

And what these hashing functions do is they try to minimize collisions. So a collision is where we would have the case of two items, maybe say these two, being hashed into the same bucket. And with a dictionary, you don't want that because you want every bucket to be an independent value.

Otherwise it increases the complexity of extracting your values from a single bucket if they've collided. Now, LSH is slightly different, because we actually do want to group things. So we can see it as a dictionary, but whereas before we were avoiding those collisions, you can see here every key being put into a completely different bucket.

Rather than doing that, we're trying to maximize collisions. So you can see here that we've pushed all three of these keys into this single bucket here. And we've also pushed all of these keys into this single bucket. So we get groupings of our values. Now, when it comes to performing our search, we process our query through the same hashing function and that will push it to one of our buckets.

Now, in the case of it maybe appearing in this bucket here, we use Hamming distance to find the nearest bucket. And then we can search, or rather we restrict our scope, to these values. So we just restricted our scope there, which means that we do not need to search through everything.

So we are avoiding searching through those values down there. Now let's have a look at how we implement that. So it's pretty straightforward. All we do is index = faiss.IndexLSH. We have our dimensionality, then we also have this other variable, which is called nbits. So I will put that in a variable up here, nbits.

And what I'm going to do is make it d multiplied by four. So nbits is something we'll have to scale with the dimensionality of our data, which brings in another problem, which I'll mention later on: the curse of dimensionality. But I'll talk more about it in a moment.

So here we have nbits, and then we add our data like we did before. And then we can search our data just like we did before. So we time it, and we want D, I = index.search. And we are searching using our query vector, and we want to return 10 items.
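A sketch of those LSH steps, reusing wb, xq, d, k and the baseline from before:

```python
import numpy as np
import faiss

nbits = d * 4  # resolution of the binary hash codes

index = faiss.IndexLSH(d, nbits)
index.add(wb)

D, I = index.search(xq, k)

# how many of the top-k also appear in the exhaustive baseline?
print(np.in1d(I[0], baseline))
```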

Okay. So quicker speed, see here. And what we can also do is compare the results to our 100% quality index, the flat index. And we do that using np.in1d(I, baseline). Okay. So I'm just going to look at it visually here. So we can see we have quite a lot of matches.

So plenty of trues, a couple of falses, true, false, false, false, false. So these are the top 10 that have been returned using our LSH algorithm. And we're checking if they exist in the baseline results that we got from our flat index earlier. And we're returning that most of them are present in that baseline.

So most of them do match. So it's reasonably good recall there. So that's good. And it was faster. So we've got 17.6 milliseconds here. How much did we get up here? We got 157 milliseconds. So slightly less accurate, but what is that? 10 times faster. So it's pretty good.

And we can mess around with nbits. We can increase it to increase the accuracy of our index, or decrease it to increase the speed. So again, it's just trying to find that balance between both. Okay. So this is a graph just showing you the recall with different nbits values.

So as we saw before, we increase the n bits value for good recall, but at the same time, we have that curse of dimensionality. So if we are multiplying our dimensionality value D by eight in order to get a good recall, then if we have a dimensionality of four, that's not a very high number.

So it's going to be reasonably fast. But if we increase the dimensionality to, for example, 512, that becomes very, very complex very quickly. So you have to be careful with your dimensionality. Lower dimensionality is very good for LSH; otherwise it's not so good. You can see that here, at the bottom, using the same dataset.

So with an nbits value of d multiplied by two, LSH is super fast. It's faster than our flat index, which is what you would hope. But if we increase the nbits value quite a bit, say because you want very high recall, then it gets out of hand very quickly and our search time just grows massively.

So you kind of have to find that balance. But what we got before was pretty good. We had d multiplied by four, I think, and we got reasonable performance and it was fast. So it's good. And that also applies to the index size as well. With a low nbits value the index size isn't too bad, but with higher nbits it's pretty huge. So that's also something to think about.
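If you want to reproduce that kind of trade-off measurement yourself, a rough sweep might look like this; timings and recall will of course vary with your hardware and data:

```python
import time
import numpy as np
import faiss

for nbits in [d * 2, d * 4, d * 8]:
    index = faiss.IndexLSH(d, nbits)
    index.add(wb)
    start = time.perf_counter()
    D, I = index.search(xq, k)
    ms = (time.perf_counter() - start) * 1000
    recall = np.in1d(I[0], baseline).mean()
    print(f'nbits={nbits}: recall={recall:.2f}, search time={ms:.1f}ms')
```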

Now let's move on to HNSW. Now HNSW, well the first part of it is NSW, which is Navigable Small World graphs. What makes a graph "small world" is essentially that the graph can be very large, but the number of hops, the number of steps you need to take between any two vertices (the points), is very low.

So in this example here, we have this vertex over here. And to get over to this one on the opposite side, we need to take one, two, three, four hops. And this is obviously a very small network, so it doesn't really count, but you can see this sort of behavior in very large networks.

So I think in 2016, there was a study from Facebook. And at that point, I don't remember the exact number of people that they had on the platform, but I think it's in the billions. And they found that the average number of hops that you need to take between any two people on the platform is like 3.6.

So that's a very good example of a Navigable Small World graph. Now hierarchical NSW graphs, which is what we are using, are built in the same way as an NSW graph, but then they're split across multiple layers, which is what you can see here. And when we are performing our search, the path it takes will hop between different layers in order to find our nearest neighbor.

Now it's pretty complicated, and this is really, I think, oversimplifying it a lot, but that's the general gist of it. I'm not going to go any further into it. We will, I think, in a future video and article. Now let's put that together in code. So we have a few different variables here.

We have M, which I'm going to set to 16. And M is the number of connections that each vertex has. So of course, more connections means greater connectivity, and we're probably going to find our nearest neighbors more accurately. Then efSearch, which is the depth of our search every time we perform a search.

So we can set this to a higher value if we want to search more of the network, or a low value if we want to search less of the network. Obviously low value is going to be quicker. High value is going to be more accurate. And then we have EF construction.

Now this, similar to efSearch, is how much of the network we will search, but not during the actual search: during the construction of the network. So this is essentially how thoroughly and accurately we're going to build the network in the first place. So this will increase the add time, but it makes no difference to the search time.

So it's good to use a high number, I think, for this one. So we'll initialize our index. And we have this faiss.IndexHNSWFlat. So we can use different vector storage types here; there's a PQ version, for example. And essentially, what that's going to do is make the search faster, but slightly less accurate.

Now this is already really fast with flat vectors, and that's what we're going to stick with. But again, like I said, we will return to this at some point in the future and cover it in a lot more detail for sure. So we pass in the dimensionality, and we need to pass in our M value here as well.

Now we want to apply those two parameters. So we have index.hnsw.efSearch, which is obviously efSearch. And then we also have index.hnsw.efConstruction, obviously the efConstruction. So that should be everything ready to go. And all we want to do now is add our data. So index.add(wb).
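Here's the whole HNSW setup as one sketch, reusing d, k, wb and xq. The video doesn't spell out its starting efSearch and efConstruction values, so the ones below are placeholders:

```python
import faiss

M = 16                # connections per vertex
ef_search = 8         # search depth at query time (placeholder value)
ef_construction = 64  # search depth while building the graph (placeholder)

index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efSearch = ef_search
# efConstruction must be set before adding vectors to have any effect
index.hnsw.efConstruction = ef_construction

index.add(wb)  # slow: the graph is built as the vectors are added

D, I = index.search(xq, k)
```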

Now, like I said, we have that efConstruction set to a reasonably high value, so you can see this is already taking a lot longer than the previous indexes to actually add our vectors into it. But it's still not going to take that long. And then once it is done, we are going to do our search, just like we did every other time.

So we have D, I = index.search. And we are going to pass in our query, and also k. So 43.6 seconds to add the vectors there, so a fair bit longer. And then look at this: super fast, just like that, 3.7 milliseconds. So much faster than the last one.

The last one, LSH, was 17.6 milliseconds, and the flat index was 157. So really quick, which is cool. So how's the performance? Let's have a look. OK, so we get quite a few falses here, and only a couple of trues. So OK, it's not so great.

It was really fast, but it's not very accurate. But fortunately, we can fix that. So let's increase our efSearch. I'm going to increase it a fair bit; let's go with 32. And this is probably, I would imagine, more than enough to get good performance. So run this, and run this.
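Concretely, that re-run is just something like:

```python
import numpy as np

index.hnsw.efSearch = 32    # search a deeper portion of the graph
D, I = index.search(xq, k)  # no need to rebuild the index
print(np.in1d(I[0], baseline))
```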

OK, and now we see we get pretty good results. Now, the query time is higher, so it's just a case of balancing it, because this is now higher than LSH. But what we can also do is increase efConstruction: raising that value increases the construction time, but not the search time.

A lot of flexibility with this, and it can be really fast. HNSW is essentially one of the best performing indexes that you can use. If you look at the current state of the art, a lot of them are HNSW, or they're based on HNSW in some way or another.

So these are good ones to go with; you just need to play around with them a little bit. So here are a few of the performance numbers I found using the same dataset, from messing around. So we have the efConstruction values down here, starting with 16 over here, up to 64.

efSearch values over here, and our M values over here. And we've got pretty good recall at 64 on the efConstruction. So efConstruction is a really good one to just increase, because it doesn't increase your search time, which is pretty cool, I think. And then here is the search time.

Again, for HNSW, by M and efSearch. Obviously, I didn't include efConstruction there, because it doesn't make a difference. And this is the one thing with HNSW: the index size is absolutely huge. So that's just one thing to bear in mind. The index can take a lot of memory.

But otherwise, really, really cool index. And then that leaves us on to our final index, which is the IVF index. And this is super popular, and with good reasons. It is very good. So the inverted file index is based on essentially clustering data points. So we see here, we have all of these different data points, the little crosses.

And then we have these three other points, which are going to be our cluster centroids. So around each of our cluster centroids, we expand the catchment radius. And as you can see here, where each of those circles collides, it creates the edge of what are going to be, more or less, our catchment cells.

And this is called a Voronoi diagram, or it's a really hard word, Dirichlet tessellation. I don't know if that's correct, but I think it sounds pretty cool. So I thought I'd throw that in there. So we create these cells. In each one of those cells, any data point within those cells will be allocated to that given centroid.

And then when you perform a search, you pass your xq value in there. The xq value will be compared to every single cluster centroid, but not to the other values within that cluster or the other clusters, only the cluster centroids. And then from that, you find out which centroid is the closest to your query vector.

And then what we do is we restrict our search scope to only the data points within that cluster or that cell. And then we calculate the nearest vector. So at this point, we have all the vectors only within that cell, and we compare all of those to our query vector.

Now, there is one problem with this, which is called the edge problem. Now, we're just showing this in two-dimensional space. Obviously, in reality, for example with the dataset we're using, we have 128 dimensions. So the edge problem gets kind of complicated when you think about it in hundreds of dimensions.

But what this is: say with our query, we find our query vector is right on the edge of one of the cells. And if we set our nprobe value (I mentioned nprobe here: that's how many cells we search) to one, it means we're going to restrict our search to only that cell, even though there are closer points just outside it.

So this one for sure is closer to our query vector than any of the magenta data points, and possibly also this one and this one, and maybe even this one. But we're not going to consider any of those because we're restricting our search only to this cell. So we're only going to look at these data points and also these over here.

So that's the edge problem, but we can get around it by not just searching one cell, but by searching quite a few. So in this case, our nprobe value is eight, and that means we're going to search the eight nearest centroids, or centroid cells. And that's how IVF works.

Let's go ahead and implement that in code. So the first thing we need to do is set our nlist value, which is the number of centroids that we will have within our data. And then this time (so this is a little bit different) we need to set up the final vector search that we're going to do.

So this is split into two different operations: we're searching based on clusters, and then we're actually comparing the full vectors within the selected clusters. So we need to define how we're going to do that final search between our full vectors and our query vector. So what we do is use faiss.

So we do faiss.IndexFlatIP; you can use L2 as well. We set our dimensionality. So we're just initializing a flat index there. And then what we're going to do is feed that into our IVF index. So our IVF index is faiss.IndexIVFFlat, because we're using the flat vectors there.

We need to pass our quantizer (so that's this step here, the other step of the search process), the dimensionality, and also our nlist value, so how many cells or clusters we're going to have in there. And with this, because we're clustering data, we need to do something else. So let me show you.

So if we write index.is_trained, we get False. If we ran that on any of our other indexes, it would have been True, because they don't need to be trained; we're not doing clustering or any other form of training or optimization there. So what we need to do is train our index before we use it.

So we write index.train, and we just pass all of our vectors into that. It's very quick, so it's not really an issue. And then we do index.add and pass our data. And then, one thing I want to show you: we have our nprobe value. We'll search with one for now.

So we'll search one cell. And to search, we write D, I = index.search(xq, k), as we have every other time. So it's super fast, 3.32 milliseconds. I think that's maybe the fastest, other than our badly performing, low-quality HNSW search. So let's see how that's performed. So we write np.in1d(I, baseline).
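Pulling the IVF steps together, a sketch might look like this; the nlist value is an assumption, since the video doesn't state the one it uses, and the explicit metric argument is added here to keep it consistent with the IP quantizer:

```python
import numpy as np
import faiss

nlist = 128  # number of cells/centroids (assumed value)

# the quantizer handles the final, full-vector comparison step
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist,
                           faiss.METRIC_INNER_PRODUCT)

print(index.is_trained)  # False: the centroids haven't been learned yet
index.train(wb)          # cluster the data to find the nlist centroids
index.add(wb)

index.nprobe = 1            # search only the single nearest cell
D, I = index.search(xq, k)
print(np.in1d(I[0], baseline))
```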

You see, it's not too bad, to be fair; like 50/50 almost. So that's actually pretty good. But what we can do if we want it to be even better is increase the nprobe value. So let's go up to four. So that's increased the query time quite a bit.

So from like 3 to 125, which is now super slow, actually. But now we're getting perfect results. We can maybe decrease that to two. So now it's faster. That slow result could have been a one-off; occasionally you get a really slow search, it just happens sometimes. So we set nprobe to two: super fast and super accurate.
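That adjustment is just one attribute, for example:

```python
index.nprobe = 2            # search the two nearest cells instead of one
D, I = index.search(xq, k)  # near-baseline recall, still fast
```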

So that's a very good index as well. So these are the stats I got in terms of recall and search time in milliseconds for different nprobe values and different nlist values. So again, it's just about finding that balance. As for index size, the only things that affect it here are the size of your data and, obviously, the nlist value.

But you can increase the nlist value loads and the index size hardly increases. It's like growing by 100 kilobytes each time you double the nlist value, so it's like nothing. So that's it for this video. And we covered quite a lot, so I'm going to leave it there.

But I think all these indexes are super useful and quite interesting, and figuring them out is just a matter of playing around with them. Like you've seen, I've done loads with these graphs, just seeing what is faster, what is slower, where the good quality is. And just playing around with the parameters and seeing what you can get out of it is super useful for actually understanding these.

Now what I do want to do going forward is actually explore each one of these indexes in more depth. Because we've only covered them at a very, very high level at the moment. So in future videos, articles, we're going to go into more depth and explore them a lot more.

So that will be pretty interesting, I think. So that's it for this video. Thank you very much for watching. And I'll see you in the next one. Bye.