Faiss - Introduction to Similarity Search
Chapters
0:00 Introduction
3:21 Code Overview
6:15 Index Flat L2
8:10 Index Training
9:02 Adding Vectors
12:10 Query Time
13:00 Voronoi Cells
16:15 Coding
20:16 Index IVF
22:43 Product Quantization Index
25:54 Implementing Product Quantization Index
30:38 Comparison
We're going to be covering Facebook AI Similarity Search, or Faiss. 00:00:14.480 |
And we'll introduce a few of the key indexes that we can use. 00:00:24.760 |
As you can probably tell from the name, it's a similarity search 00:00:28.280 |
and it's a library that we can use from Facebook AI 00:00:46.000 |
on building sentence embeddings and comparing sentence embeddings, 00:00:50.520 |
in those videos, I just did a generic Python loop 00:00:58.760 |
Now, if you're only working with maybe 100 vectors, 00:01:04.080 |
But in reality, we're probably never going to be working 00:01:08.880 |
Facebook AI Similarity Search can scale to tens 00:01:13.040 |
or hundreds of thousands of vectors, up to millions and even billions. 00:01:26.400 |
But before we get into it, I'll just sort of visualize 00:01:36.960 |
all of the vectors that we have created and put into our 00:01:41.960 |
similarity search index could look like this. 00:01:50.000 |
But in reality, there would be hundreds of dimensions here. 00:01:55.840 |
In our use case, we're going to be using dimensions of 768. 00:02:15.200 |
So let's say here, this is our query vector, xq. 00:02:29.400 |
So we would calculate the distance between our query vector 00:02:33.920 |
and every other vector that is already in there 00:02:36.400 |
in order to find the vectors which are closest to it. 00:02:42.720 |
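As a rough sketch of that exhaustive comparison, the kind of generic loop mentioned earlier (the function and variable names here are just placeholders):

```python
import numpy as np

def brute_force_search(xq, vectors, k=4):
    # Compute the L2 distance from the query to every stored vector,
    # then return the indices of the k closest ones.
    distances = [np.linalg.norm(xq - v) for v in vectors]
    return np.argsort(distances)[:k]
```

It works, but it compares against every single vector, which is exactly the cost we want to avoid at scale.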
We can improve, we can decrease the number of dimensions 00:02:49.040 |
in each of our vectors and do it in an intelligent way 00:02:51.800 |
so they take up less space and the calculations are faster. 00:02:58.320 |
So in this case, rather than comparing every single item, 00:03:01.880 |
we might restrict our search to just this area here. 00:03:10.320 |
at a very high level, that we can do with Faiss. 00:03:13.240 |
So that's enough for the introduction to Faiss. 00:03:29.440 |
So I've gone ahead and processed them already 00:03:31.320 |
'cause they do take a little bit of time to actually build, 00:03:45.880 |
that have been separated by a newline character. 00:03:48.400 |
And then in here, we have all of those NumPy binary files. 00:03:58.480 |
That's where we're pulling them all in using this cell here. 00:04:08.920 |
and we append them all into a single NumPy array here. 00:04:14.160 |
And that gives us these 14.5 thousand samples. 00:04:18.520 |
Each embedding is a vector with 768 values inside. 00:04:39.760 |
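As a rough sketch of that loading cell (the `embeddings_*.npy` filename pattern is an assumption, not the actual filenames):

```python
import glob
import numpy as np

# Load each NumPy binary chunk and stack them into one array of shape (~14500, 768).
chunks = [np.load(path) for path in sorted(glob.glob('embeddings_*.npy'))]
sentence_embeddings = np.vstack(chunks)

print(sentence_embeddings.shape)
```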
And then we're just reading that in as a normal file. 00:04:43.680 |
And we just write lines = fp.read(). 00:04:48.680 |
And like I said, we're splitting that by newline characters. 00:05:23.640 |
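Something like this, with the filename as an assumption:

```python
# Read the raw text file and split it into one sentence per line.
with open('sentences.txt', 'r') as fp:
    lines = fp.read().split('\n')
```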
which is the library we're using to create those embeddings, 00:05:36.720 |
And then our model, we're using sentence transformer again. 00:05:40.040 |
And we're using the bert-base-nli-mean-tokens model. 00:05:52.720 |
we'll see in a moment, we just write model.encode, 00:05:56.680 |
and then we write something in here, hello world. 00:06:12.440 |
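A minimal version of that, using the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer

# The same model that produced the stored embeddings.
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Encoding a single sentence returns an array of shape (1, 768).
xq = model.encode(["hello world"])
```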
Now, I think we have everything we need to get started. 00:06:34.120 |
which means that all the vectors are just flat vectors. 00:06:44.200 |
that we're using to measure the similarity of each vector 00:07:03.520 |
So we need to import faiss. 00:07:03.520 |
And then we write index = faiss.IndexFlatL2. 00:07:07.120 |
And then in here, we need to pass the dimensionality 00:07:33.920 |
we put sentence_embeddings and we write .shape[1]. 00:08:14.360 |
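So the index creation looks like this:

```python
import faiss

d = sentence_embeddings.shape[1]  # 768, the dimensionality of our vectors
index = faiss.IndexFlatL2(d)
```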
And there is one thing that we need to be aware of. 00:08:22.920 |
So if the index is going to do any clustering, 00:08:26.040 |
we will need to train that clustering algorithm on our data. 00:08:32.520 |
if an index needs training or is trained already 00:08:47.960 |
We'll see, because it's not doing anything special, 00:08:52.920 |
And we can see that when we write index.is_trained, 00:08:57.440 |
it returns True, which just means that we don't actually need to train it. 00:09:01.320 |
Now, how do we add our vectors, our sentence embeddings? 00:09:20.840 |
we can check that they've been added properly 00:09:31.080 |
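A minimal sketch of adding the vectors and checking they made it in:

```python
# A flat index does no clustering, so it is already 'trained'.
print(index.is_trained)  # True

# Add all of the sentence embeddings, then confirm how many vectors are stored.
index.add(sentence_embeddings)
print(index.ntotal)  # ~14500
```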
And with that, we can go ahead and start querying. 00:09:42.000 |
And we want to do the model.encode that we did before. 00:09:45.320 |
Now, I'm going to write 'someone sprints with a football'. 00:10:25.720 |
we will return four index IDs into this I variable here. 00:10:48.520 |
So the text that we have up here, that will align. 00:10:52.080 |
So what we can do is we can print all of those out. 00:10:58.360 |
And then in here, we want to write lines[i] for i in I. 00:11:29.280 |
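Putting that together (k=4 gives us the four nearest neighbours):

```python
k = 4
xq = model.encode(["someone sprints with a football"])

# D holds the distances, I holds the index IDs of the k nearest vectors.
D, I = index.search(xq, k)

# The IDs align with the original text, so we can look the sentences back up.
print([lines[i] for i in I[0]])
```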
And we see, obviously, it seems to be working pretty well. 00:11:40.240 |
The only problem is that this takes a long time. 00:11:55.320 |
Okay, so before we move on to the next index, 00:12:00.560 |
I just want to have a look at the sort of speed 00:12:10.760 |
So if we go over here, I've already written all this code. 00:12:18.480 |
So come down here, we have this flat L2 index, 00:12:30.320 |
And this is the number of vectors within that index. 00:12:40.880 |
You can see, you know, it increases quite quickly. 00:12:44.560 |
Now this is in Faiss, but it's still an exhaustive search. 00:12:52.040 |
We're not using the approximate search capabilities 00:13:09.440 |
Now, the most popular of these uses a technique 00:13:13.280 |
very similar to something called Voronoi cells. 00:13:41.800 |
or the cells that you see are called Voronoi cells. 00:14:01.800 |
Now, as well as those, we also have our centroids. 00:14:08.960 |
And these are simply the centers of those cells. 00:14:45.520 |
this thing here, to every single one of those vectors, 00:14:58.440 |
is instead of checking against every one of those vectors, 00:15:03.800 |
And once we figure out which centroid is the closest, 00:15:18.560 |
So in this case, it would probably be this centroid here, 00:15:37.360 |
whereas the closest vector here is right there. 00:15:43.640 |
might actually be a better approximation or a better, 00:16:04.560 |
by the fact that this is just a lot, a lot faster. 00:16:11.040 |
It's whatever is going to work best for your use case. 00:16:20.760 |
is define how many of those cells we would like. 00:16:29.680 |
And then from there, we can set up our quantizer, 00:16:33.000 |
which is like another step in the process. 00:16:41.840 |
we are still going to be measuring the L2 distance. 00:16:44.560 |
So we still actually need that index in there. 00:16:48.200 |
So to do that, we need to write faiss.IndexFlatL2, 00:16:59.760 |
And like I said, that's just a step in the process. 00:17:14.640 |
So this is the one that is creating those partitions. 00:17:35.440 |
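As a sketch of that setup (the nlist value of 50 is just an example, tune it for your data):

```python
nlist = 50  # how many Voronoi cells / partitions to create

# The quantizer is the flat L2 index used to compare the query against cell centroids.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
```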
Okay, now, if you remember what I said before, 00:17:41.520 |
we, in some cases, we'll need to train our index. 00:17:46.120 |
Now, this is an example of one of those times. 00:17:58.320 |
Now, to train it, we need to just write index.train, 00:18:09.360 |
we want to pass all of our sentence embeddings. 00:18:23.800 |
So now our index is essentially ready to receive our data. 00:18:28.800 |
So we do this exactly the same way as we did before. 00:18:46.000 |
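So, roughly:

```python
# IVF indexes cluster the data, so they must be trained before vectors are added.
index.train(sentence_embeddings)
print(index.is_trained)  # True once training has run

index.add(sentence_embeddings)
print(index.ntotal)
```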
Okay, so now we see that we have our index, it's ready, 00:18:51.800 |
So what I'm going to do is use the exact same query vector 00:18:59.080 |
Going to time it so that we can see how quick this is 00:19:36.960 |
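One simple way to time it (a sketch; a notebook %time magic works just as well):

```python
import time

start = time.time()
D, I = index.search(xq, k)
print(f"search took {time.time() - start:.4f}s")
print([lines[i] for i in I[0]])
```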
So we'll see that the times do vary a little bit 00:19:41.920 |
quite randomly, but maybe that's a little bit slow, 00:20:20.520 |
but maybe if you find the results are not that great 00:20:34.200 |
but maybe we should approximate a little bit less 00:20:39.120 |
And we can do that by setting the nprobe value. 00:20:52.400 |
and we can see it will probably take slightly longer. 00:21:02.680 |
'cause there were no accuracy issues here anyway. 00:21:06.320 |
But let me just explain what that is actually doing. 00:21:26.320 |
based on the first nearest centroid to our query vector. 00:21:33.920 |
or let's use a smaller number in this example. 00:22:00.200 |
So the number of cells that we are going to search is four. 00:22:14.800 |
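Which is just:

```python
# Search the 4 nearest cells instead of only the single closest one.
index.nprobe = 4

D, I = index.search(xq, k)  # slightly slower, but less approximate
```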
so we might get a better performance, better accuracy. 00:22:34.080 |
In our case, we don't really need to increase this, 00:23:05.640 |
So it's probably better if I try and draw this out. 00:23:28.440 |
So we split this into several sub-vectors and then we take them out. 00:23:38.640 |
And this is just one vector that I'm visualizing here, 00:23:43.640 |
but we would obviously do this with many, many vectors. 00:23:56.360 |
Now, that means that we have a lot of these sub-vectors. 00:24:16.400 |
and each of those clusters is going to have a centroid. 00:24:27.840 |
is going to be run through its own clustering algorithm 00:24:42.200 |
And what we do is for each of these sub-vectors, 00:24:47.200 |
so each of these sub-vectors, they get pulled into here. 00:24:55.200 |
and it gets assigned to its nearest centroid. 00:25:02.660 |
all the way back over here and add it into our vector. 00:25:17.160 |
it's probably the wrong way to think about it. 00:25:38.320 |
the size of our vectors, but pretty significantly, 00:25:53.460 |
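To put rough numbers on that reduction (using the M and bits values we'll pick in a moment):

```python
d, m, bits = 768, 8, 8

original_bytes = d * 4            # 768 float32 values = 3072 bytes per vector
compressed_bytes = m * bits // 8  # 8 centroid IDs of 8 bits each = 8 bytes per vector

print(original_bytes, compressed_bytes)
```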
Now we need to define two new variables here. 00:25:58.440 |
So M, which is going to be the number of centroid IDs in our final compressed vectors, 00:26:04.000 |
So now one thing that we do need to know with M is that D must be divisible by it. 00:26:30.040 |
Now we should be able to divide that into eight, I think. 00:26:40.720 |
because if we do five, we see that D doesn't fit. 00:26:45.480 |
So five doesn't fit nicely into D, whereas eight does. 00:26:59.640 |
And that's because of the way that those vectors 00:27:04.440 |
are broken down into the final centroid ID vectors. 00:27:09.440 |
And we also need to specify the number of bits 00:27:19.680 |
And then we can set up our index and also the quantizer. 00:27:26.640 |
So the quantizer is going to be faiss.IndexFlatL2(d). 00:27:45.160 |
So it's not a flat vector anymore, which is the full vector. 00:27:52.080 |
So where we have reduced the size of it through this, 00:27:57.080 |
through the method I explained before, where we drew out. 00:28:00.800 |
Now we need to pass a few arguments into here. 00:28:10.960 |
So the quantizer, then D, which is our dimensionality, 00:28:34.400 |
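So the full setup, roughly (nlist here is assumed to be the same value we used before):

```python
nlist = 50  # number of Voronoi cells, as before
m = 8       # number of sub-vectors; d must be divisible by m
bits = 8    # bits used for each centroid ID

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
```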
You may have guessed that we might need to train this one. 00:28:40.840 |
So to train it, we'll just write index.train. 00:28:49.440 |
Okay, so it might take a little bit longer this time. 00:29:00.520 |
Should be a lot quicker, or a fair bit quicker. 00:29:03.040 |
It's hard to get much quicker than the last one. 00:29:05.480 |
So we're going to use the same code as before. 00:29:48.960 |
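Which is, roughly, the same pattern again:

```python
import time

# Train (this takes longer because of the product quantization step), then add.
index.train(sentence_embeddings)
index.add(sentence_embeddings)

start = time.time()
D, I = index.search(xq, k)
print(f"search took {time.time() - start:.4f}s")
print([lines[i] for i in I[0]])
```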
but these two at the front are now different. 00:30:02.640 |
Let's have a look at what we are pulling through. 00:30:14.240 |
I mean, although the accuracy has decreased technically, 00:30:23.840 |
So I mean, nonetheless, I think that is pretty cool. 00:30:31.880 |
let's compare this to our two previous methods 00:30:31.880 |
And then we have IVF flat with an nprobe value of 10, 00:30:48.200 |
And then we have flat L2 at the top, which obviously. 00:31:19.920 |
So I think obviously Faiss is pretty cool, 00:31:25.320 |
and I think we're definitely going to explore it more