Metadata Filtering for Vector Search + Latest Filter Tech
Chapters
0:00 Intro
0:24 Vector Search Recap
2:03 Why Filter?
2:56 Metadata Filtering 101
7:48 Pre-filtering
9:37 Post-filtering
11:30 Single-Stage Filtering
12:22 Vectors and Metadata Code
13:58 Connecting to Pinecone
14:55 Building Query Vector
16:47 Querying
21:37 First Filter
24:40 Adding More Conditions
27:03 Filtering with Numbers
30:55 Search Speed and Filtering
33:44 Outro
00:00:00.000 |
Hi, welcome to the video. We're going to be exploring 00:00:04.160 |
two of the common methods that we can use to filter indexes in Vector Similarity Search. 00:00:12.800 |
And then we're also going to explore Pinecone's new solution to filtering in Vector Search. 00:00:25.040 |
Now, in Vector Similarity Search, what we do is build representations of data, so that could be 00:00:35.440 |
text, images, or cooking recipes, and we convert them into vectors. We then store those vectors 00:00:45.200 |
in an index, and what we typically want to do is perform some kind of search or comparison of all 00:00:52.560 |
the vectors within that index. So, for example, if you found this video or article through Google or 00:01:01.200 |
YouTube, you will have typed something, some sort of query, into one of those search engines. So, 00:01:10.080 |
maybe something like, "How do I filter in Vector Similarity Search?" Then whichever search engine 00:01:19.120 |
you use most likely converted your query into a vector representation, and then it took that 00:01:28.880 |
vector representation and compared it to all of the other vectors in its vector index. So, 00:01:37.120 |
those could be pages, videos, and so on. It could be anything, really. And out of all of those 00:01:45.120 |
index vectors, it was this video or this article which seemed to be one of the most similar vectors 00:01:53.920 |
to your query vector. And so, you were served this video or this article near the top of your search, 00:02:00.560 |
you clicked on it, and now here we are. In search and recommender systems, 00:02:06.400 |
there's almost always a need to apply some sort of filter. On Google, we can search based on 00:02:13.120 |
categories such as news or shopping. We can search by date. We can search by language or 00:02:20.160 |
region. And likewise, Netflix, Amazon, Spotify might want to compare users in specific regions. 00:02:29.520 |
So, restricting the search scope to relevant vectors is, in many cases, an absolute necessity. 00:02:39.200 |
And despite the very clear need for filtering, there isn't a particularly good approach for 00:02:50.000 |
doing so. So, let's start having a look at the different types of filters available to us. 00:02:56.160 |
So, during the video, we'll be covering each one of those. We have pre-filtering, post-filtering, 00:03:02.960 |
and Pinecone's new single-stage filtering. Now, what I want to just do here is have a look at 00:03:11.040 |
what metadata filtering actually is. So, when we have a vector index, each vector is going to be 00:03:18.960 |
assigned some sort of metadata. And that can be anything. It could be a number, a date, text. 00:03:26.960 |
It can really be anything that we can use to filter our search. And what we want to do is 00:03:35.440 |
search where this or some condition is true. So, for example, say we have a big corporation, 00:03:43.920 |
they have all these different departments, and there are loads of internal documents within 00:03:51.200 |
that corporation. Some of those documents are assigned to the engineering department, some of 00:03:56.640 |
them are assigned to HR, and so on. A user in that company might want to go into their search, 00:04:04.800 |
and sometimes they may want to search across all departments, but sometimes they might want to 00:04:10.720 |
apply some sort of filter. So, they might want to say, "I want the top K documents where the 00:04:18.480 |
department is equal to engineering," or, "I want the top K where the department is not HR." 00:04:26.000 |
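To make that concrete, here's a rough sketch of what those two conditions might look like written down, using the MongoDB-style operators that Pinecone's metadata filters (which we use later in the video) are based on; the department field name is just the hypothetical one from this example.

```python
# Hypothetical metadata filter expressions for the corporate-documents example.
# The "department" field is illustrative; field names must match your metadata.

# "top K documents where the department is equal to engineering"
filter_engineering = {"department": {"$eq": "engineering"}}

# "top K documents where the department is not HR"
filter_not_hr = {"department": {"$ne": "hr"}}
```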
And we can apply anything in these metadata filters. So, we may want documents that are 00:04:35.600 |
quite recent. So, what we might say is we want the top K documents where the date is greater than or 00:04:43.920 |
equal to 14 days ago. And then we can sort of mix and match all of those different metadata filters 00:04:52.080 |
as well. Now, to implement a metadata filter, we need two things really. We need our vector index, 00:05:02.640 |
which is what you can see at the top here, and we also need our metadata index. Now, each of these 00:05:11.120 |
will be paired one by one. So, each vector will have its own metadata record, and what we would 00:05:19.520 |
do is we'd apply a condition to our metadata index, and that would remove a few of these. 00:05:28.480 |
So, we'd get rid of these, and then based on what we have removed here, 00:05:35.120 |
we would also remove those equivalent vectors from our vector index. And that's how our 00:05:42.720 |
filter would get applied. But there are different orders and different ways of doing this, as we 00:05:51.120 |
will take a look at now. So, the first one of those is we could have a look at a post filter. 00:06:01.280 |
So, a post filter is nothing more than we take our query vector, which is over here, 00:06:09.760 |
and we also take our metadata query over here. Now, we start by performing an approximate search 00:06:20.160 |
between our query vector and all of our index vectors over here, and we get the top K matches 00:06:28.320 |
there. So, let's say we wanted maybe 10 of those. And then what we do is we then add in our metadata 00:06:37.600 |
query. So, we add that in over here, and that creates a filter for us. So, then we filter those 00:06:46.640 |
remaining vectors through our new filter, and that leaves us with the filtered top K. Now, 00:06:55.680 |
in this case, the number of matches we get back is usually not going to be the top K we asked for. So, 00:07:00.960 |
say over here, we have 10 top K matches. Here, we filter some of those out, so we might end up with 00:07:08.000 |
four. And then we also have pre-filtering. Now, pre-filtering, we change the order a little bit. 00:07:14.000 |
We apply our filter before we do the search. So, we have our metadata query over here. 00:07:21.840 |
We use that to create a filter over here. We then apply that to our full vector index over here, 00:07:31.920 |
and that leaves us with so many of our vectors. And then what we do is we search based on that, 00:07:37.760 |
but because we are no longer searching through the full dataset, we can't do approximate 00:07:44.320 |
nearest neighbor search. So, we have to do an exhaustive search at this point. 00:07:48.640 |
Now, let's start with the pre-filter process. So, here, like we saw before, we start with our 00:07:55.680 |
metadata index. We apply our filter to this, identifying which positions satisfy our filter 00:08:03.440 |
condition, and then we use this filtered metadata index to filter out the vectors which do not 00:08:11.680 |
satisfy our condition. And as we kind of saw before, this is where the issue with pre-filtering 00:08:19.840 |
comes in. Because we have just filtered out some of the vectors or many of the vectors in our index, 00:08:26.480 |
we no longer have the same index as what we started with. And we need the full index to 00:08:33.600 |
apply our approximate search on. But as soon as we filter, we change the structure of that index. So, 00:08:41.440 |
we can no longer perform an approximate nearest neighbor search, which means we're just doing 00:08:46.960 |
a brute force, exhaustive k-nearest neighbor search. Now, if our index is very small or the 00:08:55.360 |
number of vectors that we have output after our filter is very small, this is probably okay. 00:09:04.160 |
But as soon as we start working with big data sets, this is not going to be very manageable. 00:09:11.440 |
And the only other alternative that we have here is to build an index for every possible 00:09:20.320 |
filter outcome, which is not really an option because it's just simply not realistic to build 00:09:28.400 |
that many indexes. So, pre-filtering, we have good accuracy, but it's very slow. 00:09:36.480 |
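To make the two orderings concrete, here's a minimal NumPy sketch of pre-filtering versus post-filtering over a toy index. This is purely illustrative, not how any real vector database implements its index; the brute-force scoring below simply stands in for the approximate search we'd normally use.

```python
import numpy as np

# toy index: 10,000 vectors, each paired with one metadata value
rng = np.random.default_rng(0)
index_vectors = rng.normal(size=(10_000, 64))
metadata = rng.choice(["engineering", "hr", "finance"], size=10_000)
xq = rng.normal(size=64)  # query vector

def scores_against(vectors, q):
    # exhaustive cosine similarity of q against every vector given
    return (vectors @ q) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))

def pre_filter_search(condition, top_k=10):
    # 1) apply the metadata condition first, 2) exhaustively score only the survivors
    keep = np.array([condition(m) for m in metadata])
    candidate_ids = np.where(keep)[0]
    order = np.argsort(-scores_against(index_vectors[candidate_ids], xq))[:top_k]
    return candidate_ids[order]  # always relevant matches, but brute-force (slow at scale)

def post_filter_search(condition, top_k=10):
    # 1) take top_k from the FULL index (an approximate search in a real system),
    # 2) then drop anything that fails the metadata condition
    top = np.argsort(-scores_against(index_vectors, xq))[:top_k]
    return [i for i in top if condition(metadata[i])]  # may return fewer than top_k

print(pre_filter_search(lambda m: m == "engineering"))
print(post_filter_search(lambda m: m == "engineering"))
```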
Now, post-filtering is, of course, slightly different. So, in this case, we start with our 00:09:44.560 |
vector index. Now, we can perform our approximate nearest neighbor search because we have the full 00:09:50.800 |
index. We haven't filtered anything yet. And that returns the top k vectors that we want. So, 00:09:57.840 |
say we want 10 vectors at this point. And then what we do is we find all the vectors 00:10:05.440 |
through our metadata index that satisfy whatever metadata condition we have set. 00:10:10.640 |
And then we apply the filter to those top k vectors, so 10 vectors maybe. And, of course, 00:10:18.560 |
at this point, what we are doing is we're reducing the number of vectors that we get out. So, we 00:10:23.520 |
don't actually get 10 vectors. We, for example, could get four vectors. And in the worst case 00:10:30.320 |
scenario, that filter could rule out all of the vectors that we've returned. And in the end, we 00:10:36.800 |
return nothing, even when in the index there could be some relevant vectors, which is obviously not 00:10:46.480 |
very ideal. We can try and eliminate this problem by just increasing k a lot. So, of course, if we 00:10:54.640 |
use a low k value, the chances of all of them being excluded when we apply our filter post 00:11:02.320 |
search is reasonably high. But if we increase k up to 1 million, it's much lower. But the only 00:11:11.200 |
problem with that is that our search becomes very slow. And the more we increase k to eliminate the 00:11:19.200 |
problem, the slower it gets. So, in this case, we have unreliable accuracy and performance. 00:11:26.160 |
But it is faster, unless we increase k. So, now let's introduce single-stage 00:11:34.800 |
filtering by Pinecone. Now, we are going to go through some code and test this. 00:11:40.400 |
But first, I just want to introduce what it is at a high level. So, it's a new filter built by 00:11:48.320 |
Pinecone. And at a high level, it works by merging the vector and metadata indexes. And it allows us 00:11:56.960 |
to filter and then do an approximate nearest neighbor search. So, what we get there is the 00:12:04.960 |
accuracy of pre-filtering. And at the same time, the search speed is often faster than post-filtering 00:12:12.640 |
as well. So, we really do get the best of both with this new filter. But let's go and actually 00:12:20.800 |
try it out. Okay. So, we're going to be using Pinecone here. And all I've done here is imported 00:12:29.120 |
Pinecone, imported json, and I've imported my data here. So, this data, I've already uploaded or 00:12:38.240 |
upserted to my Pinecone index. And what it is, is just the SQuAD dataset in both English and 00:12:49.040 |
Italian. Now, in there, we have a few different items. So, we have the record ID, we have the 00:13:00.560 |
text, although I've just stored this locally. We have the vector, which has been upserted to Pinecone. 00:13:08.560 |
And then we also have the metadata, which is with Pinecone as well. Now, if we take a look at what 00:13:13.840 |
we have in the metadata, we see that we have the language. So, we have either English or Italian. 00:13:26.000 |
And then we also have the topic. Now, what I want to do is just test the new filtering. So, 00:13:34.880 |
we're going to be filtering based on language, topic. And we also have another metadata item here, 00:13:43.440 |
which I don't have locally, which is just a randomly generated date. So, we can have a look 00:13:49.920 |
at using some of the greater than or equals to, less than equals to filters that we can use in 00:13:57.440 |
Pinecone. So, the first thing I'm going to do is initialize my connection to Pinecone. So, 00:14:06.240 |
I write pinecone.init. I need to pass my API key, which I've loaded above, 00:14:13.520 |
and also the environment that I am working in. Now, of course, this will be different and depend 00:14:24.000 |
on which environment you are using. So, I've initialized there, and what I can do is I can now 00:14:34.000 |
create a direct connection to a specific index within my Pinecone environment. Now, 00:14:40.320 |
I'm going to be connecting to one that I've already made, which is called squad test. 00:14:47.200 |
And now what I'll be able to do is use this index object to perform my queries. 00:14:54.800 |
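Roughly, that connection code looks like this; it follows the older pinecone-client style shown in the video, and the API key, environment, and index name ("squad-test") are placeholders you'd swap for your own values.

```python
import pinecone

# initialize the connection to Pinecone (key and environment are placeholders)
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENVIRONMENT",  # whichever environment your project runs in
)

# connect to the existing index; assuming it was created as "squad-test"
index = pinecone.Index("squad-test")
```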
So, we're going to be performing a vector search here. So, what we need first is a query vector 00:15:03.120 |
to perform our search with. Now, I use the sentence transformers library to encode the 00:15:11.920 |
already indexed vectors. So, what we're going to do is use the same model to encode our query vector. 00:15:21.760 |
So, we write SentenceTransformer to load our embedding model. 00:15:30.240 |
And I use the stsb-xlm-r-multilingual model. So, I will need to download this. 00:15:53.120 |
And then what we want to do is create our query vector. So, I'm going to assign it to xq, 00:16:04.000 |
which is model.encode, and then I pass in the query that I would like to perform. So, in this case, 00:16:17.760 |
I'm going to search for context in our dataset, which mentions something along the lines of early 00:16:27.600 |
engineering courses provided by American universities in the 1870s. So, I will execute that. 00:16:35.360 |
And note that we're using a multilingual model here. So, we should find that we will return 00:16:44.960 |
both English and Italian results, but both of them should be something similar to this topic. 00:16:53.040 |
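Here's a sketch of building that query vector with the sentence-transformers library; it assumes the library is installed and uses the model named above.

```python
from sentence_transformers import SentenceTransformer

# load the same multilingual model that was used to encode the indexed passages
model = SentenceTransformer("stsb-xlm-r-multilingual")

# encode the natural-language query into a dense vector
query = ("early engineering courses provided by American universities "
         "in the 1870s")
xq = model.encode(query)
```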
So, we will return our results. We just write index.query, 00:16:59.600 |
and what we can also do before we even do that is we'll just convert xq into the format that we need. 00:17:12.000 |
So, like this. And then in here, we just pass xq, and I'll say that I want my top k value to be 00:17:20.960 |
three. Now, remember, if we were using post-filtering here, if we set top k value of 00:17:30.080 |
three, we would probably return less than three. So, with post-filtering, we would want to set 00:17:35.680 |
something stupidly high here, just to get maybe three samples, if we're lucky. 00:17:42.000 |
But, as we're using single-stage filtering, we only need to set top k equal to three. 00:17:48.720 |
So, we'll execute that, and let's return the results. 00:17:54.320 |
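Something along these lines; this sketch follows the older client's query format, where the vector goes in as a plain Python list and the response comes back under a results key, so the exact call may differ slightly on newer client versions.

```python
# convert the NumPy query vector into a plain list and query the index
xq_list = xq.tolist()

results = index.query(queries=[xq_list], top_k=3)
print(results)
```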
And you see here that we get our IDs. So, we get this ID, one, two, and three. 00:18:04.240 |
So, what we now want to do is we want to map that back to the data that we have stored locally. 00:18:09.360 |
To do that, we're going to write ids equals i's ID for each i in the results. So, we're just 00:18:19.600 |
getting these IDs here. So, we go into results. We need to enter into the results key. 00:18:30.080 |
We want to access the first position in that list, and then, in there, we want to access the matches. 00:18:38.240 |
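As a sketch, that ID extraction looks something like this, assuming the response layout just described (a results list with one entry per query, each holding its matches).

```python
# pull the match IDs out of the response
ids = [match["id"] for match in results["results"][0]["matches"]]
print(ids)
```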
And then, from there, we'll print IDs, see what we get. Okay, we get those three 00:18:48.000 |
IDs. And now, what we want to do is use the data that I imported up here. 00:18:58.240 |
Just here. And we're going to use that to print out whatever it is that these IDs are referring 00:19:07.440 |
to. Now, what we have in our data at the moment is a big list, which is not that useful. So, 00:19:15.120 |
it would be more useful if we just reformat that into a dictionary. So, I'll do that quickly. We're 00:19:22.160 |
just going to write get_sample, which is equal to a dictionary keyed by x's ID. And then, in here, we will store our context 00:19:34.800 |
and metadata. So, context and metadata. We don't need to store the vector in here, 00:19:48.960 |
because we can't read that anyway. So, it's not that useful for us. Let me say, for x in data. 00:19:58.880 |
Okay, so it's not context. Let me come up here and have a look at what we have. 00:20:12.960 |
I think it's text. Yeah, okay. So, let's change that to text here and here. 00:20:19.360 |
Okay, so now we can do, for i in ids, I want to get the sample. So, get_sample[i]. 00:20:35.680 |
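Here's a sketch of that reshaping step and the lookup loop; it assumes each local record carries id, text, and metadata keys as described above.

```python
# reshape the local list of records into a dictionary keyed by record ID,
# keeping only the text and metadata (the raw vector isn't human-readable)
get_sample = {
    x["id"]: {"text": x["text"], "metadata": x["metadata"]}
    for x in data
}

# look up and print whatever each returned ID refers to
for i in ids:
    print(get_sample[i])
```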
And we'll just print that out. So, we see here that the first one we get is Italian. 00:20:46.480 |
And the translation for this is something to do with the College of Engineering 00:20:53.840 |
was instituted in 1920. So, we have college, engineering, that's good. 00:21:00.160 |
And then we also have something along the lines of the College of Science from the 1870s. So, 00:21:07.840 |
generally, this looks pretty relevant, I think. And then, down here, we have, this is Italian 00:21:16.640 |
again, but we also have the English translation of this here as well. So, we can see straight away, 00:21:23.920 |
School of Engineering, Public Engineering School, founded 1891, and it offered engineering degrees 00:21:31.520 |
as early as 1873. So, that's, again, pretty relevant. Now, I don't understand Italian. 00:21:41.120 |
And so, my first filter here would probably be, okay, I only want to return the English results. 00:21:49.440 |
So, let's go ahead and do that. So, I'm going to say results equals, and let's just copy what we 00:21:56.640 |
had up here. So, we're not repeating ourselves. We just want to take the index query. We can 00:22:04.240 |
include result. No, we don't need it. Let's just take that. And all we need to do now is add our 00:22:15.920 |
filter. So, we just write filter. And then, in here, we want to write. So, we have our metadata, 00:22:23.120 |
and we have our language. And we want to say that this must be equal to, so we use $eq with en, 00:22:31.360 |
which is English. So, we get our results, and we're going to want to do the same thing again. 00:22:40.800 |
So, we want to get our IDs, and we want to print those out. 00:22:50.080 |
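Put together, the filtered query is a sketch like this, assuming the metadata field is called language and the values are stored as short codes like en.

```python
# same query, but restricted to records whose language metadata equals "en"
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={"language": {"$eq": "en"}},
)

ids = [match["id"] for match in results["results"][0]["matches"]]
for i in ids:
    print(get_sample[i])
```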
And there we go. So, now, we're just getting English results. Now, that was pretty fast. So, 00:22:59.200 |
I think what is quite useful is to see how fast those two searches were. 00:23:06.640 |
Now, obviously, we're getting pretty relevant results here, where, again, we're returning three 00:23:11.440 |
results, even with our filter applied. So, that's good. So, it seems like we're getting the accuracy 00:23:19.360 |
of pre-filtering here. And let's have a look at the speed difference between the two approaches. 00:23:27.840 |
Now, we shouldn't see anything particularly major, because this is a very small index. We only have, 00:23:33.920 |
I think, 40,000 vectors here. So, we won't see anything significant. 00:23:40.000 |
But at least we can check that we're not getting anything slow. 00:23:46.080 |
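One simple way to compare them, assuming we're working in a Jupyter notebook, is the %timeit magic:

```python
# time the unfiltered query
%timeit index.query(queries=[xq_list], top_k=3)

# time the same query with the language filter applied
%timeit index.query(queries=[xq_list], top_k=3, filter={"language": {"$eq": "en"}})
```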
So, let's have a look. And you see here that we're actually getting a slightly faster response 00:24:02.640 |
when we filter. And this is typical with Pinecone single-stage filtering. When we 00:24:09.760 |
add a filter, usually, we'll actually get faster results, which is pretty insane. So, 00:24:16.080 |
not only are we getting good speed, like post-filtering, but we're actually 00:24:21.600 |
making our search faster by adding a filter, which is something neither post-filtering nor pre-filtering can do 00:24:30.560 |
that. And again, at the same time, we're still getting that accuracy of pre-filtering. So, 00:24:35.040 |
this is, in my opinion, pretty impressive. Now, we might also want to add another filter. So, 00:24:46.480 |
at the moment, we're just adding one filter, which is fine. It works. But let's say I look 00:24:54.240 |
at my results. And I know this is hard to read. But in here, we have the topic. We have University 00:25:00.400 |
of Kansas. OK, fine. Maybe I'm not interested in the University of Kansas. So, how about here? 00:25:06.560 |
We have University of Notre Dame. Let's say I'm not even interested in these guys either. 00:25:10.000 |
Institute of Technology, let's say, OK, yeah, we can keep them. That's fine. 00:25:16.240 |
So, I want to say, OK, I want everything that is one, in English, and two, 00:25:22.880 |
not from the University of Kansas and not from the University of Notre Dame. So, to do that, 00:25:30.880 |
I need to add another condition to my filter. So, to do that, all I need to do is say topic 00:25:41.520 |
is not in this time. So, we're going to say $nin, not in. And then we pass a list here. So, 00:25:48.320 |
a list of what we don't want to see, which was the University of Notre Dame 00:25:57.120 |
and also the University of Kansas. OK, so let's add those two. And let's see what we get. 00:26:10.480 |
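As a sketch, the extended filter looks like this; note that the field name here follows the narration (topic), and as we find out a moment later, in this particular index it is actually stored as title.

```python
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={
        "language": {"$eq": "en"},
        # exclude these two; the field name must match what was upserted
        "topic": {"$nin": ["University of Notre Dame", "University of Kansas"]},
    },
)
```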
So, again, it seemed pretty fast. And we're getting University of Kansas here. So, 00:26:15.760 |
that must mean that I have written something wrong. So, I think here, the topic filter 00:26:26.240 |
in my Pinecone index is actually maybe called title. Let's see. And this is also wrong. So, 00:26:35.440 |
let's correct that. OK, so now we're getting something different. So, yes, this should be 00:26:45.440 |
title in reality. So, Institute of Technology. Institute of Technology. And where is our other 00:26:53.440 |
one? Institute of Technology here. Now, we're not returning University of Kansas and we're 00:26:58.560 |
not returning University of Notre Dame, which is what we wanted. Now, there was also the 00:27:05.600 |
date filter that I wanted to show you as well. So, we don't only need to filter based on strings, 00:27:12.240 |
we can also filter based on numeric datetimes. And for me to show you this, I think it's best 00:27:20.640 |
if we ... It would also be better if we include our metadata here as well. So, 00:27:29.760 |
we can just see it directly from our results. So, we know that we're returning relevant text. 00:27:36.000 |
So, now let's just have a look at the metadata. So, I'm going to include metadata there. 00:27:47.200 |
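Returning the metadata with each match is just an extra flag on the query; here's a sketch, using the corrected title field.

```python
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={
        "language": {"$eq": "en"},
        "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
    },
    include_metadata=True,  # return each match's metadata alongside its ID and score
)
```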
Let's just see what we get. So, we see now that we actually include the metadata in our results, 00:27:55.280 |
which is also pretty cool. Now, we have a date, which is just a numeric value here. It's just 00:28:02.960 |
something very simple. It's just randomly generated. There's no actual relation between the date and 00:28:08.960 |
this record. It's completely random. And we can see, okay, we have a date from 2016, 2008, 00:28:18.000 |
and we also have 2020 here as well. Now, the first thing that I might want to do is say, 00:28:24.960 |
okay, I want to return only the more recent dates. So, let's say, okay, we keep all of that, 00:28:33.440 |
all the other filters in there. And we might say, okay, but we also want date 00:28:41.040 |
to be greater than or equal to, let's say, what do we have here? We have 00:28:54.960 |
what is the most recent. So, we have this 2021. Let's say we want to go for ones that are, 00:29:01.280 |
let's say, 2018 onwards for now. So, 20180101. Okay. So, the very first day of 2018. 00:29:13.120 |
Let's search and see what we get. And we can see, yep, it's definitely filtering correctly there. 00:29:22.800 |
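Here's a sketch of the query with the date condition added, assuming the dates were stored as plain YYYYMMDD-style integers as described.

```python
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={
        "language": {"$eq": "en"},
        "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
        "date": {"$gte": 20180101},  # the very first day of 2018, as a plain number
    },
    include_metadata=True,
)
```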
Now, let's have a look at what the search time is for that. 00:29:25.680 |
So, adding quite a few filter conditions here. So, let's just see what we get. 00:29:40.800 |
And you see, it's actually slightly faster again, which is, again, it's pretty cool. 00:29:51.520 |
But like I said, it's a small data set. When we do this on bigger data sets, the 00:29:57.760 |
difference can be huge. Now, what we can also do is we can actually add another condition 00:30:05.520 |
within our date here. So, we can say, okay, we want it to be greater than or equals to 2018. 00:30:12.400 |
But let's say we want to search for records only in 2018. So, we might also say, okay, 00:30:19.920 |
we want it to be greater than 2018, or the first day of 2018. We also want it to be less than or 00:30:26.400 |
equal to the very last day in 2018. So, 20181231. And we will filter. And we see that now 00:30:39.920 |
we're only returning records from 2018. So, again, super cool, and I think an incredibly 00:30:49.600 |
useful functionality for vector similarity search. Now, we were just using a very small data set 00:30:59.440 |
there. So, I couldn't really show you how impressive the speedup can be when we're applying 00:31:06.960 |
filters. But I do have this other index. Now, I'm not going to go through just coding everything, 00:31:12.720 |
because it's pretty straightforward. We have an index here, which is 1.2 million vectors, 00:31:21.680 |
and it has a single metadata field in there, which I've called Tag1. And that's just a 00:31:27.840 |
randomly generated number or integer from 0 to 100. So, we, of course, initialize the connection 00:31:36.560 |
to our index in the first cell up here. And then over here, I'm just creating a random query vector. 00:31:47.280 |
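The speed test itself is nothing more than timing the same query under progressively tighter filters; here's a rough notebook-style sketch, where the index name and vector dimension are placeholders and Tag1 is the random 0-100 integer described above.

```python
import numpy as np

big_index = pinecone.Index("filter-speed-test")  # placeholder name for the 1.2M-vector index
xq = np.random.rand(512).tolist()                # random query vector; dimension is a placeholder

# unfiltered search first, then progressively more selective filters on Tag1
%timeit big_index.query(queries=[xq], top_k=10)
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$gt": 30}})
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$gt": 70}})
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$gt": 90}})
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$eq": 50}})  # roughly 1% of the index
```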
So, first, this here is our unfiltered search. So, we get this 79.2 milliseconds. Now, again, 00:31:57.840 |
most of this is network latency, rather than the search time in the actual index. 00:32:05.360 |
But we will see the search time decrease pretty dramatically here. So, first, we'll say, okay, 00:32:11.600 |
we want Tag1 to be greater than 30. So, we're going from 0 to 100. So, we're roughly removing 00:32:19.760 |
probably about 30% of the vectors from our search. And we can see, okay, we just shaved off 00:32:26.400 |
8 milliseconds, which is impressive. And then we take that even further. So, we say, okay, 00:32:33.840 |
we want it greater than 70. So, now, we're shaving off around 70% of our vectors. 00:32:38.880 |
And our search time goes down to 56.6 milliseconds. Do it even further. So, about 90% here, 00:32:47.840 |
we go down to 54 milliseconds. And then here, I'm using the equals sign here. So, 00:32:53.440 |
I'm only searching for about 1% of the index. And it goes down to 51.6 milliseconds. So, 00:33:02.720 |
incredibly impressive speed up there. And this is kind of what it looks like. 00:33:08.720 |
So, we have the Tag1 $gt value, or greater-than value, on the left. And as we increase that 00:33:18.080 |
up this way, our time, our search time in milliseconds goes down. 00:33:26.000 |
Now, it is a little bit bumpy. It goes up and down a lot. I've tried to showcase that 00:33:33.120 |
in this graph. But the trend is quite clearly downwards. So, the more we filter, the faster 00:33:42.080 |
our search, which is incredible. Now, that's it for this video covering pre-filtering, 00:33:50.080 |
post-filtering, and Pinecone's new single stage filtering. I hope this has been useful and 00:33:57.680 |
insightful. If you are interested in testing Pinecone out yourself, there is a link to Pinecone's 00:34:05.840 |
website in the description. But we'll leave it there for now. Thank you very much for watching.