
Metadata Filtering for Vector Search + Latest Filter Tech


Chapters

0:00 Intro
0:24 Vector Search Recap
2:03 Why Filter?
2:56 Metadata Filtering 101
7:48 Pre-filtering
9:37 Post-filtering
11:30 Single-Stage Filtering
12:22 Vectors and Metadata Code
13:58 Connecting to Pinecone
14:55 Building Query Vector
16:47 Querying
21:37 First Filter
24:40 Adding More Conditions
27:03 Filtering with Numbers
30:55 Search Speed and Filtering
33:44 Outro

Whisper Transcript

00:00:00.000 | Hi, welcome to the video. We're going to be exploring
00:00:04.160 | two of the common methods that we can use to filter indexes in Vector Similarity Search.
00:00:12.800 | And then we're also going to explore Pinecone's new solution to filtering in Vector Search.
00:00:25.040 | Now, in Vector Similarity Search, what we do is build representations of data, so that could be
00:00:35.440 | text, images, or cooking recipes, and we convert them into vectors. We then store those vectors
00:00:45.200 | in an index, and what we typically want to do is perform some kind of search or comparison of all
00:00:52.560 | the vectors within that index. So, for example, if you found this video or article through Google or
00:01:01.200 | YouTube, you will have typed something, some sort of query, into one of those search engines. So,
00:01:10.080 | maybe something like, "How do I filter in Vector Similarity Search?" Then whichever search engine
00:01:19.120 | you use most likely converted your query into a vector representation, and then it took that
00:01:28.880 | vector representation and compared it to all of the other vectors in its vector index. So,
00:01:37.120 | those could be pages, videos, and so on. It could be anything, really. And out of all of those
00:01:45.120 | index vectors, it was this video or this article which seemed to be one of the most similar vectors
00:01:53.920 | to your query vector. And so, you were served this video or this article near the top of your search,
00:02:00.560 | you clicked on it, and now here we are. In search and recommender systems,
00:02:06.400 | there's almost always a need to apply some sort of filter. On Google, we can search based on
00:02:13.120 | categories such as news or shopping. We can search by date. We can search by language or
00:02:20.160 | region. And likewise, Netflix, Amazon, Spotify might want to compare users in specific regions.
00:02:29.520 | So, restricting the search scope to relevant vectors is, in many cases, an absolute necessity.
00:02:39.200 | And despite the very clear need for filtering, there traditionally hasn't been a particularly good approach for
00:02:50.000 | doing so. So, let's start having a look at the different types of filters available to us.
00:02:56.160 | So, during the video, we'll be covering each one of those. We have pre-filtering, post-filtering,
00:03:02.960 | and Pinecone's new single-stage filtering. Now, what I want to just do here is have a look at
00:03:11.040 | what metadata filtering actually is. So, when we have a vector index, each vector is going to be
00:03:18.960 | assigned some sort of metadata. And that can be anything. It could be a number, a date, text.
00:03:26.960 | It can really be anything that we can use to filter our search. And what we want to do is
00:03:35.440 | search where some condition on that metadata is true. So, for example, say we have a big corporation,
00:03:43.920 | they have all these different departments, and there are loads of internal documents within
00:03:51.200 | that corporation. Some of those documents are assigned to the engineering department, some of
00:03:56.640 | them are assigned to HR, and so on. A user in that company might want to go into their search,
00:04:04.800 | and sometimes they may want to search across all departments, but sometimes they might want to
00:04:10.720 | apply some sort of filter. So, they might want to say, "I want the top K documents where the
00:04:18.480 | department is equal to engineering," or, "I want the top K where the department is not HR."
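(As a rough sketch, conditions like these are typically written as small dictionaries of field and operator pairs. The operator syntax below is the Pinecone-style syntax used later in the video; the field names and the "last 14 days" example mentioned next are just illustrations.)

```python
import time

# "top K documents where department equals engineering"
eng_filter = {"department": {"$eq": "engineering"}}

# "top K documents where department is not HR"
not_hr_filter = {"department": {"$ne": "hr"}}

# "top K documents from the last 14 days", assuming dates are stored
# as numeric Unix timestamps in the metadata
recent_filter = {"date": {"$gte": time.time() - 14 * 24 * 60 * 60}}
```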
00:04:26.000 | And we can apply anything in these metadata filters. So, we may want documents that are
00:04:35.600 | quite recent. So, what we might say is we want the top K documents where the date is greater than or
00:04:43.920 | equal to 14 days ago. And then we can sort of mix and match all of those different metadata filters
00:04:52.080 | as well. Now, to implement a metadata filter, we need two things really. We need our vector index,
00:05:02.640 | which is what you can see at the top here, and we also need our metadata index. Now, each of these
00:05:11.120 | will be paired one by one. So, each vector will have its own metadata record, and what we would
00:05:19.520 | do is we'd apply a condition to our metadata index, and that would remove a few of these.
00:05:28.480 | So, we'd get rid of these, and then based on what we have removed here,
00:05:35.120 | we would also remove those equivalent vectors from our vector index. And that's how our
00:05:42.720 | filter would get applied. But there are different orders and different ways of doing this, as we
00:05:51.120 | will take a look at now. So, the first one of those is we could have a look at a post filter.
00:06:01.280 | So, a post filter is nothing more than we take our query vector, which is over here,
00:06:09.760 | and we also take our metadata query over here. Now, we start by performing an approximate search
00:06:20.160 | between our query vector and all of our index vectors over here, and we get the top K matches
00:06:28.320 | there. So, let's say we wanted maybe 10 of those. And then what we do is we then add in our metadata
00:06:37.600 | query. So, we add that in over here, and that creates a filter for us. So, then we filter those
00:06:46.640 | remaining vectors through our new filter, and that leaves us with the filtered top K. Now,
00:06:55.680 | in this case, the number of results is usually not going to be the top K number we asked for. So,
00:07:00.960 | say over here, we have 10 top K matches. Here, we filter some of those out, so we might end up with
00:07:08.000 | four. And then we also have pre-filtering. Now, pre-filtering, we change the order a little bit.
00:07:14.000 | We apply our filter before we do the search. So, we have our metadata query over here.
00:07:21.840 | We use that to create a filter over here. We then apply that to our full vector index over here,
00:07:31.920 | and that leaves us with so many of our vectors. And then what we do is we search based on that,
00:07:37.760 | but because we are not searching through the full data set, we can't do approximate
00:07:44.320 | nearest neighbor search. So, we have to do an exhaustive search at this point.
00:07:48.640 | Now, let's start with the pre-filter process. So, here, like we saw before, we start with our
00:07:55.680 | metadata index. We apply our filter to this, identifying which positions satisfy our filter
00:08:03.440 | condition, and then we use this filtered metadata index to filter out the vectors which do not
00:08:11.680 | satisfy our condition. And as we kind of saw before, this is where the issue with pre-filtering
00:08:19.840 | comes in. Because we have just filtered out some of the vectors or many of the vectors in our index,
00:08:26.480 | we no longer have the same index as what we started with. And we need the full index to
00:08:33.600 | apply our approximate search on. But as soon as we filter, we change the structure of that index. So,
00:08:41.440 | we can no longer perform an approximate nearest neighbor search, which means we're just doing
00:08:46.960 | a brute force, exhaustive k-nearest neighbor search. Now, if our index is very small or the
00:08:55.360 | number of vectors that we have output after our filter is very small, this is probably okay.
00:09:04.160 | But as soon as we start working with big data sets, this is not going to be very manageable.
00:09:11.440 | And the only other alternative that we have here is to build an index for every possible
00:09:20.320 | filter outcome, which is not really an option because it's just simply not realistic to build
00:09:28.400 | that many indexes. So, pre-filtering, we have good accuracy, but it's very slow.
00:09:36.480 | Now, post-filtering is, of course, slightly different. So, in this case, we start with our
00:09:44.560 | vector index. Now, we can perform our approximate nearest neighbor search because we have the full
00:09:50.800 | index. We haven't filtered anything yet. And that returns the top k vectors that we want. So,
00:09:57.840 | say we want 10 vectors at this point. And then what we do is we find all the vectors
00:10:05.440 | through our metadata index that satisfy whatever metadata condition we have set.
00:10:10.640 | And then we apply the filter to those top k vectors, so 10 vectors maybe. And, of course,
00:10:18.560 | at this point, what we are doing is we're reducing the number of vectors that we get out. So, we
00:10:23.520 | don't actually get 10 vectors. We, for example, could get four vectors. And in the worst case
00:10:30.320 | scenario, that filter could rule out all of the vectors that we've returned. And in the end, we
00:10:36.800 | return nothing, even when in the index there could be some relevant vectors, which is obviously not
00:10:46.480 | very ideal. We can try and eliminate this problem by just increasing k a lot. So, of course, if we
00:10:54.640 | use a low k value, the chances of all of them being excluded when we apply our filter post
00:11:02.320 | search is reasonably high. But if we increase k up to 1 million, it's much lower. But the only
00:11:11.200 | problem with that is that our search becomes very slow. And the more we increase k to eliminate the
00:11:19.200 | problem, the slower it gets. So, in this case, we have unreliable accuracy or performance.
00:11:26.160 | But it is faster, unless we increase k. So, now let's introduce single-stage
00:11:34.800 | filtering by Pinecone. Now, we are going to go through some code and test this.
00:11:40.400 | But first, I just want to introduce what it is at a high level. So, it's a new filter built by
00:11:48.320 | Pinecone. And at a high level, it works by merging the vector and metadata indexes. And it allows us
00:11:56.960 | to filter and then do an approximate nearest neighbor search. So, what we get there is the
00:12:04.960 | accuracy of pre-filtering. And at the same time, the search speed is often faster than post-filtering
00:12:12.640 | as well. So, we really do get the best of both with this new filter. But let's go and actually
00:12:20.800 | try it out. Okay. So, we're going to be using Pinecone here. And all I've done here is imported
00:12:29.120 | Pinecone, imported JSON, and I've imported my data here. So, this data, I've already uploaded or
00:12:38.240 | upserted to my Pinecone index. And what it is, is just the SQuAD dataset in both English and
00:12:49.040 | Italian. Now, in there, we have a few different items. So, we have the record ID, we have the
00:13:00.560 | text, although I've just stored this locally. We have the vector, which has been upserted to Pinecone.
00:13:08.560 | And then we also have the metadata, which is with Pinecone as well. Now, if we take a look at what
00:13:13.840 | we have in the metadata, we see that we have the language. So, we have either English or Italian.
00:13:26.000 | And then we also have the topic. Now, what I want to do is just test the new filtering. So,
00:13:34.880 | we're going to be filtering based on language, topic. And we also have another metadata item here,
00:13:43.440 | which I don't have locally, which is just a randomly generated date. So, we can have a look
00:13:49.920 | at using some of the greater than or equals to, less than equals to filters that we can use in
00:13:57.440 | Pinecone. So, the first thing I'm going to do is initialize my connection to Pinecone. So,
00:14:06.240 | I write pinecone.init. I need to pass my API key, which I've loaded above,
00:14:13.520 | and also the environment that I am working in. Now, of course, this will be different and depend
00:14:24.000 | on which environment you are using. So, I've initialized there, and what I can do is I can now
00:14:34.000 | create a direct connection to a specific index within my Pinecone environment. Now,
00:14:40.320 | I'm going to be connecting to one that I've already made, which is called squad test.
00:14:47.200 | And now what I'll be able to do is use this index object to perform my queries.
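(A minimal sketch of those two steps, assuming the pinecone-client version used in the video, an API key already loaded into API_KEY, and an index named "squad-test"; the environment string will differ per account.)

```python
import pinecone

# initialize the connection to Pinecone (environment depends on your account)
pinecone.init(api_key=API_KEY, environment="us-west1-gcp")

# connect directly to the existing index
index = pinecone.Index("squad-test")
```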
00:14:54.800 | So, we're going to be performing a vector search here. So, what we need first is a query vector
00:15:03.120 | to perform our search with. Now, I use the sentence transformers library to encode the
00:15:11.920 | already indexed vectors. So, what we're going to do is use the same model to encode our query vector.
00:15:21.760 | So, I write SentenceTransformer, and that embedding model is a SentenceTransformer.
00:15:30.240 | And I use the stsb-xlm-r-multilingual model. So, I will need to download this.
00:15:50.480 | Okay, so that is downloading now.
00:15:53.120 | And then what we want to do is create our query vector. So, I'm going to assign it to xq,
00:16:00.960 | and all I need to do is write
00:16:04.000 | embedder.encode, and then I pass in the query that I would like to perform. So, in this case,
00:16:17.760 | I'm going to search for context in our dataset, which mentions something along the lines of early
00:16:27.600 | engineering courses provided by American universities in the 1870s. So, I will execute that.
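(Roughly, building the query vector looks like this, assuming the stsb-xlm-r-multilingual model from the sentence-transformers library.)

```python
from sentence_transformers import SentenceTransformer

# the same multilingual model used to embed the indexed passages
embedder = SentenceTransformer("stsb-xlm-r-multilingual")

# encode the natural-language query into a dense query vector
xq = embedder.encode(
    "early engineering courses provided by American universities in the 1870s"
)
```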
00:16:35.360 | And note that we're using a multilingual model here. So, we should find that we will return
00:16:44.960 | both English and Italian results, but both of them should be something similar to this topic.
00:16:53.040 | So, we will return our results. We just write index.query,
00:16:59.600 | and what we can also do before we even do that is we'll just convert xq into the format that we need.
00:17:12.000 | So, like this. And then in here, we just pass xq, and I'll say that I want my top k value to be
00:17:20.960 | three. Now, remember, if we were using post-filtering here, if we set top k value of
00:17:30.080 | three, we would probably return less than three. So, with post-filtering, we would want to set
00:17:35.680 | something stupidly high here, just to get maybe three samples, if we're lucky.
00:17:42.000 | But, as we're using single-stage filtering, we only need to set top k equal to three.
00:17:48.720 | So, we'll execute that, and let's return the results.
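(A sketch of that query, assuming the older pinecone-client request format used in the video, where the vector is converted to a plain Python list first.)

```python
# convert the NumPy query vector into a plain list
xq = xq.tolist()

# retrieve the top 3 most similar vectors from the index
result = index.query(queries=[xq], top_k=3)
```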
00:17:54.320 | And you see here that we get our IDs. So, we get this ID, one, two, and three.
00:18:04.240 | So, what we now want to do is we want to map that back to the data that we have stored locally.
00:18:09.360 | To do that, we're going to write IDs equals i['id'] for i in the results. So, we're just
00:18:19.600 | getting these IDs here. So, we go into the result. We need to enter into the results key.
00:18:30.080 | We want to access the first position in that list, and then, in there, we want to access the matches.
00:18:38.240 | And then, from there, we'll print IDs, see what we get. Okay, we get those three
00:18:48.000 | IDs. And now, what we want to do is use the data that I imported up here.
00:18:58.240 | Just here. And we're going to use that to print out whatever it is that these IDs are referring
00:19:07.440 | to. Now, what we have in our data at the moment is a big list, which is not that useful. So,
00:19:15.120 | it would be more useful if we just reformat that into a dictionary. So, I'll do that quickly. We're
00:19:22.160 | just going to write get_sample as a dictionary keyed by x['id']. And then, in here, we will store our context
00:19:34.800 | and metadata. So, context and metadata. We don't need to store the vector in here,
00:19:48.960 | because we can't read that anyway. So, it's not that useful for us. Let me say, for x in data.
00:19:58.880 | Okay, so it's not context. Let me come up here and have a look at what we have.
00:20:12.960 | I think it's text. Yeah, okay. So, let's change that to text here and here.
00:20:19.360 | Okay, so now we can do, for i in IDs, I want to get the sample. So, get_sample[i].
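(Putting those steps together, something like the following; the exact key names are assumptions based on what is shown on screen.)

```python
# pull the returned IDs out of the match objects
ids = [i["id"] for i in result["results"][0]["matches"]]

# reformat the local list of records into an id -> record dictionary
get_sample = {
    x["id"]: {"text": x["text"], "metadata": x["metadata"]} for x in data
}

# look up and print the text and metadata behind each returned ID
for i in ids:
    print(get_sample[i])
```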
00:20:35.680 | And we'll just print that out. So, we see here that the first one we get is Italian.
00:20:46.480 | And the translation for this is something to do with the College of Engineering
00:20:53.840 | was instituted in 1920. So, we have college, engineering, that's good.
00:21:00.160 | And then we also have something along the lines of the College of Science from the 1870s. So,
00:21:07.840 | generally, this looks pretty relevant, I think. And then, down here, we have, this is Italian
00:21:16.640 | again, but we also have the English translation of this here as well. So, we can see straight away,
00:21:23.920 | School of Engineering, Public Engineering School, founded 1891, and it offered engineering degrees
00:21:31.520 | as early as 1873. So, that's, again, pretty relevant. Now, I don't understand Italian.
00:21:41.120 | And so, my first filter here would probably be, okay, I only want to return the English results.
00:21:49.440 | So, let's go ahead and do that. So, I'm going to say results equals, and let's just copy what we
00:21:56.640 | had up here. So, we're not repeating ourselves. We just want to take the index query. We can
00:22:04.240 | include result. No, we don't need it. Let's just take that. And all we need to do now is add our
00:22:15.920 | filter. So, we just write filter. And then, in here, we want to write. So, we have our metadata,
00:22:23.120 | and we have our language. And we want to say that this must be equal to, so we use $eq with en,
00:22:31.360 | which is English. So, we get our results, and we're going to want to do the same thing again.
00:22:40.800 | So, we want to get our IDs, and we want to print those out.
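(The same query with the metadata filter added; a sketch assuming the metadata field is called "language" with values like "en" and "it".)

```python
# restrict the search to vectors whose metadata has language == "en"
result = index.query(
    queries=[xq],
    top_k=3,
    filter={"language": {"$eq": "en"}},
)

ids = [i["id"] for i in result["results"][0]["matches"]]
print(ids)
```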
00:22:50.080 | And there we go. So, now, we're just getting English results. Now, that was pretty fast. So,
00:22:59.200 | I think what is quite useful is to see how fast those two searches were.
00:23:06.640 | Now, obviously, we're getting pretty relevant results here, where, again, we're returning three
00:23:11.440 | results, even with our filter applied. So, that's good. So, it seems like we're getting the accuracy
00:23:19.360 | of pre-filtering here. And let's have a look at the speed difference between the two approaches.
00:23:27.840 | Now, we shouldn't see anything particularly major, because this is a very small index. We only have,
00:23:33.920 | I think, 40,000 vectors here. So, we won't see anything significant.
00:23:40.000 | But at least we can check that we're not getting anything slow.
00:23:46.080 | So, let's have a look. And you see here that we're actually getting a slightly faster response
00:24:02.640 | when we filter. And this is typical with Pinecone single-stage filtering. When we
00:24:09.760 | add a filter, usually, we'll actually get faster results, which is pretty insane. So,
00:24:16.080 | not only are we getting good speed, like post-filtering, but we're actually
00:24:21.600 | making our search faster by adding a filter, which is something neither post-filtering nor pre-filtering
00:24:30.560 | can do. And again, at the same time, we're still getting that accuracy of pre-filtering. So,
00:24:35.040 | this is, in my opinion, pretty impressive. Now, we might also want to add another filter. So,
00:24:46.480 | at the moment, we're just adding one filter, which is fine. It works. But let's say I look
00:24:54.240 | at my results. And I know this is hard to read. But in here, we have the topic. We have University
00:25:00.400 | of Kansas. OK, fine. Maybe I'm not interested in the University of Kansas. So, how about here?
00:25:06.560 | We have University of Notre Dame. Let's say I'm not even interested in these guys either.
00:25:10.000 | Institute of Technology, let's say, OK, yeah, we can keep them. That's fine.
00:25:16.240 | So, I want to say, OK, I want everything that is one, in English, and two,
00:25:22.880 | not from the University of Kansas and not from the University of Notre Dame. So, to do that,
00:25:30.880 | I need to add another condition to my filter. So, to do that, all I need to do is say topic
00:25:41.520 | is not in this time. So, we're going to use the $nin (not in) operator. And then we pass a list here. So,
00:25:48.320 | a list of what we don't want to see, which was the University of Notre Dame
00:25:57.120 | and also the University of Kansas. OK, so let's add those two. And let's see what we get.
00:26:10.480 | So, again, it seemed pretty fast. And we're getting University of Kansas here. So,
00:26:15.760 | that must mean that I have written something wrong. So, I think here, the topic filter
00:26:26.240 | in my Pinecone index is actually maybe called title. Let's see. And this is also wrong. So,
00:26:35.440 | let's correct that. OK, so now we're getting something different. So, yes, this should be
00:26:45.440 | title in reality. So, Institute of Technology. Institute of Technology. And where is our other
00:26:53.440 | one? Institute of Technology here. Now, we're not returning University of Kansas and we're
00:26:58.560 | not returning University of Notre Dame, which is what we wanted. Now, there was also the
00:27:05.600 | date filter that I wanted to show you as well. So, we don't only need to filter based on strings,
00:27:12.240 | we can also filter based on numeric datetimes. And for me to show you this, I think it's best
00:27:20.640 | if we ... It would also be better if we include our metadata here as well. So,
00:27:29.760 | we can just see it directly from our results. So, we know that we're returning relevant text.
00:27:36.000 | So, now let's just have a look at the metadata. So, I'm going to include metadata there.
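(A sketch of the combined query, with the corrected "title" field, the $nin condition, and metadata included in the response.)

```python
result = index.query(
    queries=[xq],
    top_k=3,
    include_metadata=True,  # return each match's metadata alongside its ID
    filter={
        "language": {"$eq": "en"},
        # exclude records whose title is in this list
        "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
    },
)
```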
00:27:47.200 | Let's just see what we get. So, we see now that we actually include the metadata in our results,
00:27:55.280 | which is also pretty cool. Now, we have a date, which is just a numeric value here. It's just
00:28:02.960 | something very simple that was just randomly generated. There's no actual relation between the date and
00:28:08.960 | this record. It's completely random. And we can see, okay, we have a date from 2016, 2008,
00:28:18.000 | and we also have 2020 here as well. Now, the first thing that I might want to do is say,
00:28:24.960 | okay, I want to return only the more recent date. So, let's say, okay, we add, we keep all of that,
00:28:33.440 | all the other filters in there. And we might say, okay, but we also want date
00:28:41.040 | to be greater than or equal to, let's say, what do we have here? We have
00:28:54.960 | what is the most recent. So, we have this 2021. Let's say we want to go for ones that are,
00:29:01.280 | let's say, 2018 onwards for now. So, 20180101. Okay. So, the very first day of 2018.
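(Adding the numeric date condition to the same filter; treating the dates as integers of the form YYYYMMDD is an assumption based on what appears in the results.)

```python
result = index.query(
    queries=[xq],
    top_k=3,
    include_metadata=True,
    filter={
        "language": {"$eq": "en"},
        "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
        # keep only records dated on or after the first day of 2018
        "date": {"$gte": 20180101},
    },
)
```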
00:29:13.120 | Let's search and see what we get. And we can see, yep, it's definitely filtering correctly there.
00:29:22.800 | Now, let's have a look at what the search time is for that.
00:29:25.680 | So, adding quite a few filter conditions here. So, let's just see what we get.
00:29:36.720 | We should also exclude that.
00:29:40.800 | And you see, it's actually slightly faster again, which is, again, it's pretty cool.
00:29:51.520 | But like I said, it's a small data set. When we do this on bigger data sets, the
00:29:57.760 | difference can be huge. Now, what we can also do is we can actually add another condition
00:30:05.520 | within our date here. So, we can say, okay, we want it to be greater than or equals to 2018.
00:30:12.400 | But let's say we want to search for records only in 2018. So, we might also say, okay,
00:30:19.920 | we want it to be greater than 2018, or the first day of 2018. We also want it to be less than or
00:30:26.400 | equal to the very last day in 2018. So, 20181231. And we will filter. And we see that now
00:30:39.920 | we're only returning records from 2018.
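(The same date condition extended into a range, again assuming integer dates of the form YYYYMMDD.)

```python
result = index.query(
    queries=[xq],
    top_k=3,
    include_metadata=True,
    filter={
        "language": {"$eq": "en"},
        "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
        # restrict the date to the year 2018 only
        "date": {"$gte": 20180101, "$lte": 20181231},
    },
)
```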
00:30:49.600 | So, again, super cool, and I think an incredibly useful functionality for vector similarity search.
00:30:59.440 | Now, we were just using a very small data set there, so I couldn't really show you how impressive the speedup can be when we're applying
00:31:06.960 | filters. But I do have this other index. Now, I'm not going to go through just coding everything,
00:31:12.720 | because it's pretty straightforward. We have an index here, which is 1.2 million vectors,
00:31:21.680 | and it has a single metadata field in there, which I've called Tag1. And that's just a
00:31:27.840 | randomly generated number or integer from 0 to 100. So, we, of course, initialize the connection
00:31:36.560 | to our index in the first cell up here. And then over here, I'm just creating a random query vector.
00:31:47.280 | So, first, this here is our unfiltered search. So, we get this 79.2 milliseconds. Now, again,
00:31:57.840 | most of this is network latency, rather than the search time in the actual index.
00:32:05.360 | But we will see the search time decrease pretty dramatically here. So, first, we'll say, okay,
00:32:11.600 | we want Tag1 to be greater than 30. So, we're going from 0 to 100. So, we're roughly removing
00:32:19.760 | probably about 30% of the vectors from our search. And we can see, okay, we just shaved off
00:32:26.400 | 8 milliseconds, which is impressive. And then we take that even further. So, we say, okay,
00:32:33.840 | we want it greater than 70. So, now, we're shaving off around 70% of our vectors.
00:32:38.880 | And our search time goes down to 56.6 milliseconds. Do it even further. So, about 90% here,
00:32:47.840 | we go down to 54 milliseconds. And then here, I'm using the equals sign here. So,
00:32:53.440 | I'm only searching for about 1% of the index. And it goes down to 51.6 milliseconds. So,
00:33:02.720 | incredibly impressive speed up there. And this is kind of what it looks like.
00:33:08.720 | So, we have the Tag1 GT value or greater than value on the left. And as we increase that
00:33:18.080 | up this way, our time, our search time in milliseconds goes down.
00:33:26.000 | Now, it is a little bit bumpy. It goes up and down a lot. I've tried to showcase that
00:33:33.120 | in this graph. But the trend is quite clearly downwards. So, the more we filter, the faster
00:33:42.080 | our search, which is incredible. Now, that's it for this video covering pre-filtering,
00:33:50.080 | post-filtering, and Pinecone's new single stage filtering. I hope this has been useful and
00:33:57.680 | insightful. If you are interested in testing Pinecone out yourself, there is a link to Pinecone's
00:34:05.840 | website in the description. But we'll leave it there for now. Thank you very much for watching.
00:34:11.520 | And I'll see you in the next one.