Metadata Filtering for Vector Search + Latest Filter Tech
Chapters
0:00 Intro
0:24 Vector Search Recap
2:03 Why Filter?
2:56 Metadata Filtering 101
7:48 Pre-filtering
9:37 Post-filtering
11:30 Single-Stage Filtering
12:22 Vectors and Metadata Code
13:58 Connecting to Pinecone
14:55 Building Query Vector
16:47 Querying
21:37 First Filter
24:40 Adding More Conditions
27:03 Filtering with Numbers
30:55 Search Speed and Filtering
33:44 Outro
00:00:00.000 |
Hi, welcome to the video. We're going to be exploring 00:00:04.160 |
two of the common methods that we can use to filter indexes in Vector Similarity Search. 00:00:12.800 |
And then we're also going to explore Pinecone's new solution to filtering in Vector Search. 00:00:25.040 |
Now, in Vector Similarity Search, what we do is build representations of data, so that could be 00:00:35.440 |
text, images, or cooking recipes, and we convert them into vectors. We then store those vectors 00:00:45.200 |
in an index, and what we typically want to do is perform some kind of search or comparison of all 00:00:52.560 |
the vectors within that index. So, for example, if you found this video or article through Google or 00:01:01.200 |
YouTube, you will have typed something, some sort of query, into one of those search engines. So, 00:01:10.080 |
maybe something like, "How do I filter in Vector Similarity Search?" Then whichever search engine 00:01:19.120 |
you use most likely converted your query into a vector representation, and then it took that 00:01:28.880 |
vector representation and compared it to all of the other vectors in its vector index. So, 00:01:37.120 |
those could be pages, videos, and so on. It could be anything, really. And out of all of those 00:01:45.120 |
index vectors, it was this video or this article which seemed to be one of the most similar vectors 00:01:53.920 |
to your query vector. And so, you were served this video or this article near the top of your search, 00:02:00.560 |
you clicked on it, and now here we are. In search and recommender systems, 00:02:06.400 |
there's almost always a need to apply some sort of filter. On Google, we can search based on 00:02:13.120 |
categories such as news or shopping. We can search by date. We can search by language or 00:02:20.160 |
region. And likewise, Netflix, Amazon, Spotify might want to compare users in specific regions. 00:02:29.520 |
So, restricting the search scope to relevant vectors is, in many cases, an absolute necessity. 00:02:39.200 |
And despite the very clear need for filtering, there isn't a particularly good approach for 00:02:50.000 |
doing so. So, let's start having a look at the different types of filters available to us. 00:02:56.160 |
So, during the video, we'll be covering each one of those. We have pre-filtering, post-filtering, 00:03:02.960 |
and Pinecone's new single-stage filtering. Now, what I want to just do here is have a look at 00:03:11.040 |
what metadata filtering actually is. So, when we have a vector index, each vector is going to be 00:03:18.960 |
assigned some sort of metadata. And that can be anything. It could be a number, a date, text. 00:03:26.960 |
It can really be anything that we can use to filter our search. And what we want to do is 00:03:35.440 |
search where this or some condition is true. So, for example, say we have a big corporation, 00:03:43.920 |
they have all these different departments, and there are loads of internal documents within 00:03:51.200 |
that corporation. Some of those documents are assigned to the engineering department, some of 00:03:56.640 |
them are assigned to HR, and so on. A user in that company might want to go into their search, 00:04:04.800 |
and sometimes they may want to search across all departments, but sometimes they might want to 00:04:10.720 |
apply some sort of filter. So, they might want to say, "I want the top K documents where the 00:04:18.480 |
department is equal to engineering," or, "I want the top K where the department is not HR." 00:04:26.000 |
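To make that concrete, here's a rough sketch of what those two conditions might look like written down, using the MongoDB-style operators that Pinecone's metadata filters (which we use later in the video) are based on; the department field name is just the hypothetical one from this example.

```python
# Hypothetical metadata filter expressions for the corporate-documents example.
# The "department" field is illustrative; field names must match your metadata.

# "top K documents where the department is equal to engineering"
filter_engineering = {"department": {"$eq": "engineering"}}

# "top K documents where the department is not HR"
filter_not_hr = {"department": {"$ne": "hr"}}
```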
And we can apply anything in these metadata filters. So, we may want documents that are 00:04:35.600 |
quite recent. So, what we might say is we want the top K documents where the date is greater than or 00:04:43.920 |
equal to 14 days ago. And then we can sort of mix and match all of those different metadata filters 00:04:52.080 |
as well. Now, to implement a metadata filter, we need two things really. We need our vector index, 00:05:02.640 |
which is what you can see at the top here, and we also need our metadata index. Now, each of these 00:05:11.120 |
will be paired one by one. So, each vector will have its own metadata record, and what we would 00:05:19.520 |
do is we'd apply a condition to our metadata index, and that would remove a few of these. 00:05:28.480 |
So, we'd get rid of these, and then based on what we have removed here, 00:05:35.120 |
we would also remove those equivalent vectors from our vector index. And that's how our 00:05:42.720 |
filter would get applied. But there are different orders and different ways of doing this, as we 00:05:51.120 |
will take a look at now. So, the first one of those is we could have a look at a post filter. 00:06:01.280 |
So, a post filter is nothing more than we take our query vector, which is over here, 00:06:09.760 |
and we also take our metadata query over here. Now, we start by performing an approximate search 00:06:20.160 |
between our query vector and all of our index vectors over here, and we get the top K matches 00:06:28.320 |
there. So, let's say we wanted maybe 10 of those. And then what we do is we then add in our metadata 00:06:37.600 |
query. So, we add that in over here, and that creates a filter for us. So, then we filter those 00:06:46.640 |
remaining vectors through our new filter, and that leaves us with the filtered top K. Now, 00:06:55.680 |
in this case, the number of matches we get back is usually not going to be the top K we asked for. So, 00:07:00.960 |
say over here, we have 10 top K matches. Here, we filter some of those out, so we might end up with 00:07:08.000 |
four. And then we also have pre-filtering. Now, pre-filtering, we change the order a little bit. 00:07:14.000 |
We apply our filter before we do the search. So, we have our metadata query over here. 00:07:21.840 |
We use that to create a filter over here. We then apply that to our full vector index over here, 00:07:31.920 |
and that leaves us with so many of our vectors. And then what we do is we search based on that, 00:07:37.760 |
but because we are no longer searching through the full dataset, we can't do approximate 00:07:44.320 |
nearest neighbor search. So, we have to do an exhaustive search at this point. 00:07:48.640 |
Now, let's start with the pre-filter process. So, here, like we saw before, we start with our 00:07:55.680 |
metadata index. We apply our filter to this, identifying which positions satisfy our filter 00:08:03.440 |
condition, and then we use this filtered metadata index to filter out the vectors which do not 00:08:11.680 |
satisfy our condition. And as we kind of saw before, this is where the issue with pre-filtering 00:08:19.840 |
comes in. Because we have just filtered out some of the vectors or many of the vectors in our index, 00:08:26.480 |
we no longer have the same index as what we started with. And we need the full index to 00:08:33.600 |
apply our approximate search on. But as soon as we filter, we change the structure of that index. So, 00:08:41.440 |
we can no longer perform an approximate nearest neighbor search, which means we're just doing 00:08:46.960 |
a brute force, exhaustive k-nearest neighbor search. Now, if our index is very small or the 00:08:55.360 |
number of vectors that we have output after our filter is very small, this is probably okay. 00:09:04.160 |
But as soon as we start working with big data sets, this is not going to be very manageable. 00:09:11.440 |
And the only other alternative that we have here is to build an index for every possible 00:09:20.320 |
filter outcome, which is not really an option because it's just simply not realistic to build 00:09:28.400 |
that many indexes. So, pre-filtering, we have good accuracy, but it's very slow. 00:09:36.480 |
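To make the two orderings concrete, here's a minimal NumPy sketch of pre-filtering versus post-filtering over a toy index. This is purely illustrative, not how any real vector database implements its index; the brute-force scoring below simply stands in for the approximate search we'd normally use.

```python
import numpy as np

# toy index: 10,000 vectors, each paired with one metadata value
rng = np.random.default_rng(0)
index_vectors = rng.normal(size=(10_000, 64))
metadata = rng.choice(["engineering", "hr", "finance"], size=10_000)
xq = rng.normal(size=64)  # query vector

def scores_against(vectors, q):
    # exhaustive cosine similarity of q against every vector given
    return (vectors @ q) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))

def pre_filter_search(condition, top_k=10):
    # 1) apply the metadata condition first, 2) exhaustively score only the survivors
    keep = np.array([condition(m) for m in metadata])
    candidate_ids = np.where(keep)[0]
    order = np.argsort(-scores_against(index_vectors[candidate_ids], xq))[:top_k]
    return candidate_ids[order]  # always relevant matches, but brute-force (slow at scale)

def post_filter_search(condition, top_k=10):
    # 1) take top_k from the FULL index (an approximate search in a real system),
    # 2) then drop anything that fails the metadata condition
    top = np.argsort(-scores_against(index_vectors, xq))[:top_k]
    return [i for i in top if condition(metadata[i])]  # may return fewer than top_k

print(pre_filter_search(lambda m: m == "engineering"))
print(post_filter_search(lambda m: m == "engineering"))
```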
Now, post-filtering is, of course, slightly different. So, in this case, we start with our 00:09:44.560 |
vector index. Now, we can perform our approximate nearest neighbor search because we have the full 00:09:50.800 |
index. We haven't filtered anything yet. And that returns the top k vectors that we want. So, 00:09:57.840 |
say we want 10 vectors at this point. And then what we do is we find all the vectors 00:10:05.440 |
through our metadata index that satisfy whatever metadata condition we have set. 00:10:10.640 |
And then we apply the filter to those top k vectors, so 10 vectors maybe. And, of course, 00:10:18.560 |
at this point, what we are doing is we're reducing the number of vectors that we get out. So, we 00:10:23.520 |
don't actually get 10 vectors. We, for example, could get four vectors. And in the worst case 00:10:30.320 |
scenario, that filter could rule out all of the vectors that we've returned. And in the end, we 00:10:36.800 |
return nothing, even when in the index there could be some relevant vectors, which is obviously not 00:10:46.480 |
very ideal. We can try and eliminate this problem by just increasing k a lot. So, of course, if we 00:10:54.640 |
use a low k value, the chances of all of them being excluded when we apply our filter post 00:11:02.320 |
search is reasonably high. But if we increase k up to 1 million, it's much lower. But the only 00:11:11.200 |
problem with that is that our search becomes very slow. And the more we increase k to eliminate the 00:11:19.200 |
problem, the slower it gets. So, in this case, we have unreliable accuracy and performance. 00:11:26.160 |
But it is faster, unless we increase k. So, now let's introduce single-stage 00:11:34.800 |
filtering by Pinecone. Now, we are going to go through some code and test this. 00:11:40.400 |
But first, I just want to introduce what it is at a high level. So, it's a new filter built by 00:11:48.320 |
Pinecone. And at a high level, it works by merging the vector and metadata indexes. And it allows us 00:11:56.960 |
to filter and then do an approximate nearest neighbor search. So, what we get there is the 00:12:04.960 |
accuracy of pre-filtering. And at the same time, the search speed is often faster than post-filtering 00:12:12.640 |
as well. So, we really do get the best of both with this new filter. But let's go and actually 00:12:20.800 |
try it out. Okay. So, we're going to be using Pinecone here. And all I've done here is imported 00:12:29.120 |
Pinecone, imported json, and I've imported my data here. So, this data, I've already uploaded or 00:12:38.240 |
upserted to my Pinecone index. And what it is, is just the SQuAD dataset in both English and 00:12:49.040 |
Italian. Now, in there, we have a few different items. So, we have the record ID, we have the 00:13:00.560 |
text, although I've just stored this locally. We have the vector, which has been upserted to Pinecone. 00:13:08.560 |
And then we also have the metadata, which is with Pinecone as well. Now, if we take a look at what 00:13:13.840 |
we have in the metadata, we see that we have the language. So, we have either English or Italian. 00:13:26.000 |
And then we also have the topic. Now, what I want to do is just test the new filtering. So, 00:13:34.880 |
we're going to be filtering based on language, topic. And we also have another metadata item here, 00:13:43.440 |
which I don't have locally, which is just a randomly generated date. So, we can have a look 00:13:49.920 |
at using some of the greater than or equals to, less than equals to filters that we can use in 00:13:57.440 |
Pinecone. So, the first thing I'm going to do is initialize my connection to Pinecone. So, 00:14:06.240 |
I write pinecone.init. I need to pass my API key, which I've loaded above, 00:14:13.520 |
and also the environment that I am working in. Now, of course, this will be different and depend 00:14:24.000 |
on which environment you are using. So, I've initialized there, and what I can do is I can now 00:14:34.000 |
create a direct connection to a specific index within my Pinecone environment. Now, 00:14:40.320 |
I'm going to be connecting to one that I've already made, which is called squad test. 00:14:47.200 |
And now what I'll be able to do is use this index object to perform my queries. 00:14:54.800 |
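Roughly, that connection code looks like this; it follows the older pinecone-client style shown in the video, and the API key, environment, and index name ("squad-test") are placeholders you'd swap for your own values.

```python
import pinecone

# initialize the connection to Pinecone (key and environment are placeholders)
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENVIRONMENT",  # whichever environment your project runs in
)

# connect to the existing index; assuming it was created as "squad-test"
index = pinecone.Index("squad-test")
```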
So, we're going to be performing a vector search here. So, what we need first is a query vector 00:15:03.120 |
to perform our search with. Now, I use the sentence transformers library to encode the 00:15:11.920 |
already indexed vectors. So, what we're going to do is use the same model to encode our query vector. 00:15:21.760 |
So, we write SentenceTransformer to load our embedding model. 00:15:30.240 |
And I use the stsb-xlm-r-multilingual model. So, I will need to download this. 00:15:53.120 |
And then what we want to do is create our query vector. So, I'm going to assign it to xq, 00:16:04.000 |
which is model.encode, and then I pass in the query that I would like to perform. So, in this case, 00:16:17.760 |
I'm going to search for context in our dataset, which mentions something along the lines of early 00:16:27.600 |
engineering courses provided by American universities in the 1870s. So, I will execute that. 00:16:35.360 |
And note that we're using a multilingual model here. So, we should find that we will return 00:16:44.960 |
both English and Italian results, but both of them should be something similar to this topic. 00:16:53.040 |
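Here's a sketch of building that query vector with the sentence-transformers library; it assumes the library is installed and uses the model named above.

```python
from sentence_transformers import SentenceTransformer

# load the same multilingual model that was used to encode the indexed passages
model = SentenceTransformer("stsb-xlm-r-multilingual")

# encode the natural-language query into a dense vector
query = ("early engineering courses provided by American universities "
         "in the 1870s")
xq = model.encode(query)
```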
So, we will return our results. We just write index.query, 00:16:59.600 |
and what we can also do before we even do that is we'll just convert xq into the format that we need. 00:17:12.000 |
So, like this. And then in here, we just pass xq, and I'll say that I want my top k value to be 00:17:20.960 |
three. Now, remember, if we were using post-filtering here, if we set top k value of 00:17:30.080 |
three, we would probably return less than three. So, with post-filtering, we would want to set 00:17:35.680 |
something stupidly high here, just to get maybe three samples, if we're lucky. 00:17:42.000 |
But, as we're using single-stage filtering, we only need to set top k equal to three. 00:17:48.720 |
So, we'll execute that, and let's return the results. 00:17:54.320 |
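Something along these lines; this sketch follows the older client's query format, where the vector goes in as a plain Python list and the response comes back under a results key, so the exact call may differ slightly on newer client versions.

```python
# convert the NumPy query vector into a plain list and query the index
xq_list = xq.tolist()

results = index.query(queries=[xq_list], top_k=3)
print(results)
```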
And you see here that we get our IDs. So, we get this ID, one, two, and three. 00:18:04.240 |
So, what we now want to do is we want to map that back to the data that we have stored locally. 00:18:09.360 |
To do that, we're going to write ids equals i's ID for each i in the results. So, we're just 00:18:19.600 |
getting these IDs here. So, we go into results. We need to enter into the results key. 00:18:30.080 |
We want to access the first position in that list, and then, in there, we want to access the matches. 00:18:38.240 |
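As a sketch, that ID extraction looks something like this, assuming the response layout just described (a results list with one entry per query, each holding its matches).

```python
# pull the match IDs out of the response
ids = [match["id"] for match in results["results"][0]["matches"]]
print(ids)
```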
And then, from there, we'll print IDs, see what we get. Okay, we get those three 00:18:48.000 |
IDs. And now, what we want to do is use the data that I imported up here. 00:18:58.240 |
Just here. And we're going to use that to print out whatever it is that these IDs are referring 00:19:07.440 |
to. Now, what we have in our data at the moment is a big list, which is not that useful. So, 00:19:15.120 |
it would be more useful if we just reformat that into a dictionary. So, I'll do that quickly. We're 00:19:22.160 |
just going to write get_sample, which is equal to a dictionary keyed by x's ID. And then, in here, we will store our context 00:19:34.800 |
and metadata. So, context and metadata. We don't need to store the vector in here, 00:19:48.960 |
because we can't read that anyway. So, it's not that useful for us. Let me say, for x in data. 00:19:58.880 |
Okay, so it's not context. Let me come up here and have a look at what we have. 00:20:12.960 |
I think it's text. Yeah, okay. So, let's change that to text here and here. 00:20:19.360 |
Okay, so now we can do, for i in ids, I want to get the sample. So, get_sample[i]. 00:20:35.680 |
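Here's a sketch of that reshaping step and the lookup loop; it assumes each local record carries id, text, and metadata keys as described above.

```python
# reshape the local list of records into a dictionary keyed by record ID,
# keeping only the text and metadata (the raw vector isn't human-readable)
get_sample = {
    x["id"]: {"text": x["text"], "metadata": x["metadata"]}
    for x in data
}

# look up and print whatever each returned ID refers to
for i in ids:
    print(get_sample[i])
```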
And we'll just print that out. So, we see here that the first one we get is Italian. 00:20:46.480 |
And the translation for this is something to do with the College of Engineering 00:20:53.840 |
was instituted in 1920. So, we have college, engineering, that's good. 00:21:00.160 |
And then we also have something along the lines of the College of Science from the 1870s. So, 00:21:07.840 |
generally, this looks pretty relevant, I think. And then, down here, we have, this is Italian 00:21:16.640 |
again, but we also have the English translation of this here as well. So, we can see straight away, 00:21:23.920 |
School of Engineering, Public Engineering School, founded 1891, and it offered engineering degrees 00:21:31.520 |
as early as 1873. So, that's, again, pretty relevant. Now, I don't understand Italian. 00:21:41.120 |
And so, my first filter here would probably be, okay, I only want to return the English results. 00:21:49.440 |
So, let's go ahead and do that. So, I'm going to say results equals, and let's just copy what we 00:21:56.640 |
had up here. So, we're not repeating ourselves. We just want to take the index query. We can 00:22:04.240 |
include result. No, we don't need it. Let's just take that. And all we need to do now is add our 00:22:15.920 |
filter. So, we just write filter. And then, in here, we want to write. So, we have our metadata, 00:22:23.120 |
and we have our language. And we want to say that this must be equal to, so we use $eq with en, 00:22:31.360 |
which is English. So, we get our results, and we're going to want to do the same thing again. 00:22:40.800 |
So, we want to get our IDs, and we want to print those out. 00:22:50.080 |
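Put together, the filtered query is a sketch like this, assuming the metadata field is called language and the values are stored as short codes like en.

```python
# same query, but restricted to records whose language metadata equals "en"
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={"language": {"$eq": "en"}},
)

ids = [match["id"] for match in results["results"][0]["matches"]]
for i in ids:
    print(get_sample[i])
```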
And there we go. So, now, we're just getting English results. Now, that was pretty fast. So, 00:22:59.200 |
I think what is quite useful is to see how fast those two searches were. 00:23:06.640 |
Now, obviously, we're getting pretty relevant results here, where, again, we're returning three 00:23:11.440 |
results, even with our filter applied. So, that's good. So, it seems like we're getting the accuracy 00:23:19.360 |
of pre-filtering here. And let's have a look at the speed difference between the two approaches. 00:23:27.840 |
Now, we shouldn't see anything particularly major, because this is a very small index. We only have, 00:23:33.920 |
I think, 40,000 vectors here. So, we won't see anything significant. 00:23:40.000 |
But at least we can check that we're not getting anything slow. 00:23:46.080 |
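One simple way to compare them, assuming we're working in a Jupyter notebook, is the %timeit magic:

```python
# time the unfiltered query
%timeit index.query(queries=[xq_list], top_k=3)

# time the same query with the language filter applied
%timeit index.query(queries=[xq_list], top_k=3, filter={"language": {"$eq": "en"}})
```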
So, let's have a look. And you see here that we're actually getting a slightly faster response 00:24:02.640 |
when we filter. And this is typical with Pinecone single-stage filtering. When we 00:24:09.760 |
add a filter, usually, we'll actually get faster results, which is pretty insane. So, 00:24:16.080 |
not only are we getting good speed, like post-filtering, but we're actually 00:24:21.600 |
making our search faster by adding a filter, which is something neither post-filtering nor pre-filtering can do 00:24:30.560 |
that. And again, at the same time, we're still getting that accuracy of pre-filtering. So, 00:24:35.040 |
this is, in my opinion, pretty impressive. Now, we might also want to add another filter. So, 00:24:46.480 |
at the moment, we're just adding one filter, which is fine. It works. But let's say I look 00:24:54.240 |
at my results. And I know this is hard to read. But in here, we have the topic. We have University 00:25:00.400 |
of Kansas. OK, fine. Maybe I'm not interested in the University of Kansas. So, how about here? 00:25:06.560 |
We have University of Notre Dame. Let's say I'm not even interested in these guys either. 00:25:10.000 |
Institute of Technology, let's say, OK, yeah, we can keep them. That's fine. 00:25:16.240 |
So, I want to say, OK, I want everything that is one, in English, and two, 00:25:22.880 |
not from the University of Kansas and not from the University of Notre Dame. So, to do that, 00:25:30.880 |
I need to add another condition to my filter. So, to do that, all I need to do is say topic 00:25:41.520 |
is not in this time. So, we're going to say $nin, not in. And then we pass a list here. So, 00:25:48.320 |
a list of what we don't want to see, which was the University of Notre Dame 00:25:57.120 |
and also the University of Kansas. OK, so let's add those two. And let's see what we get. 00:26:10.480 |
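As a sketch, the extended filter looks like this; note that the field name here follows the narration (topic), and as we find out a moment later, in this particular index it is actually stored as title.

```python
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={
        "language": {"$eq": "en"},
        # exclude these two; the field name must match what was upserted
        "topic": {"$nin": ["University of Notre Dame", "University of Kansas"]},
    },
)
```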
So, again, it seemed pretty fast. And we're getting University of Kansas here. So, 00:26:15.760 |
that must mean that I have written something wrong. So, I think here, the topic filter 00:26:26.240 |
in my Pinecone index is actually maybe called title. Let's see. And this is also wrong. So, 00:26:35.440 |
let's correct that. OK, so now we're getting something different. So, yes, this should be 00:26:45.440 |
title in reality. So, Institute of Technology. Institute of Technology. And where is our other 00:26:53.440 |
one? Institute of Technology here. Now, we're not returning University of Kansas and we're 00:26:58.560 |
not returning University of Notre Dame, which is what we wanted. Now, there was also the 00:27:05.600 |
date filter that I wanted to show you as well. So, we don't only need to filter based on strings, 00:27:12.240 |
we can also filter based on numeric datetimes. And for me to show you this, I think it's best 00:27:20.640 |
if we ... It would also be better if we include our metadata here as well. So, 00:27:29.760 |
we can just see it directly from our results. So, we know that we're returning relevant text. 00:27:36.000 |
So, now let's just have a look at the metadata. So, I'm going to include metadata there. 00:27:47.200 |
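Returning the metadata with each match is just an extra flag on the query; here's a sketch, using the corrected title field.

```python
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={
        "language": {"$eq": "en"},
        "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
    },
    include_metadata=True,  # return each match's metadata alongside its ID and score
)
```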
Let's just see what we get. So, we see now that we actually include the metadata in our results, 00:27:55.280 |
which is also pretty cool. Now, we have a date, which is just a numeric value here. It's just 00:28:02.960 |
something very simple. It's just randomly generated. There's no actual relation between the date and 00:28:08.960 |
this record. It's completely random. And we can see, okay, we have a date from 2016, 2008, 00:28:18.000 |
and we also have 2020 here as well. Now, the first thing that I might want to do is say, 00:28:24.960 |
okay, I want to return only the more recent dates. So, let's say, okay, we keep all of that, 00:28:33.440 |
all the other filters in there. And we might say, okay, but we also want date 00:28:41.040 |
to be greater than or equal to, let's say, what do we have here? We have 00:28:54.960 |
what is the most recent. So, we have this 2021. Let's say we want to go for ones that are, 00:29:01.280 |
let's say, 2018 onwards for now. So, 20180101. Okay. So, the very first day of 2018. 00:29:13.120 |
Let's search and see what we get. And we can see, yep, it's definitely filtering correctly there. 00:29:22.800 |
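Here's a sketch of the query with the date condition added, assuming the dates were stored as plain YYYYMMDD-style integers as described.

```python
results = index.query(
    queries=[xq_list],
    top_k=3,
    filter={
        "language": {"$eq": "en"},
        "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
        "date": {"$gte": 20180101},  # the very first day of 2018, as a plain number
    },
    include_metadata=True,
)
```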
Now, let's have a look at what the search time is for that. 00:29:25.680 |
So, adding quite a few filter conditions here. So, let's just see what we get. 00:29:40.800 |
And you see, it's actually slightly faster again, which is, again, it's pretty cool. 00:29:51.520 |
But like I said, it's a small data set. When we do this on bigger data sets, the 00:29:57.760 |
difference can be huge. Now, what we can also do is we can actually add another condition 00:30:05.520 |
within our date here. So, we can say, okay, we want it to be greater than or equals to 2018. 00:30:12.400 |
But let's say we want to search for records only in 2018. So, we might also say, okay, 00:30:19.920 |
we want it to be greater than 2018, or the first day of 2018. We also want it to be less than or 00:30:26.400 |
equal to the very last day in 2018. So, 20181231. And we will filter. And we see that now 00:30:39.920 |
we're only returning records from 2018. So, again, super cool, and I think an incredibly 00:30:49.600 |
useful functionality for vector similarity search. Now, we were just using a very small data set 00:30:59.440 |
there. So, I couldn't really show you how impressive the speedup can be when we're applying 00:31:06.960 |
filters. But I do have this other index. Now, I'm not going to go through just coding everything, 00:31:12.720 |
because it's pretty straightforward. We have an index here, which is 1.2 million vectors, 00:31:21.680 |
and it has a single metadata field in there, which I've called Tag1. And that's just a 00:31:27.840 |
randomly generated number or integer from 0 to 100. So, we, of course, initialize the connection 00:31:36.560 |
to our index in the first cell up here. And then over here, I'm just creating a random query vector. 00:31:47.280 |
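The speed test itself is nothing more than timing the same query under progressively tighter filters; here's a rough notebook-style sketch, where the index name and vector dimension are placeholders and Tag1 is the random 0-100 integer described above.

```python
import numpy as np

big_index = pinecone.Index("filter-speed-test")  # placeholder name for the 1.2M-vector index
xq = np.random.rand(512).tolist()                # random query vector; dimension is a placeholder

# unfiltered search first, then progressively more selective filters on Tag1
%timeit big_index.query(queries=[xq], top_k=10)
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$gt": 30}})
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$gt": 70}})
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$gt": 90}})
%timeit big_index.query(queries=[xq], top_k=10, filter={"Tag1": {"$eq": 50}})  # roughly 1% of the index
```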
So, first, this here is our unfiltered search. So, we get this 79.2 milliseconds. Now, again, 00:31:57.840 |
most of this is network latency, rather than the search time in the actual index. 00:32:05.360 |
But we will see the search time decrease pretty dramatically here. So, first, we'll say, okay, 00:32:11.600 |
we want Tag1 to be greater than 30. So, we're going from 0 to 100. So, we're roughly removing 00:32:19.760 |
probably about 30% of the vectors from our search. And we can see, okay, we just shaved off 00:32:26.400 |
8 milliseconds, which is impressive. And then we take that even further. So, we say, okay, 00:32:33.840 |
we want it greater than 70. So, now, we're shaving off around 70% of our vectors. 00:32:38.880 |
And our search time goes down to 56.6 milliseconds. Do it even further. So, about 90% here, 00:32:47.840 |
we go down to 54 milliseconds. And then here, I'm using the equals sign here. So, 00:32:53.440 |
I'm only searching for about 1% of the index. And it goes down to 51.6 milliseconds. So, 00:33:02.720 |
incredibly impressive speed up there. And this is kind of what it looks like. 00:33:08.720 |
So, we have the Tag1 $gt value, or greater-than value, on the left. And as we increase that 00:33:18.080 |
up this way, our time, our search time in milliseconds goes down. 00:33:26.000 |
Now, it is a little bit bumpy. It goes up and down a lot. I've tried to showcase that 00:33:33.120 |
in this graph. But the trend is quite clearly downwards. So, the more we filter, the faster 00:33:42.080 |
our search, which is incredible. Now, that's it for this video covering pre-filtering, 00:33:50.080 |
post-filtering, and Pinecone's new single stage filtering. I hope this has been useful and 00:33:57.680 |
insightful. If you are interested in testing Pinecone out yourself, there is a link to Pinecone's 00:34:05.840 |
website in the description. But we'll leave it there for now. Thank you very much for watching.