Metadata Filtering for Vector Search + Latest Filter Tech

Hi, welcome to the video. We're going to be exploring two of the common methods that we can use to filter indexes in Vector Similarity Search. And then we're also going to explore Pinecone's new solution to filtering in Vector Search. Now, in Vector Similarity Search, what we do is build representations of data, so that could be text, images, or cooking recipes, and we convert them into vectors.

We then store those vectors in an index, and what we typically want to do is perform some kind of search or comparison of all the vectors within that index. So, for example, if you found this video or article through Google or YouTube, you will have typed something, some sort of query, into one of those search engines.

So, maybe something like, "How do I filter in Vector Similarity Search?" Then whichever search engine you use most likely converted your query into a vector representation, and then it took that vector representation and compared it to all of the other vectors in its vector index. So, those could be pages, videos, and so on.

It could be anything, really. And out of all of those index vectors, it was this video or this article which seemed to be one of the most similar vectors to your query vector. And so, you were served this video or this article near the top of your search, you clicked on it, and now here we are.

In search and recommender systems, there's almost always a need to apply some sort of filter. On Google, we can search based on categories such as news or shopping. We can search by date. We can search by language or region. And likewise, Netflix, Amazon, Spotify might want to compare users in specific regions.

So, restricting the search scope to relevant vectors is, in many cases, an absolute necessity. And despite the very clear need for filtering, there hasn't been a particularly good approach for doing so. So, let's start by having a look at the different types of filters available to us. During the video, we'll be covering each one of them.

We have pre-filtering, post-filtering, and Pinecone's new single-stage filtering. Now, what I want to just do here is have a look at what metadata filtering actually is. So, when we have a vector index, each vector is going to be assigned some sort of metadata. And that can be anything. It could be a number, a date, text.

It can really be anything that we can use to filter our search. And what we want to do is search where some condition is true. So, for example, say we have a big corporation, they have all these different departments, and there are loads of internal documents within that corporation.

Some of those documents are assigned to the engineering department, some of them are assigned to HR, and so on. A user in that company might want to go into their search, and sometimes they may want to search across all departments, but sometimes they might want to apply some sort of filter.

So, they might want to say, "I want the top K documents where the department is equal to engineering," or, "I want the top K where the department is not HR." And we can apply anything in these metadata filters. So, we may want documents that are quite recent. So, what we might say is we want the top K documents where the date is greater than or equal to 14 days ago.
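
The three example conditions above can be sketched as plain Python predicates over metadata records. This is only an illustration of the idea, not how any particular vector database stores or evaluates filters; the records, field names, and helper functions are all made up for the example.

```python
from datetime import date, timedelta

today = date.today()

# Hypothetical metadata records; field names and values are illustrative only.
docs = [
    {"department": "engineering", "date": today - timedelta(days=3)},
    {"department": "hr", "date": today - timedelta(days=40)},
    {"department": "finance", "date": today - timedelta(days=10)},
]

cutoff = today - timedelta(days=14)


def is_engineering(meta):
    # "department is equal to engineering"
    return meta["department"] == "engineering"


def not_hr(meta):
    # "department is not HR"
    return meta["department"] != "hr"


def is_recent(meta):
    # "date is greater than or equal to 14 days ago"
    return meta["date"] >= cutoff


def recent_non_hr(meta):
    # Conditions can be mixed and matched with ordinary boolean logic.
    return not_hr(meta) and is_recent(meta)


engineering_positions = [i for i, m in enumerate(docs) if is_engineering(m)]
recent_non_hr_positions = [i for i, m in enumerate(docs) if recent_non_hr(m)]
print(engineering_positions, recent_non_hr_positions)
```

The top K search would then only be allowed to return documents at the positions that pass the chosen condition.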

And then we can sort of mix and match all of those different metadata filters as well. Now, to implement a metadata filter, we need two things really. We need our vector index, which is what you can see at the top here, and we also need our metadata index. Now, each of these will be paired one to one.

So, each vector will have its own metadata record, and what we would do is we'd apply a condition to our metadata index, and that would remove a few of these. So, we'd get rid of these, and then based on what we have removed here, we would also remove those equivalent vectors from our vector index.
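
As a rough sketch of that pairing, assume the vector index and the metadata index are two lists aligned by position. The toy vectors and the "department" field are invented for the example.

```python
# Position i in `vectors` is described by position i in `metadata`.
vectors = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4], [0.7, 0.7]]
metadata = [
    {"department": "engineering"},
    {"department": "hr"},
    {"department": "engineering"},
    {"department": "finance"},
]

# Apply the condition to the metadata index to get a keep/remove mask.
keep = [m["department"] == "engineering" for m in metadata]

# Remove the equivalent vectors from the vector index.
filtered_vectors = [v for v, k in zip(vectors, keep) if k]
print(keep, filtered_vectors)
```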

And that's how our filter would get applied. But there are different orders and different ways of doing this, as we will take a look at now. So, the first one of those is we could have a look at a post filter. So, a post filter is nothing more than we take our query vector, which is over here, and we also take our metadata query over here.

Now, we start by performing an approximate search between our query vector and all of our index vectors over here, and we get the top K matches there. So, let's say we wanted maybe 10 of those. And then what we do is we then add in our metadata query. So, we add that in over here, and that creates a filter for us.

So, then we filter those remaining vectors through our new filter, and that leaves us with the filtered top K. Now, in this case, the number of results we get back is usually not going to be the top K we asked for. So, say over here, we have 10 top K matches.

Here, we filter some of those out, so we might end up with four. And then we also have pre-filtering. Now, pre-filtering, we change the order a little bit. We apply our filter before we do the search. So, we have our metadata query over here. We use that to create a filter over here.

We then apply that to our full vector index over here, and that leaves us with some subset of our vectors. And then what we do is we search based on that, but because we are not searching through the full dataset, we can't do approximate nearest neighbor search. So, we have to do an exhaustive search at this point.

Now, let's start with the pre-filter process. So, here, like we saw before, we start with our metadata index. We apply our filter to this, identifying which positions satisfy our filter condition, and then we use this filtered metadata index to filter out the vectors which do not satisfy our condition.

And as we kind of saw before, this is where the issue with pre-filtering comes in. Because we have just filtered out some of the vectors or many of the vectors in our index, we no longer have the same index as what we started with. And we need the full index to apply our approximate search on.

But as soon as we filter, we change the structure of that index. So, we can no longer perform an approximate nearest neighbor search, which means we're just doing a brute force, exhaustive k-nearest neighbor search. Now, if our index is very small or the number of vectors that we have output after our filter is very small, this is probably okay.

But as soon as we start working with big data sets, this is not going to be very manageable. And the only other alternative that we have here is to build an index for every possible filter outcome, which is not really an option because it's just simply not realistic to build that many indexes.
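
A minimal sketch of pre-filtering, under the assumption that the index is small enough to hold in plain Python lists. The function names, toy vectors, and the "lang" field are hypothetical. Note that every surviving vector is scored, which is the brute-force cost described above.

```python
import math


def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prefilter_search(query, vectors, metadata, condition, top_k):
    # 1. Filter first: keep only positions whose metadata satisfies the condition.
    candidates = [i for i, m in enumerate(metadata) if condition(m)]
    # 2. Exhaustive k-nearest-neighbor search over every surviving vector.
    ranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:top_k]


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [0.8, 0.3]]
metadata = [{"lang": "en"}, {"lang": "it"}, {"lang": "en"}, {"lang": "en"}, {"lang": "it"}]

hits = prefilter_search([1.0, 0.0], vectors, metadata, lambda m: m["lang"] == "en", top_k=2)
print(hits)
```

The results are exact for the filtered subset, but the sort touches every candidate, so the cost grows with the number of vectors that pass the filter.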

So, with pre-filtering, we have good accuracy, but it's very slow. Now, post-filtering is, of course, slightly different. So, in this case, we start with our vector index. Now, we can perform our approximate nearest neighbor search because we have the full index. We haven't filtered anything yet. And that returns the top k vectors that we want.

So, say we want 10 vectors at this point. And then what we do is we find all the vectors through our metadata index that satisfy whatever metadata condition we have set. And then we apply the filter to those top k vectors, so 10 vectors maybe. And, of course, at this point, what we are doing is we're reducing the number of vectors that we get out.

So, we don't actually get 10 vectors. We, for example, could get four vectors. And in the worst case scenario, that filter could rule out all of the vectors that we've returned. And in the end, we return nothing, even when in the index there could be some relevant vectors, which is obviously not very ideal.
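
The shrinking result set can be sketched as follows. To keep the example self-contained, an exact dot-product ranking stands in for the approximate nearest neighbor search; in a real system that step would be an ANN index. All names and data here are invented for illustration.

```python
def top_k_search(query, vectors, top_k):
    # Stand-in for the ANN step: exact ranking by dot product.
    def score(i):
        return sum(q * v for q, v in zip(query, vectors[i]))

    return sorted(range(len(vectors)), key=score, reverse=True)[:top_k]


def postfilter_search(query, vectors, metadata, condition, top_k):
    # 1. Search the full, unfiltered index.
    hits = top_k_search(query, vectors, top_k)
    # 2. Only then drop the hits whose metadata fails the condition.
    return [i for i in hits if condition(metadata[i])]


vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
metadata = [{"lang": "it"}, {"lang": "it"}, {"lang": "en"}, {"lang": "en"}, {"lang": "en"}]


def is_english(meta):
    return meta["lang"] == "en"


three = postfilter_search([1.0, 0.0], vectors, metadata, is_english, top_k=3)
two = postfilter_search([1.0, 0.0], vectors, metadata, is_english, top_k=2)
print(three, two)
```

With top_k=3 only one English vector survives the filter, and with top_k=2 nothing survives, even though the index contains three English vectors.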

We can try and eliminate this problem by just increasing k a lot. So, of course, if we use a low k value, the chance of all of them being excluded when we apply our filter post-search is reasonably high. But if we increase k up to 1 million, it's much lower.

But the only problem with that is that our search becomes very slow. And the more we increase k to eliminate the problem, the slower it gets. So, in this case, we have unreliable accuracy or performance. But it is faster, unless we increase k. So, now let's introduce single-stage filtering by Pinecone.

Now, we are going to go through some code and test this. But first, I just want to introduce what it is at a high level. So, it's a new filter built by Pinecone. And at a high level, it works by merging the vector and metadata indexes. And it allows us to filter and then do an approximate nearest neighbor search.

So, what we get there is the accuracy of pre-filtering. And at the same time, the search speed is often faster than post-filtering as well. So, we really do get the best of both with this new filter. But let's go and actually try it out. Okay. So, we're going to be using Pinecone here.

And all I've done here is imported Pinecone, imported JSON, and I've imported my data here. So, this data, I've already uploaded, or upserted, to Pinecone. And what it is, is just the SQuAD dataset in both English and Italian. Now, in there, we have a few different items.

So, we have the record ID, we have the text, although I've just stored this locally. We have the vector, which has been upserted to Pinecone. And then we also have the metadata, which is with Pinecone as well. Now, if we take a look at what we have in the metadata, we see that we have the language.

So, we have either English or Italian. And then we also have the topic. Now, what I want to do is just test the new filtering. So, we're going to be filtering based on language, topic. And we also have another metadata item here, which I don't have locally, which is just a randomly generated date.

So, we can have a look at using some of the greater than or equal to, less than or equal to filters that we can use in Pinecone. So, the first thing I'm going to do is initialize my connection to Pinecone. So, I write pinecone.init. I need to pass my API key, which I've loaded above, and also the environment that I am working in.

Now, of course, this will be different and depend on which environment you are using. So, I've initialized there, and what I can do is I can now create a direct connection to a specific index within my Pinecone environment. Now, I'm going to be connecting to one that I've already made, which is called squad test.

And now what I'll be able to do is use this index object to perform my queries. So, we're going to be performing a vector search here. So, what we need first is a query vector to perform our search with. Now, I use the sentence transformers library to encode the already indexed vectors.

So, what we're going to do is use the same model to encode our query vector. So, we import SentenceTransformer, and our embedding model is a SentenceTransformer. And I used the stsb-xlm-r-multilingual model. So, I will need to download this. Okay, so that is downloading now. And then what we want to do is create our query vector.

So, I'm going to assign it to xq, and all I need to do is write embedder.encode, and then I pass in the query that I would like to perform. So, in this case, I'm going to search for context in our dataset, which mentions something along the lines of early engineering courses provided by American universities in the 1870s.

So, I will execute that. And note that we're using a multilingual model here. So, we should find that we will return both English and Italian results, but both of them should be something similar to this topic. So, we will return our results. We just write index.query, and what we can also do before we even do that is we'll just convert xq into the format that we need.

So, like this. And then in here, we just pass xq, and I'll say that I want my top k value to be three. Now, remember, if we were using post-filtering here, if we set a top k value of three, we would probably return fewer than three. So, with post-filtering, we would want to set something stupidly high here, just to get maybe three samples, if we're lucky.

But, as we're using single-stage filtering, we only need to set top k equal to three. So, we'll execute that, and let's return the results. And you see here that we get our IDs. So, we get this ID, one, two, and three. So, what we now want to do is we want to map that back to the data that we have stored locally.
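
The steps so far (initialize, connect, encode, query) can be gathered into one function. This is a sketch under several assumptions: it uses the older pinecone client interface that the video describes (pinecone.init, pinecone.Index, and a query call that takes a list of query vectors), which newer client releases have replaced; the index name spelling "squad-test" is a guess from the spoken name; and the function name and parameters are made up for the example. The imports sit inside the function so that defining it does not require either package or any network access.

```python
def query_squad_index(api_key, environment, query_text, top_k=3, metadata_filter=None):
    """Encode query_text and run it against the index (legacy client sketch)."""
    # Imported lazily: both are third-party packages and need network access to use.
    import pinecone
    from sentence_transformers import SentenceTransformer

    # Initialize the connection and connect to an existing index.
    pinecone.init(api_key=api_key, environment=environment)
    index = pinecone.Index("squad-test")  # assumed spelling of the index name

    # Use the same model that encoded the indexed vectors.
    embedder = SentenceTransformer("stsb-xlm-r-multilingual")
    # encode() returns a NumPy array; tolist() gives plain nested lists.
    xq = embedder.encode([query_text]).tolist()

    kwargs = {"queries": xq, "top_k": top_k}
    if metadata_filter is not None:
        kwargs["filter"] = metadata_filter
    return index.query(**kwargs)
```

Calling it needs a real API key and environment name, so no call is shown here. If you are on a current client version, check its documentation for the equivalent calls rather than relying on this shape.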

To do that, we're going to write ids equals i['id'] for i in the results. So, we're just getting these IDs here. So, we go into results. We need to enter the results key. We want to access the first position in that list, and then, in there, we want to access the matches.

And then, from there, we'll print IDs, see what we get. Okay, we get those three IDs. And now, what we want to do is use the data that I imported up here. Just here. And we're going to use that to print out whatever it is that these IDs are referring to.
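
Here is that ID extraction as a small sketch. The response below is a hand-written stand-in with the nesting described above (a results key, a list, then matches); the IDs and scores are invented, and a real response would carry more fields.

```python
# Hypothetical response with the nesting described in the walkthrough.
results = {
    "results": [
        {
            "matches": [
                {"id": "doc-17", "score": 0.61},
                {"id": "doc-42", "score": 0.58},
                {"id": "doc-08", "score": 0.55},
            ]
        }
    ]
}

# One query was sent, so its matches sit at position 0 of the results list.
ids = [match["id"] for match in results["results"][0]["matches"]]
print(ids)
```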

Now, what we have in our data at the moment is a big list, which is not that useful. So, it would be more useful if we just reformat that into a dictionary. So, I'll do that quickly. We're just going to write get_sample is equal to a dictionary with x['id'] as the key. And then, in here, we will store our context and metadata.

So, context and metadata. We don't need to store the vector in here, because we can't read that anyway. So, it's not that useful for us. Let me say, for x in data. Okay, so it's not context. Let me come up here and have a look at what we have.

I think it's text. Yeah, okay. So, let's change that to text here and here. Okay, so now we can do, for i in ids, I want to get the sample. So, get_sample[i]. And we'll just print that out. So, we see here that the first one we get is Italian.
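
The reformatting step can be sketched like this. The two records are invented placeholders with the fields mentioned above (id, text, vector, metadata); the dictionary name follows the one used in the walkthrough, written in snake case.

```python
# Hypothetical local records; real ones would hold the full SQuAD contexts.
data = [
    {"id": "doc-17", "text": "first context", "vector": [0.1, 0.2],
     "metadata": {"lang": "it", "title": "Example title A"}},
    {"id": "doc-42", "text": "second context", "vector": [0.3, 0.4],
     "metadata": {"lang": "en", "title": "Example title B"}},
]

# Key each record by ID; the vector is left out because it is not human-readable.
get_sample = {
    x["id"]: {"text": x["text"], "metadata": x["metadata"]}
    for x in data
}

ids = ["doc-42", "doc-17"]
for i in ids:
    print(get_sample[i])
```

A dictionary makes each lookup a direct key access instead of a scan through the whole list.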

And the translation for this is something to do with the College of Engineering was instituted in 1920. So, we have college, engineering, that's good. And then we also have something along the lines of the College of Science from the 1870s. So, generally, this looks pretty relevant, I think. And then, down here, we have, this is Italian again, but we also have the English translation of this here as well.

So, we can see straight away, School of Engineering, Public Engineering School, founded 1891, and it offered engineering degrees as early as 1873. So, that's, again, pretty relevant. Now, I don't understand Italian. And so, my first filter here would probably be, okay, I only want to return the English results.

So, let's go ahead and do that. So, I'm going to say results equals, and let's just copy what we had up here. So, we're not repeating ourselves. We just want to take the index query. We can include result. No, we don't need it. Let's just take that. And all we need to do now is add our filter.

So, we just write filter. And then, in here, we want to write our condition. So, we have our metadata, and we have our language. And we want to say that this must be equal to English, so we use $eq with en, which is English. So, we get our results, and we're going to want to do the same thing again.

So, we want to get our IDs, and we want to print those out. And there we go. So, now, we're just getting English results. Now, that was pretty fast. So, I think what is quite useful is to see how fast those two searches were. Now, obviously, we're getting pretty relevant results here, where, again, we're returning three results, even with our filter applied.

So, that's good. So, it seems like we're getting the accuracy of pre-filtering here. And let's have a look at the speed difference between the two approaches. Now, we shouldn't see anything particularly major, because this is a very small index. We only have, I think, 40,000 vectors here. So, we won't see anything significant.

But at least we can check that we're not getting anything slow. So, let's have a look. And you see here that we're actually getting a slightly faster response when we filter. And this is typical with Pinecone single-stage filtering. When we add a filter, usually, we'll actually get faster results, which is pretty insane.
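
One way to compare the two searches is a small timing helper. This is a sketch, not the measurement method used in the video; the helper name is made up, and it takes the median of a few runs because a single run is dominated by network jitter.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Return the median wall-clock time of fn() in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Stand-in workload so the example runs anywhere; in practice fn would wrap
# the unfiltered query in one call and the filtered query in another.
elapsed_ms = time_call(lambda: sum(range(10_000)))
print(f"{elapsed_ms:.3f} ms")
```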

So, not only are we getting good speed, like post-filtering, but we're actually making our search faster by adding a filter, which neither post-filtering nor pre-filtering can do. And again, at the same time, we're still getting that accuracy of pre-filtering. So, this is, in my opinion, pretty impressive.

Now, we might also want to add another filter. So, at the moment, we're just adding one filter, which is fine. It works. But let's say I look at my results. And I know this is hard to read. But in here, we have the topic. We have University of Kansas.

OK, fine. Maybe I'm not interested in the University of Kansas. So, how about here? We have University of Notre Dame. Let's say I'm not even interested in these guys either. Institute of Technology, let's say, OK, yeah, we can keep them. That's fine. So, I want to say, OK, I want everything that is one, in English, and two, not from the University of Kansas and not from the University of Notre Dame.

So, to do that, I need to add another condition to my filter. So, to do that, all I need to do is say topic is not in this time. So, we're going to say not in. And then we pass a list here. So, a list of what we don't want to see, which was the University of Notre Dame and also the University of Kansas.

OK, so let's add those two. And let's see what we get. So, again, it seemed pretty fast. And we're getting University of Kansas here. So, that must mean that I have written something wrong. So, I think here, the topic filter in my Pinecone index is actually maybe called title.

Let's see. And this is also wrong. So, let's correct that. OK, so now we're getting something different. So, yes, this should be title in reality. So, Institute of Technology. Institute of Technology. And where is our other one? Institute of Technology here. Now, we're not returning University of Kansas and we're not returning University of Notre Dame, which is what we wanted.
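
The combined filter can be written as a dictionary using the MongoDB-style operators that Pinecone's metadata filters use ($eq, $nin, and so on). The field name "lang" and the exact title strings are assumptions, since the walkthrough does not spell them out. The small matches function is a local, illustrative evaluator for checking the filter's logic against sample records; it is not Pinecone's implementation.

```python
import operator

# Local evaluator for a handful of comparison operators (illustration only).
OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, flt):
    # Every condition on every field must hold (implicit AND).
    # Assumes each filtered field is present in the metadata record.
    return all(
        OPS[op](metadata[field], operand)
        for field, conditions in flt.items()
        for op, operand in conditions.items()
    )


# English only, and not from either of the two excluded titles.
flt = {
    "lang": {"$eq": "en"},  # field name assumed
    "title": {"$nin": ["University of Notre Dame", "University of Kansas"]},
}

print(matches({"lang": "en", "title": "Institute of Technology"}, flt))
print(matches({"lang": "en", "title": "University of Kansas"}, flt))
```

A misspelled field name, as with topic versus title above, is easy to miss because the query still runs and still returns results.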

Now, there was also the date filter that I wanted to show you as well. So, we don't only need to filter based on strings, we can also filter based on numeric values, such as dates. And for me to show you this, I think it's best if we ... It would also be better if we include our metadata here as well.

So, we can just see it directly from our results. So, we know that we're returning relevant text. So, now let's just have a look at the metadata. So, I'm going to include metadata there. Let's just see what we get. So, we see now that we actually include the metadata in our results, which is also pretty cool.

Now, we have a date, which is just a numeric value here. It's just something very simple. It's just randomly generated. There's no actual relation between the date and this record. It's completely random. And we can see, okay, we have a date from 2016, 2008, and we also have 2020 here as well.

Now, the first thing that I might want to do is say, okay, I want to return only the more recent date. So, let's say, okay, we add, we keep all of that, all the other filters in there. And we might say, okay, but we also want date to be greater than or equal to, let's say, what do we have here?

We have what is the most recent. So, we have this 2021. Let's say we want to go for ones that are, let's say, 2018 onwards for now. So, 2018, 01, 01. Okay. So, the very first day of 2018. Let's search and see what we get. And we can see, yep, it's definitely filtering correctly there.

Now, let's have a look at what the search time is for that. So, adding quite a few filter conditions here. So, let's just see what we get. We should also exclude that. And you see, it's actually slightly faster again, which is, again, it's pretty cool. But like I said, it's a small data set.

When we do this on bigger data sets, the difference can be huge. Now, what we can also do is we can actually add another condition within our date here. So, we can say, okay, we want it to be greater than or equal to 2018. But let's say we want to search for records only in 2018.

So, we might also say, okay, we want it to be greater than or equal to the first day of 2018. We also want it to be less than or equal to the very last day in 2018. So, 2018, 12, 31. And we will filter. And we see that now we're only returning records from 2018.
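
A sketch of that date range as a filter dictionary, using the $gte and $lte operators. It assumes the numeric date is stored as an integer in YYYYMMDD form and that the field is called "date"; neither detail is confirmed in the walkthrough, and the helper name is made up.

```python
def year_filter(year, field="date"):
    # Assumes dates are stored as YYYYMMDD integers, e.g. 20180101.
    first_day = year * 10_000 + 101    # January 1
    last_day = year * 10_000 + 1231    # December 31
    # Two conditions on the same field: both must hold.
    return {field: {"$gte": first_day, "$lte": last_day}}


only_2018 = year_filter(2018)
print(only_2018)
```

The returned dictionary can be merged with the other conditions, since each field gets its own key in the filter.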

So, again, super cool, and I think an incredibly useful functionality for vector similarity search. Now, we were just using a very small data set there. So, I couldn't really show you how impressive the speedup can be when we're applying filters. But I do have this other index. Now, I'm not going to go through just coding everything, because it's pretty straightforward.

We have an index here, which is 1.2 million vectors, and it has a single metadata field in there, which I've called Tag1. And that's just a randomly generated number or integer from 0 to 100. So, we, of course, initialize the connection to our index in the first cell up here.

And then over here, I'm just creating a random query vector. So, first, this here is our unfiltered search. So, we get this 79.2 milliseconds. Now, again, most of this is network latency, rather than the search time in the actual index. But we will see the search time decrease pretty dramatically here.

So, first, we'll say, okay, we want Tag1 to be greater than 30. So, we're going from 0 to 100. So, we're roughly removing probably about 30% of the vectors from our search. And we can see, okay, we just shaved off 8 milliseconds, which is impressive. And then we take that even further.

So, we say, okay, we want it greater than 70. So, now, we're shaving off around 70% of our vectors. And our search time goes down to 56.6 milliseconds. We take it even further. So, about 90% here, we go down to 54 milliseconds. And then here, I'm using the equals sign.

So, I'm only searching for about 1% of the index. And it goes down to 51.6 milliseconds. So, incredibly impressive speed up there. And this is kind of what it looks like. So, we have the Tag1 GT value or greater than value on the left. And as we increase that up this way, our time, our search time in milliseconds goes down.
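
The rough percentages quoted above can be checked with a little arithmetic. This sketch assumes Tag1 takes integer values from 0 to 100 inclusive with equal probability; the field name "tag1", the threshold of 90 for the "about 90%" case, and the value 50 for the equality case are assumptions for illustration.

```python
values = range(0, 101)  # assumed: integers 0..100 inclusive, uniformly distributed


def kept_fraction(predicate):
    # Share of the index that still has to be searched after filtering.
    return sum(1 for v in values if predicate(v)) / len(values)


# Filter dictionaries paired with the equivalent local predicate.
cases = [
    ({"tag1": {"$gt": 30}}, lambda v: v > 30),
    ({"tag1": {"$gt": 70}}, lambda v: v > 70),
    ({"tag1": {"$gt": 90}}, lambda v: v > 90),
    ({"tag1": {"$eq": 50}}, lambda v: v == 50),
]

for flt, predicate in cases:
    kept = kept_fraction(predicate)
    print(flt, f"keeps {kept:.1%}, removes {1 - kept:.1%}")
```

Under these assumptions the four filters remove roughly 31%, 70%, 90%, and 99% of the vectors, which lines up with the figures quoted above.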

Now, it is a little bit bumpy. It goes up and down a lot. I've tried to showcase that in this graph. But the trend is quite clearly downwards. So, the more we filter, the faster our search, which is incredible. Now, that's it for this video covering pre-filtering, post-filtering, and Pinecone's new single-stage filtering.

I hope this has been useful and insightful. If you are interested in testing Pinecone out yourself, there is a link to Pinecone's website in the description. But we'll leave it there for now. Thank you very much for watching. And I'll see you in the next one.
