back to indexImplementing Filters in the New Haystack Doc Store
Chapters
0:0 Intro
2:41 Filtering
5:36 Testing Existing Filter Utils
7:57 Making Sense of Filter Utils
10:35 Writing the First Filter
16:26 First Working Filter
18:24 Testing New Filters
21:27 Implementing in the Doc Store
24:2 Testing Pipeline Filters
27:11 Final Issue and Outro
00:00:00.000 |
Today we're going to have a look at the current state of the PR for the Haystack Pinecone 00:00:07.680 |
document store and we're going to try and work through and maybe solve all of the issues 00:00:15.120 |
although I tend to be quite optimistic so maybe that's not going to be realistic but we're going 00:00:21.680 |
to give it a go and we're going to go through and try and figure out what we need to do to actually 00:00:27.280 |
merge this into the Haystack library. So we'll just scroll down and 00:00:35.280 |
so I have Bogdan who has given a lot of feedback on this and the first thing or okay not the first 00:00:46.880 |
thing because I need to figure this out as well is make sure typing is compliant with mypy. 00:00:54.480 |
I mean for most of it I was just taking methods from other parts of the framework 00:01:00.560 |
so I'm not sure if there's been some updates since I started doing that so my types are just 00:01:07.120 |
kind of out of date or if it's some other types I've kind of put in there not realized they were 00:01:11.600 |
different I'm not sure so we'll figure that out later it's fine. So one big one is to make use 00:01:23.520 |
of the filter utils for converting the filters at the moment it's just a method inside the 00:01:30.320 |
document store called I think build filter clause because this handle is converting from a 00:01:37.520 |
from Haystack syntax of your so Haystack filter syntax to Pinecone filter syntax 00:01:47.760 |
so we need to just solve that and what do you think so it's just a load of these like little 00:01:54.480 |
ones that we're going to go through so for the so vector dim this is what makes me think I was 00:02:01.600 |
looking at some older version this has been deprecated now and now we're just using vector 00:02:08.640 |
or embedding dim so let's fix that so anywhere it says vector dim I just need to remove that 00:02:26.960 |
and just replace all of these so let's replace all of those so embedding dim 00:02:37.840 |
okay it's number one easy so this is the more difficult thing I think we need to to figure out 00:02:46.720 |
so there's this filter utils script and when I looked it was kind of confusing before so I'm 00:02:56.960 |
going to let's take a look now and see if it see if we can figure out so filter utils I think it's 00:03:05.120 |
here so it's still in the document store directory and we have this logical filter clause come down 00:03:15.200 |
here and ah okay okay this yeah this is why it's so confusing to me so we have convert to elastic 00:03:25.760 |
search convert to sql and they're all just empty so maybe I just need to I'm just going to try and 00:03:34.320 |
add the convert to pinecone method in here like pretty much as it is already 00:03:41.040 |
and hopefully I don't know figure that out but at the same time we have this logical filter clause 00:03:51.440 |
object and I'm not sure what that looks like exactly so if we come here class that is able 00:04:00.640 |
to pass a filter and convert it into the format that the underlying databases of the document 00:04:05.840 |
stores require filters are defined as nested dictionaries keys of dictionaries can be logical 00:04:13.840 |
operators comparison operator and so on okay so the logical filter clause is here 00:04:28.400 |
then we have the comparison operators as well okay 00:04:31.280 |
logical operator keys might take a dictionary metadata field names 00:04:38.080 |
so at the moment we are converting these correctly as far as I know 00:04:47.760 |
but I'm just not sure how to integrate that with this filter utils so let's have a look at 00:04:58.080 |
another document store let's have a look at we aviate and see how they use use that so it's 00:05:07.360 |
filter utils logical filter clause so they only use a logical filter clause here they don't use 00:05:14.880 |
the other one as order comparison so let's have a look logical filter clause see where it's being 00:05:22.080 |
used so if there's filters this is just being called so logical filter clause pass filters 00:05:29.280 |
convert to okay kind of need to see what this is so if we if we take 00:05:39.760 |
okay let's take this and we'll put it into a document into another file 00:05:50.800 |
okay let me create a file here and host like test 00:05:58.560 |
okay so I'm going to test this but we need to import it so where is 00:06:07.680 |
whereas we aviate importing this from here so okay let's take that 00:06:18.240 |
import it here should work I'll select the kernel haystack 00:06:31.760 |
take a filter from the filter test I created before 00:06:51.760 |
or operation wow super weird um convert to elastic convert to elastic search 00:07:10.000 |
right so I need to figure this out a little bit 00:07:41.520 |
so parse here okay not operation parse value okay not and 00:07:55.280 |
so it's like we need to modify the current filter so so we need to modify this to not 00:08:08.240 |
take a look directly at the keys but instead look at the objects 00:08:15.520 |
created over here so this and operation or or so on we also need to figure out what those are so 00:08:26.080 |
they in here they're actually in this script and operation 00:08:42.560 |
ah ah cool okay so this is where that is happening 00:08:54.080 |
so condition convert to elastic search where's condition coming from here um condition 00:09:03.200 |
self-lock okay it's just it's stored in there from the logical filter clause 00:09:11.680 |
okay so you initialize this logical filter clause which consumes 00:09:23.840 |
which consumes where is it um up here somewhere conditions here okay 00:09:32.640 |
self-conditions are added to the logical uh filter what's it called again logical filter clause 00:09:42.800 |
and then later on when we actually use this and operation 00:09:50.640 |
that is inheriting that value from the logical filter clause so 00:09:55.360 |
the condition self.conditions is already in here 00:09:58.320 |
okay and then in here we need to specify how to deal with each of those items 00:10:14.660 |
but then it's a condition convert to we aviate it's a condition in the 00:10:40.320 |
let's just take let's take the we aviate one if we do something 00:11:00.240 |
and the operator is not and is that what it is okay so the operator in this case is 00:11:09.120 |
and is that right can't remember now so or um let me filter this 00:11:18.480 |
where is um okay yeah it's and so and then the operands what is the operands here 00:11:34.320 |
all right let's see what that returns to us so first I think we also need to pass convert to 00:11:51.920 |
convert off in the original or in the first um class so logical filter clause 00:12:17.280 |
and we just pass I'm not going to write the uh description in there yet 00:12:21.680 |
so let's go ahead and I'm just going to open store that maybe I can also 00:12:29.920 |
let me add it for the or as well so or operation okay oh um no no not that one let me copy that 00:12:43.120 |
out okay where's your operation oh here so convert to pine cone again here this time we are using or 00:12:59.280 |
I imagine in here maybe I need to create a list 00:13:06.000 |
so let's let's just test it first see what we get 00:13:10.960 |
so install now let's go back and actually test that so um restart 00:13:21.280 |
yeah run this should we should now have so we get the filter date um don't need to do that again 00:13:36.880 |
and then from that we should be able to so if we just convert to we aviate for example 00:13:43.120 |
we should see something uh okay what happened there did I interrupt 00:13:55.120 |
pass we get this filter utils or operation convert to we aviate let's see what happens 00:14:06.160 |
okay cool and then if we do same again but convert to pine cone attribute error 00:14:17.360 |
um okay so I need to add convert pine cone to a lot of things here I think 00:14:30.000 |
okay um let's go back and do that then now you just leave it as we aviate we're going to reform 00:14:37.360 |
everything I just want to make sure that everything is in place and I can just modify things in there 00:14:42.720 |
the thing I think realistically the conversion from to hate to pine cone should be super 00:14:51.680 |
straightforward because we don't have all these different syntaxes ours is very aligned to 00:15:00.240 |
pine cone so it shouldn't really be an issue or pine cones is very aligned to haystacks 00:15:07.200 |
and so on I'm just going to skip forward to this in this bit okay so that should be enough 00:15:15.280 |
for that and we can first need to put console again 00:15:24.880 |
and then we'll just rerun this and make sure that we're actually returning some sort of 00:15:31.360 |
dictionary or filter let's make sure it's everything is in place so restart 00:15:48.720 |
okay so some issues in there let's try and figure that out okay so we've got this now so we have like 00:15:57.760 |
a really basic dictionary coming through so all you really need to do now is modify the code that 00:16:05.840 |
we've added to filter utils to align to the logic in here so I think what we'll do is maybe have a 00:16:15.600 |
few example filters and kind of figure out how we're going to write that in there so let's start 00:16:24.000 |
with that okay so we've got this working now at least for this one filter query which is pretty 00:16:34.000 |
cool a lot easier than I was expecting so that's really good and this sort of logic of putting 00:16:43.040 |
things or separating everything out I think makes a lot more sense particularly if you consider 00:16:48.960 |
how complex some of these other querying syntax must get so before I was thinking oh this seems 00:16:58.080 |
a really overly complicated way of doing things but now I think I'm convinced this is probably 00:17:04.000 |
one of the best ways to do it so the only one that we don't have is this not operation so I 00:17:15.680 |
don't really know what to do with that yet so I'm going to for now I'm just going to leave it and 00:17:20.800 |
come back to it later maybe just add like a not implemented error or something like that the only 00:17:32.160 |
reason I say that is it would be great to say if you say not this maybe you could invert it by just 00:17:41.200 |
looking for all of the opposite values but to do that you'd have to pull out all of the metadata 00:17:46.480 |
from a pine cone and I don't think it's possible to do that but something I need to check so for 00:17:55.440 |
now I'm just going to ignore this one we'll deal with it later so yeah let's that looks good I 00:18:05.360 |
think the only thing now I want to test it on some other filters this is just one filter maybe 00:18:11.680 |
it doesn't work on the others so let me find some of those so I think which notebook were those 00:18:21.120 |
testing filters and yeah we come down here we have a few other filters as well so let's try those 00:18:29.600 |
hopefully it works so filter down here we pass them and then we convert them to pine cone 00:18:54.400 |
let me just remove that so we see it straight away 00:18:59.760 |
okay so we have um we have a lot of stuff in here let me just remove this 00:19:09.040 |
okay um yeah I mean that looks looks right to me let's try another 00:19:23.360 |
there's a testing okay yeah uh we we just did that one so let's try this one 00:19:44.240 |
let's see what we get so convert to pine cone just this again 00:20:06.800 |
so I think I think we're probably ready to actually go ahead and test that in here so 00:20:15.040 |
let me take one of these um I'll take some of these filters I'm just going to apply directly 00:20:23.040 |
to pine cone to make sure it's actually working so to do this I have to create a load of vectors 00:20:30.400 |
and I don't want me to do that I just want to create like a couple of vectors so if I go to 00:20:40.640 |
pine cone demo come here it's okay so this one this one's easier 00:20:46.880 |
yeah so this one only does it with like six um so I'm gonna run through this and see if it works 00:20:56.160 |
so initialize I think we've already created this dictionary this document store um oh the other 00:21:04.960 |
thing we need to do is actually implement it in pine cone so that's important so in here at the 00:21:13.360 |
moment I'm going to go to filter I'm using this uh build filter clause I don't want that anymore 00:21:23.920 |
and then let's find where we're applying filters 00:21:30.800 |
okay let's have a look at right document see what I do 00:21:34.800 |
okay so here we're using self build filter clause so we need to change this to use 00:21:44.320 |
the same um logic as the other document source so so I look at um we aviate and I think 00:21:52.480 |
okay uh ignore that we aviate filters see where they okay here this is what we want to do 00:22:05.040 |
so the filter this just use this nothing more than that so it's pretty simple 00:22:13.440 |
so the filters are just going to be converted into that so but they we use filter dict here but 00:22:26.960 |
and I'm just going to search for this that will tell us where else we are 00:22:36.160 |
using that so search for this oh it's only the only here 00:22:41.760 |
oh really okay so that's that's it it's the only bit we need to change we also need to import this 00:22:48.880 |
so it's what is it logical filter okay import this 00:23:09.840 |
and then we can test it and hopefully it will work okay so come over here pine cone demo 00:23:26.320 |
and so we've already got a document store here so I don't think we need to 00:23:33.280 |
write these documents again yeah can ignore that dense passage retriever we do need 00:23:42.640 |
we don't need to update the embedding so let me just move this to a new cell 00:23:56.960 |
and then okay let's make sure it's working and there's answers and then we will try and apply 00:24:05.360 |
the metadata filter so filters just something really simple here and then maybe we can like 00:24:15.760 |
modify them a little bit to see what else you can get okay it looks good just going to run that 00:24:26.880 |
just so I can see maybe some of the metadata we have here it's just this one again I want to 00:24:34.320 |
return another name so I can add it like a crane or statement and see what that looks like but 00:24:41.680 |
for now it's fine okay let's run it okay cool so it looks like filtering actually worked 00:24:55.280 |
we're only returning two here because based on that filter there's only 00:24:59.920 |
two items in the out of the six so that that's really cool let's get all the documents so 00:25:08.640 |
and let's get some other names so we have this one 00:25:17.360 |
let's add that into our into our filter so we're going to go we're going to use all here so 00:25:43.920 |
okay let's run that I think that's in the right syntax 00:25:48.240 |
there we go so now we're returning more samples we're returning all of them now because they're 00:25:55.120 |
either from from this one or from this one which is really cool and we can we can double check that 00:26:04.800 |
is the case so I can't see anything there I return all documents and we'll go forward for duck in 00:26:17.440 |
is it duck can I do let's just check that this works no okay so 00:26:38.400 |
so let's just print all those out to make sure that is the case and just print 00:27:00.560 |
of the issues on the pull request I think there's one other maybe one other big one at least 00:27:09.120 |
okay so the typing with mypy so we can see that if we I'm not going to go through that now 00:27:19.040 |
but I can at least show you what that error is so 00:27:29.280 |
and we can see in the test with mypy there's a load of 00:27:35.280 |
typing issues like here all of these all typing issues so that's something to fix as well but 00:27:44.560 |
we're not going to go through that now I think it's not that interesting yeah I think that's it 00:27:50.800 |
for this video I hope it's been useful to just kind of go through that and like interesting to 00:27:57.760 |
see how haystack actually works and particularly haystack filtering works in the in the library 00:28:05.680 |
so yeah thank you very much for watching I hope it's been useful and I will see you in the next